ReDX Technologies

HPC Software Stack Build Automation

Job Description

Brief description:
In high-performance computing (HPC), efficiently managing and deploying software stacks across nodes and clusters is critical to achieving optimal performance. Manual software builds are time-consuming and error-prone; automating the process with robust tools improves reproducibility and productivity. ReDX is working on improving the application software ecosystem of a large-scale cluster based on Intel Xeon CPUs and NVIDIA A100 and H100 GPUs. The primary goal of this internship is to automate the build of a software stack for this supercomputer environment. The second goal is to streamline maintenance of the software stack by automating the identification and cleanup of outdated software versions. The student will gain practical experience in HPC build systems and automation techniques.

Current status:
The cluster currently supports multiple compilers, such as GCC and the Intel compiler, to ensure compatibility and performance for a variety of applications. A range of applications is available to support research in multiple scientific domains, for example Quantum ESPRESSO for electronic structure calculations and OpenFOAM for computational fluid dynamics (CFD) simulations. The cluster also offers several libraries, such as OpenMPI, LAPACK, and FFTW, as well as scientific frameworks, all pre-compiled and managed through modules. These applications and libraries are installed using EasyBuild, which automates software installation and management.

PFE project goals:
The goal of the project is to build an HPC software stack for scientific computing and data science on an operational cluster. It is divided into two key components. The first focuses on automating the build process with state-of-the-art tools such as EasyBuild or Spack wherever possible. The second focuses on parsing the modulefiles to retrieve the list of installed software, with versions, compiler information, and dependencies, and then generating a color-coded report that highlights which versions should be kept, removed, or reviewed.

Set up build automation:
  • Use scripts, Spack, and EasyBuild to configure automated software stack deployment.
  • Carry out a comparative study of these tools and the scripting approach.
  • Integrate the software modules to give users seamless access, resolve application dependencies, and avoid conflicts between modules.
  • Implement version control (e.g., Git) to track software changes.

Software stack inventory and filtering:
Automatically retrieve:
  • The list of software installed on the Toubkal supercomputer, via modulefiles.
  • The version of each software package.
  • The compiler version used for each software build (if available).
  • The dependencies on other modules.
Filter this inventory to keep only the last N versions of each software package and mark the outdated ones for removal by generating a color-coded report:
  • Red: software versions to be deleted.
  • White: software versions to be kept (N, or fewer if the software has fewer than N versions installed).
  • Orange: versions with deprecated or problematic dependencies.
To do this, the candidate should develop a script or software package that:
  • Parses the modulefiles on the supercomputer.
  • Takes as input the number N (how many versions to retain).
  • Outputs a filtered report.
The candidate should also implement a “restore mode” that allows the recovery of previously deleted software versions, particularly those removed for being outdated. This feature should let users selectively restore specific versions and regenerate the report to show how many software packages are being rescued from permanent deletion.

Required skills:
  • Familiarity with Linux and good software development skills with C/C++, GitHub, make/CMake, and other tools.
  • Ability to learn parallel programming models such as MPI, OpenMP, and CUDA.
  • Knowledge of and experience with Git and CI/CD tools is a plus.
  • Very good English proficiency, good organization, and use of project management tools such as ClickUp.

Planned training:
  • Linux fundamentals from the “Complete Linux Training Course” on the Udemy platform. The course will be taken in modules, starting with the fundamentals. It also has advanced modules that can be taken during the internship project or afterwards, during employment. A ReDX engineer will be available throughout for any support required.
  • Introduction to HPC: a course with hands-on exercises covering HPC systems and parallel programming using OpenMP and MPI, and GPU programming with OpenACC.
  • 1:1 sessions with a ReDX engineer, and with a senior expert as required for the specific tasks of the project.

Duration and other details:
  • Recommended period: 6 months.
  • Compensation: interns receive a monthly stipend, with the possibility of an end-of-internship performance bonus and of part-time or full-time employment.
  • Possibility to work with end customers.
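For candidates who want a feel for the inventory filtering task described above (keep the last N versions, flag the rest), here is a minimal Python sketch. It is an illustration only, not ReDX's implementation. It assumes the module names have already been collected as strings of the form `name/version-toolchain`, and it treats a hypothetical set of deprecated toolchains as a stand-in for "deprecated or problematic dependencies". The function names `version_key` and `classify`, the sample module names, and the rule that orange takes precedence over red and white are all assumptions made for the example.

```python
import re
from collections import defaultdict


def version_key(version):
    """Turn '3.3.10' into (3, 3, 10) so versions compare numerically.

    Assumption: only the numeric parts matter; suffixes such as 'rc1'
    would need extra handling in a real tool.
    """
    return tuple(int(part) for part in re.findall(r"\d+", version))


def classify(modules, keep, deprecated=()):
    """Map each module name to 'white' (keep), 'red' (delete) or 'orange' (review)."""
    by_software = defaultdict(list)
    for module in modules:
        # Assumed layout: name/version-toolchain, e.g. FFTW/3.3.10-GCC-12.2.0
        name, _, rest = module.partition("/")
        version, _, toolchain = rest.partition("-")
        by_software[name].append((version_key(version), module, toolchain))

    report = {}
    for entries in by_software.values():
        entries.sort(reverse=True)  # newest version first
        for rank, (_, module, toolchain) in enumerate(entries):
            if toolchain in deprecated:
                report[module] = "orange"  # assumption: review beats keep/delete
            elif rank < keep:
                report[module] = "white"
            else:
                report[module] = "red"
    return report


# Hypothetical inventory, for illustration only.
modules = [
    "FFTW/3.3.8-GCC-9.3.0",
    "FFTW/3.3.9-GCC-10.2.0",
    "FFTW/3.3.10-GCC-12.2.0",
    "OpenMPI/4.1.4-GCC-12.2.0",
]
report = classify(modules, keep=1, deprecated={"GCC-9.3.0"})
for module, colour in sorted(report.items()):
    print(f"{colour:7s} {module}")
```

A restore mode could build on the same report, for instance by recording which modules were marked red before deletion so that selected entries can be reinstated and the report regenerated. How that record is stored is left open here.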