The Exascale Computing Project (ECP) strives to combine two key technologies, LLVM and continuous integration (CI), to ensure that current and future compilers are stable and perform well on high-performance computing (HPC) and exascale computer systems. The proliferation of new machine architectures has made continuous software testing and verification (hence the “continuous” in CI) an essential part of supercomputing at the US Department of Energy (DOE).
Valentin Clement, a software engineer at Oak Ridge National Laboratory and part of the team working to include LLVM in the ECP CI test and verification framework, notes: “We are working to add CI for architectures relevant to ECP. This facilitates collaboration, as each DOE lab currently has its own separate LLVM fork. Centralization in a single software fork for all laboratories avoids unnecessary effort. It also means that we can work to support GPUs from multiple vendors in a single LLVM fork, which benefits all DOE sites and the global LLVM community. Additionally, we can work to increase offload support on GPUs, as GPU support is not really well tested in the current version of LLVM upstream.”
The importance of the open source LLVM collection of compiler and toolchain technologies cannot be overstated. Compilers that generate correct, high-performing binaries are a decisive technology, which is why the ECP CI framework is so important to the US computing effort. Testing and verification are the only way to make sure a compiler is working. Johannes Doerfert, a researcher at Argonne National Laboratory, observes: “People don’t realize that most vendor compilers are based on LLVM. Collaboration enhancements, along with enhancements to LLVM, benefit all vendor products as well as the entire HPC community.”
Clement observes that “LLVM is a huge project. We are one of the first to try a CI fork of LLVM in ECP because CI represents a huge investment in resources. There are many interactions with the different facilities, as well as with several other ECP projects, such as SOLLVE and Flang, which also contribute to LLVM.”
The scale and impact of the CI task can be seen in Figure 1, which illustrates the breadth of languages and compilers encompassed by the ECP LLVM effort, each supporting large HPC applications on architectures relevant to ECP. All LLVM and CI work is part of the PROTEAS-TUNE effort managed by Jeffrey Vetter, who is responsible for the ECP LLVM efforts.
Main benefits of the ECP CI effort
Better GPU support is a key benefit of the CI effort. “For example,” notes Clement, “the integration effort can be very slow. It may take up to 1 year for some code changes to be incorporated into the major version of LLVM.”
Another key benefit of the ECP CI effort is that it gives the DOE and HPC communities the ability to focus on specific HPC needs. Many HPC and scientific codes are written in Fortran. The permissive LLVM licensing model means that the HPC community can work to build an efficient, GPU-capable, parallelizing Fortran compiler. This eliminates the reliance on commercial companies that no longer see significant commercial demand for a Fortran compiler.
This is not to say that vendors are ignoring ECP’s Fortran development efforts. Clement notes that while the Flang Fortran front end has received significant investment from DOE labs, vendors (e.g., NVIDIA, ARM, AMD) are also participating in and contributing to it.
Without powerful compilers capable of generating correct binary code for CPU and GPU architectures, the next generation of exascale supercomputers cannot see the light of day.
Given the ubiquity of LLVM-based compilers, the inclusion of LLVM in the ECP CI infrastructure is a necessary testing and validation step to ensure that reliable, high-performance compilers exist for every DOE supercomputer system. Thanks to the permissive terms of the LLVM license, CI also allows the HPC community to quickly identify and fix performance bugs and regressions if they occur at any ECP site, as well as to advance state-of-the-art compiler technology in a tested and verified manner.
The ECP CI infrastructure currently tests and verifies a large part of the Extreme-Scale Scientific Software Stack (E4S) software ecosystem. Users can easily download the E4S software ecosystem for evaluation and production use. More information is available on the E4S website.
Rob Farber is a global technology consultant and author with extensive experience in HPC and machine learning technology development, which he applies at national labs and commercial organizations.