After more than a decade of planning, America’s first exascale computer, Frontier, is expected to arrive at Oak Ridge National Laboratory (ORNL) later this year. Crossing this “1,000x” horizon required overcoming four major challenges: power demand, reliability, extreme parallelism and data movement.
Al Geist kicked off the ORNL Advanced Technology Section (ATS) webinar series last month by recapping the story of the march to exascale. As Geist described how the Frontier supercomputer addresses the four main exascale challenges, he disclosed key information about the first planned US exascale computer.
In particular, Frontier is on track to meet the 20 MW-per-exaflops power target set by DARPA in 2008, delivering more than 1.5 exaflops of peak performance within a 29 MW power envelope. Although the once-ambitious target was initially set for 2015, until fairly recently it was not clear whether the first crop of exascale supercomputers – expected to arrive in the 2021-2023 period – would make the cut. Indeed, it is not known whether all of them will, but it looks like Frontier, using HPE and AMD technologies, will.
Geist is a corporate fellow at ORNL, chief technology officer of the Oak Ridge Leadership Computing Facility (OLCF) and chief technology officer of the Exascale Computing Project. He is also one of the original developers of PVM (Parallel Virtual Machine) software, a de facto standard for heterogeneous distributed computing.
Geist began his talk with a review of the four main challenges, which were defined in 2008-2009, when exascale planning intensified within the Department of Energy (DOE) and its affiliated organizations.
“The four challenges also existed during the petascale regime, but in 2009 we felt there was a serious problem where we might not even be able to build an exascale system,” Geist said. “It wasn’t just that it would be expensive or that it would be difficult to program – it might just be impossible.”
Energy consumption was a major concern.
“Research papers published in 2008 predicted that an exaflop system would consume between 150 and 500 megawatts of power. And the vendors were given this ambitious goal of trying to get that number down to 20, which seems like a lot,” Geist said.
Then there was reliability: “The fear with the calculations we were doing back then was that failures would happen faster than you could checkpoint a job,” Geist said.
It was further believed that billion-way concurrency would be necessary.
“The question was, could there be more than a handful of applications, or even one, that could use that much parallelism?” Geist recalled. “In 2009, large-scale parallelism was typically less than 10,000 nodes. And the largest application we had on record used only about 100,000 nodes.”
The last challenge was a thorny one: data movement.
“We saw the whole problem with the memory wall: basically, the time it took to move data from memory to the processors and from the processors to storage was actually the biggest bottleneck in doing the computation; the computing time was insignificant,” Geist said. “The time it takes to move a byte is an order of magnitude longer than a floating point operation.”
Geist recalled the DARPA exascale computing report published in 2008 (edited by Peter Kogge). It included an in-depth analysis of what it would take to field a 1-exaflops system.
With the technologies of the day, it would have taken 1,000 MW to build a system from standard components, but extrapolating the then-current flops-per-watt trends put exascale at around 155 MW with a very optimized architecture, Geist relayed. A bare-bones configuration, stripping the memory of the strawman system down to just 16 gigabytes per node, resulted in a 69-70 MW footprint.
But even the aggressive 70 MW figure was too high: a machine that power-hungry was unlikely to get the necessary funding approvals.
“You might be wondering, where did [that 20 MW number] come from?” Geist asked. “Actually, it came from a totally non-technical assessment of what was possible. What was possible said: it’s going to take 150 MW. What we said was: we need it to be 20 [MW]. And why we said that is that [we asked] the DOE how much they were willing to pay for power over the life of a system. And the figure that came back from the head of the Office of Science at the time was that they weren’t prepared to pay more than $100 million over five years, so it’s a simple calculation [based on an average cost of $1 million per megawatt per year]. The 20 megawatts had nothing to do with what would be possible; it was just this stake that we drove into the ground.”
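The “simple calculation” Geist refers to can be reproduced in a few lines. This is only an illustrative sketch: the function and variable names are made up for this example, and the inputs are the figures quoted above ($100 million over five years, at a rough average of $1 million per megawatt per year). The second function applies the same assumed rate to the 155 MW estimate from the 2008 report.

```python
def affordable_power_mw(budget_usd, lifetime_years, usd_per_mw_year):
    """Average power (in MW) that a fixed electricity budget can sustain."""
    annual_budget_usd = budget_usd / lifetime_years
    return annual_budget_usd / usd_per_mw_year


def lifetime_power_cost_usd(power_mw, lifetime_years, usd_per_mw_year):
    """Electricity cost of running at power_mw for the system's whole lifetime."""
    return power_mw * usd_per_mw_year * lifetime_years


# Figures quoted by Geist; the $1M per MW-year rate is a rough average.
BUDGET_USD = 100e6
LIFETIME_YEARS = 5
USD_PER_MW_YEAR = 1e6

power_cap_mw = affordable_power_mw(BUDGET_USD, LIFETIME_YEARS, USD_PER_MW_YEAR)
cost_at_155_mw = lifetime_power_cost_usd(155, LIFETIME_YEARS, USD_PER_MW_YEAR)

print(f"Affordable power: {power_cap_mw:.0f} MW")
print(f"Five-year cost at 155 MW: ${cost_at_155_mw / 1e6:.0f} million")
```

Under these assumptions the budget caps the machine at 20 MW, while the 155 MW design would have cost roughly $775 million in electricity over the same five years.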
Jumping forward in the presentation (which is available to watch and linked at the end of this article), Geist traced the evolution of the machines at Oak Ridge: Titan to Summit to Frontier. The extreme concurrency challenge is addressed by Frontier’s fat-node approach, where GPUs hide the parallelism inside their pipelines.
“The number of nodes didn’t explode – it didn’t take a million nodes to get to Frontier,” Geist said. “Actually, the number of nodes is really quite small.”
Where Titan used a one-to-one GPU-to-CPU ratio, Summit implemented a three-to-one ratio. Frontier’s design takes that up a notch with a four-to-one GPU-to-CPU ratio.
“Ultimately what we found was that exascale didn’t need the exotic technology that came out of the 2008 report,” Geist said. “We didn’t need special architectures, we didn’t even need new programming paradigms. It turned out to be very gradual steps, not a giant leap like we thought it would take to get to Frontier.”
When it comes to power, Frontier is expected to exceed one and a half exaflops of peak performance while consuming no more than 29 megawatts. “It’s actually a little better than the 20 megawatts per exaflop that we had just driven into the ground as a stake, as opposed to what the technology might do,” Geist said. “But in fact, the vendors who worked on and designed Frontier did an incredible job of rising to it.”
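A quick check of that claim, using only the two headline numbers given here (variable names are illustrative). Because Frontier is described as delivering more than 1.5 exaflops at no more than 29 MW, the result is best read as an upper bound on power per exaflops, not a measured value.

```python
# Headline figures from the talk: at least 1.5 peak exaflops, at most 29 MW.
peak_exaflops = 1.5
power_mw = 29.0

mw_per_exaflops = power_mw / peak_exaflops
print(f"{mw_per_exaflops:.1f} MW per peak exaflops (target: 20)")
```

That works out to about 19.3 MW per peak exaflops, just under the 20 MW stake.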
Geist also attributed the energy-efficiency improvements to the DOE’s investment in the FastForward, DesignForward, and PathForward exascale development programs.
“It was [largely] due to these 10 years of DOE investment that [participating] vendors were actually able to reduce the amount of power their chips and memory needed, to be able to do an exaflop of computing for just 20 megawatts of power,” Geist said.
Geist’s energy efficiency calculations are based on peak (double-precision) flops, not Linpack. A conservatively estimated 70% computational efficiency (Rmax/Rpeak) yields 1,050 Linpack petaflops at 29 megawatts, or 36.2 gigaflops per watt. With a computational efficiency of 80%, the energy efficiency rises to 41.4 gigaflops per watt. (Today’s greenest supercomputers approach 30 gigaflops per watt.) Perlmutter, the new No. 5 system installed at Berkeley Lab – combining HPE, AMD, and Nvidia technologies and also using a four-to-one GPU-to-CPU ratio – achieved 25.50 gigaflops per watt. Also note that ORNL has stated that Frontier will be “over” 1.5 exaflops.
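The Linpack-based estimates above follow directly from the peak figure, an assumed Rmax/Rpeak ratio and the power envelope. The sketch below reruns that arithmetic; the function name is illustrative, and the 70% and 80% efficiencies are the article’s assumptions, not measured results.

```python
def linpack_gflops_per_watt(rpeak_exaflops, efficiency, power_mw):
    """Estimated Linpack energy efficiency in gigaflops per watt."""
    rmax_gflops = rpeak_exaflops * 1e9 * efficiency  # 1 exaflops = 1e9 gigaflops
    power_watts = power_mw * 1e6                     # 1 MW = 1e6 watts
    return rmax_gflops / power_watts


RPEAK_EXAFLOPS = 1.5  # stated as "over" 1.5, so treated as a floor
POWER_MW = 29.0       # stated as an upper limit

for efficiency in (0.70, 0.80):
    rmax_petaflops = RPEAK_EXAFLOPS * 1000 * efficiency
    gf_per_watt = linpack_gflops_per_watt(RPEAK_EXAFLOPS, efficiency, POWER_MW)
    print(f"{efficiency:.0%}: {rmax_petaflops:.0f} PF Linpack, {gf_per_watt:.1f} GF/W")
```

Higher computational efficiency means more Linpack flops from the same power draw, so the gigaflops-per-watt figure goes up as Rmax/Rpeak goes up.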
Geist also highlighted reliability improvements stemming from on-node flash memory, further enabled by vendors making their networks and system software much more adaptive. (Failing gracefully and restarting smoothly is key.)
With Frontier, the memory wall issue has been alleviated through the use of HBM on the GPUs. “Frontier has HBM (high-bandwidth memory) soldered directly onto the GPU,” Geist said. “So that increases the bandwidth by an order of magnitude. So that kicks the can down the road on this problem. And with GPUs, one of the things that comes with high bandwidth is that the latency can be quite high in these cases, but GPUs are actually very well suited, given their pipelines, to hide the latency.”
There is a lot more interesting material in Geist’s presentation, such as the cosmic ray problem, lessons learned from Summit and Sierra, and a question and answer session. Watch the full talk here: https://vimeo.com/562917879