Google may be buying heaven only knows how many GPUs to run HPC and AI workloads on its eponymous public cloud, and it may have recently spoken of its commitment to pushing the industry to innovate at the SoC level rather than designing its own compute engines, but the company still builds its own Tensor Processing Units, or TPUs for short, to support its TensorFlow machine learning framework and the applications it runs internally at Google, as well as offering that as a service to Google Cloud customers.
If you expected to get a big reveal of search engine giant and machine learning pioneer Google’s TPUv4 architecture at its Google I/O 2021 conference this week, you were undoubtedly, like us, deeply disappointed. In his two-hour opening keynote, which you can see here, Google CEO Sundar Pichai, who is also chief executive of Google’s parent company Alphabet, spoke only very briefly about the upcoming TPUv4 custom ASIC, which is designed by Google and presumably built by Taiwan Semiconductor Manufacturing Corp like every other advanced compute engine on Earth. As the name suggests, the TPUv4 chip is Google’s fourth generation of machine learning Bfloat processing beasts, which it pairs with host systems and a network to create what amounts to a custom supercomputer.
“It’s the fastest system we’ve ever deployed at Google – a historic milestone for us,” Pichai said in his keynote. “Previously, to get an exaflops, you needed to build a custom supercomputer. But we already have many of these deployed today. We will soon have dozens of TPUv4 pods in our data centers, many of which will be operating at or near 90 percent carbon-free energy. And our TPUv4 pods will be available to our cloud customers later this year. It is tremendously exciting to see this pace of innovation.”
First of all, no matter what Pichai says, what Google is building when it installs TPU pods in its data centers to run its own AI workloads, and to let others run theirs using Google Cloud and its AI Platform software stack as a service, is absolutely a custom supercomputer. It is the very definition of a custom supercomputer, in fact. We certainly have our “need more coffee” days here at The Next Platform, as our typos, broken sentences, and such attest, but we are running at full speed every day, and we do not, like Google, have a team of speechwriters and a pre-recorded event. Have more coffee, Sundar. We will send you a Starbucks card. Take a good sip and tell us all about the new TPUv4 chip. (In fact, Urs Hölzle, senior vice president of technical infrastructure at Google, promised us a briefing on TPUv4, and we are officially reminding him of that here, right now.)
Pichai didn’t say much about the TPUv4 architecture, but we can infer some things from the little he did say – and we won’t even need a TPU ASIC to do the inference.
This graphic literally blew us away with its sparseness – and its odd inaccuracy, unless you can figure out what Pichai must have meant, which we think we have. It is oversimplified to the point of ridiculousness, and given that this is supposed to be the 2021 Google I/O nerdfest, we are, as we said, a bit disappointed. In any event, the chart actually shows TPUv3 with five units of performance and TPUv4 with ten units of performance, which works out to precisely 2X the performance. But the label says “More than 2X faster,” which will confuse some people.
If this were an actual technical presentation, what Pichai might have said is that TPUv4 has twice as many compute units running at the same clock speed, thanks to a process shrink that allows each TPU socket to have twice as many compute elements – and presumably at least twice as much HBM2 memory and at least twice the aggregate memory bandwidth to balance it out. But Pichai didn’t say any of that.
But we are saying it, and that is what we think Google has done, in essence. And frankly, it is not that much of a stretch, technologically speaking, if that is all Google has done to move from TPUv3 to TPUv4. We hope there is more.
Perhaps a review is in order, and then we will get to what that “more than 2X faster” thing might mean. The prior two generations of TPUs, and the one that is being rolled out now, are scalar/vector processors with a bunch of 128×128 Bfloat16 matrix math engines hanging off them and some HBM2 memory feeding the math units.
Here is a table that summarizes the previous TPUv2 and TPUv3 units and the server cards that used them:
The base TPU core is a scalar/vector unit – which is what we call a processor these days, since Intel, AMD, Power, and Arm processors all have a mix of these – that has a Bfloat matrix math unit, which Google calls an MXU, hanging off it. There are two cores on a TPU chip. The MXU can handle 16,384 floating point operations in Bfloat format per clock, and in the TPUv2 core it drives 23 teraflops of Bfloat operations, which works out to 46 teraflops per chip. We have never known the clock speed, but we presume it is somewhere north of 1 GHz and south of 2 GHz, just like a GPU. Our guess for TPUv2 is 1.37 GHz, in fact, and for TPUv3 it is around 1.84 GHz. We did a deep dive into the TPUv2 and TPUv3 architectures here, and if you really want to get into them – as well as the intricacies of the Bfloat format, which is very clever – read that. The TPU wattage ratings were by no means low. We think TPUv2 was etched in 20 nanometer processes and TPUv3 in 16 nanometer or maybe 12 nanometer processes, and we presume Google did a shrink to 7 nanometers with TPUv4 while staying in the thermal envelope of 450 watts per socket that its TPUv3 pods required. We don’t think there is much thermal headroom to boost clock speed with TPUv4. Sorry. As it is, the memory increase could push it to 500 watts.
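For what it is worth, those clock guesses fall straight out of the peak flops math. Here is a quick back-of-envelope sketch – our own arithmetic, not anything Google has published – assuming each MXU retires 16,384 Bfloat operations per clock, with one MXU per core in TPUv2 and two in TPUv3:

```python
# Back-of-envelope TPU clock estimates from peak flops and flops-per-clock.
# Assumptions (ours, not Google's): each MXU retires 16,384 Bfloat
# operations per clock; TPUv2 has one MXU per core, TPUv3 has two.

FLOPS_PER_MXU_PER_CLOCK = 16_384

def clock_ghz(teraflops_per_core: float, mxus_per_core: int) -> float:
    """Implied clock in GHz, given peak teraflops per core."""
    return teraflops_per_core * 1e12 / (mxus_per_core * FLOPS_PER_MXU_PER_CLOCK) / 1e9

# TPUv2: 23 teraflops per core, one MXU per core
print(round(clock_ghz(23.0, 1), 2))   # ~1.4 GHz, near our 1.37 GHz guess

# TPUv3: 123 teraflops per chip / 2 cores = 61.5 teraflops per core, two MXUs
print(round(clock_ghz(61.5, 2), 2))   # ~1.88 GHz, near our 1.84 GHz guess
```

The small gaps between these implied clocks and our published guesses come down to rounding in the peak teraflops figures.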
In any event, with TPUv3, the process shrink allowed Google to put two MXUs on the scalar/vector unit, doubling the raw performance per core at constant clocks; we suspect that Google also goosed the clock speeds a bit. The TPUv3 had two cores per chip and doubled the memory up to 16 GB of HBM2 per core, compared to 8 GB per core with the TPUv2 chip.
So using our handy dandy slide rule and a 2X multiplier, we think Google has shrunk down to 7 nanometers and gotten four cores onto a die. It could do that with a monolithic TPUv4 chip, or it could play around with chiplets and create an interconnect that glues two or four chiplets together in a single socket. It really depends on how latency sensitive the workloads are within a socket. Because the HBM2 memory hangs off the MXUs, as long as the MXUs each have their own HBM2 controller, we don’t really think it matters much. So if we were doing this, and we wanted to increase the yield on the TPUv4 die and also lower the cost of the chips (while giving some of that back in chip packaging), we would take four TPUv3 cores and break them up into chiplets to make a TPUv4 socket. But it looks like Google is sticking with a monolithic design.
We would also push the thermals as high as possible. TPUv2 weighed in at 280 watts, and TPUv3 went up to 450 watts to deliver 123 teraflops of performance. (Which implies a 33.7 percent increase in clock speed from TPUv2 to TPUv3, but paying for it with a 60.7 percent increase in power, from 280 watts to 450 watts.)
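Here is that arithmetic, using our own (admittedly guessed) clock estimates; with the rounded clock numbers the gain comes out around 34 percent, in the same ballpark:

```python
# Sanity check on the TPUv2 -> TPUv3 clock and power tradeoff.
# The clock figures are our estimates; the wattages are the published figures.

v2_clock, v3_clock = 1.37, 1.84   # GHz, our estimates
v2_watts, v3_watts = 280, 450     # watts per socket

clock_gain = (v3_clock / v2_clock - 1) * 100   # ~34 percent
power_cost = (v3_watts / v2_watts - 1) * 100   # ~60.7 percent

print(f"clock up {clock_gain:.1f}%, power up {power_cost:.1f}%")
```

In other words, roughly a third more clock cost more than half again as much power – the usual story when a chip is pushed up the voltage/frequency curve.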
We believe that the HBM memory on the TPUv4 device has been doubled up, but the HBM2 memory per core could stay the same at 16 GB per core. That would be 64 GB per device, and that is a lot. (Yes, we know Nvidia can do 80 GB per device.) There is an outside chance that Google could push that up to 128 GB per device, or 32 GB per core. It really depends on the thermals and the cost. But what we do know for sure is that Google and other AI researchers really want more HBM2 memory on these devices. We think it is highly unlikely that the clock speed on the TPUv4 device will rise by much. Who wants a 600 watt part?
Now, let’s talk about that “More than 2X faster” comment above. Last July, Google put out some early data comparing the performance of TPUv4 against TPUv3 devices on the MLPerf suite of AI benchmarks. Take a look:
On various components of the MLPerf machine learning training benchmarks, the performance jump from TPUv3 machines with 64 chips (128 cores) to TPUv4 machines also with 64 chips (and 128 cores) ranged from 2.2X to 3.7X, and averaged about 2.7X across these five tests. So this could be the “More than 2X faster” that Pichai was talking about. But that is not what his chart shows. The difference between the 2X peak hardware capability and the average 2.7X jump in MLPerf performance is – you guessed it – software optimization.
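A quick bit of division shows how much software would have to be contributing, if you accept our assumption that the raw hardware peak only doubled:

```python
# If the hardware peak only doubled but measured MLPerf throughput jumped
# ~2.7X on average, the remainder must come from software. Rough decomposition
# under our assumption of a 2X hardware gain:

hardware_gain = 2.0   # 2X peak flops, TPUv3 -> TPUv4 (our assumption)
measured_gain = 2.7   # average across the five MLPerf tests cited

software_gain = measured_gain / hardware_gain
print(f"implied software contribution: {software_gain:.2f}X")  # 1.35X
```

A 1.35X uplift from compiler and library tuning alone is entirely plausible for a young architecture, which is why we suspect the MLPerf gap will widen further as the TPUv4 software stack matures.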
The TPU pods scale up as follows. Here is the TPUv2 pod:
And here is the TPUv3 pod:
The biggest TPUv2 instance had 512 cores and 4 TB of HBM2 memory, and the biggest TPUv3 instance had 2,048 cores and 32 TB of memory.
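The memory math on those pods is straightforward – cores times per-core HBM2 capacity, using the 8 GB and 16 GB per core figures from the table above:

```python
# HBM2 capacity at pod scale: cores x per-core memory, in TB (1 TB = 1,024 GB).

def pod_memory_tb(cores: int, gb_per_core: int) -> float:
    return cores * gb_per_core / 1024

print(pod_memory_tb(512, 8))     # TPUv2 pod: 4.0 TB
print(pod_memory_tb(2048, 16))   # TPUv3 pod: 32.0 TB
```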
Now, Pichai said that the TPUv4 pod will have “4,096 chips,” and assuming he does not mean cores, that could mean it has 4,096 sockets with a monolithic chip in each one. That squares with what Pichai said and puts the TPUv4 pod at just over 1 exaflops at Bfloat16 precision. (The TPUv2 pod only scaled to 256 chips and 11.8 petaflops, and the TPUv3 pod only scaled to 1,024 chips and 125.9 petaflops, by comparison.) That 1 exaflops assumes the clock speeds and thermals for the TPUv4 socket are about the same as for the TPUv3 socket and that Google quadrupled the socket count.
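Here is the pod-level peak math, under our assumption that a TPUv4 chip delivers twice TPUv3’s 123 teraflops:

```python
# Peak pod throughput across TPU generations, in petaflops. The 246 teraflops
# per TPUv4 chip is our assumption (2X the TPUv3 chip), not a Google figure.

def pod_petaflops(chips: int, teraflops_per_chip: float) -> float:
    return chips * teraflops_per_chip / 1000.0

print(pod_petaflops(256, 46))     # TPUv2 pod:  ~11.8 petaflops
print(pod_petaflops(1024, 123))   # TPUv3 pod: ~125.9 petaflops
print(pod_petaflops(4096, 246))   # TPUv4 pod: ~1,008 petaflops, just over 1 exaflops
```

So quadrupling the socket count while doubling per-chip throughput gets Google an 8X generational jump in peak pod performance without touching clocks or thermals.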
We also presume that a TPU instance will be able to scale across all 4,096 of those chips and sockets in a single system image, with at least 64 TB of aggregate HBM2 memory. And with software tweaks, more of that peak performance will be brought to bear on workloads. We will see how much when Google tells us more.
One more thing: Pichai also said that the TPUv4 pod has “10 times the large-scale chip interconnect bandwidth compared to any other network technology.” Comparing the TPUv4 server card to the TPUv3 card in the images above, it looks like each TPUv4 socket has its own network interface, where the TPUv3 card had four sockets sharing two interconnects. (Or, it looks that way. We are not sure that is correct. These could be two-port router chips.) We can’t wait to learn more about the TPUv4 interconnect.