Customers now have access to Google’s custom hardware, including its Axion CPU and the latest Trillium TPU, through its cloud service. Alongside this announcement, Google gave a sneak peek at Nvidia’s Blackwell platform, which is set to join Google Cloud early next year.
Mark Lohmeyer, Vice President and General Manager for Compute and AI Infrastructure at Google Cloud, wrote in a blog post that the company is excited about the potential of Nvidia’s GB200 NVL72 rack-scale systems and that it looks forward to sharing more updates soon.
Google is already preparing its cloud infrastructure to support Blackwell, and it seems the company is moving away from the previous head-to-head comparisons between its TPUs and Nvidia’s GPUs. Instead, Google is making efforts to integrate Nvidia’s AI hardware more seamlessly into its cloud, with the introduction of a new network adapter designed to connect with Nvidia’s hardware.
Google aims to create a smooth and unified hardware and software experience in its cloud service, ensuring that customers can use different technologies without disruption.
This shift in approach is part of a broader trend in the chip industry, where rivals are putting aside their differences. AMD and Intel recently collaborated to keep x86 relevant in the AI space, and now Google is positioning itself to offer both its hardware and Nvidia’s hardware for inference, recognizing that diversity in cloud services is beneficial for business.
The demand for AI hardware is massive, and Nvidia’s GPUs are in short supply. As a result, customers are increasingly turning to Google’s TPUs.
Google’s new Trillium TPU, which replaces the TPU v5, is now available in preview and offers significant performance gains. Trillium is essentially a TPU v6, arriving just a year after the TPU v5. That rapid cadence is notable given the usual three- to four-year gap between previous generations.
Trillium delivers up to 4.7 times the peak compute performance of the TPU v5e when processing BF16 data. That works out to a theoretical peak of 925.9 teraflops for Trillium, compared with 197 teraflops for the TPU v5e, though real-world performance always falls short of theoretical peaks.
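The multiplier and the absolute figures line up; a quick sanity check in Python, where the only inputs are Google’s stated 197-teraflop v5e baseline and the 4.7x factor:

```python
# Sanity check of Google's stated BF16 peak figures for Trillium vs. TPU v5e.
tpu_v5e_peak_tflops = 197.0   # TPU v5e peak BF16 compute, per Google
trillium_gain = 4.7           # Google's claimed generational multiplier

trillium_peak_tflops = tpu_v5e_peak_tflops * trillium_gain
print(f"Trillium theoretical peak: {trillium_peak_tflops:.1f} TFLOPS")
# -> Trillium theoretical peak: 925.9 TFLOPS
```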
Google has shared several real-world AI benchmarks to highlight Trillium’s improvements. For example, text-to-image inference with Stable Diffusion XL is 3.1 times faster on Trillium than on the TPU v5e. Training the 27-billion-parameter Gemma 2 model is four times faster, while training the 175-billion-parameter GPT-3 is about three times faster.
Trillium also features numerous enhancements, including double the HBM memory of the TPU v5e, which had 16GB of HBM2 capacity. Google didn’t specify whether Trillium uses HBM3 or HBM3e, which are found in Nvidia’s H200 and Blackwell GPUs. HBM3e offers greater memory bandwidth than HBM2.
Additionally, Trillium’s inter-chip interconnect bandwidth has been doubled compared to the TPU v5e. Google’s infrastructure allows supercomputers to be built from tens of thousands of Trillium chips, using a technology called Multislice that distributes large AI workloads across multiple chips while maintaining high efficiency and uptime.
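Multislice itself operates at the infrastructure level rather than as something customers call directly, but the single-program, multi-chip model it builds on is visible in JAX, the framework most commonly paired with TPUs. Here is a minimal, illustrative sketch of sharding one workload across every visible chip; the mesh shape and array sizes are arbitrary, not Trillium-specific:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange all visible accelerator chips into a one-dimensional logical mesh.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard a large batch along dimension 0: each chip holds one slice.
batch = jnp.ones((8192, 1024))
sharded = jax.device_put(batch, NamedSharding(mesh, P("data", None)))

# The jit-compiled function runs on every chip against its local shard;
# the compiler inserts cross-chip communication (here, for the mean) itself.
mean_activation = jax.jit(lambda x: jnp.tanh(x).mean())(sharded)
print(mean_activation)
```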
Trillium also benefits from third-generation SparseCores, accelerators positioned near the high-bandwidth memory that speed up the embedding-heavy processing common in ranking and recommendation workloads.
Google’s Axion CPUs, designed to pair with Trillium, are now available for use in virtual machines (VMs) for AI inference. These Arm-based CPUs are offered in Google’s C4A VM instances and promise up to 65% better price-performance and up to 60% better energy efficiency than comparable x86-based instances for tasks such as web serving, analytics, and database management.
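From a provisioning standpoint, C4A behaves like any other Compute Engine machine type. A hedged sketch using the google-cloud-compute Python client follows; the c4a-standard-4 machine type and the arm64 Debian image family follow Google’s published naming, while the project, zone, and instance name are placeholders:

```python
from google.cloud import compute_v1  # pip install google-cloud-compute

def create_axion_vm(project: str, zone: str, name: str) -> None:
    """Provision a C4A (Axion, Arm-based) VM with an arm64 boot image."""
    instance = compute_v1.Instance()
    instance.name = name
    instance.machine_type = f"zones/{zone}/machineTypes/c4a-standard-4"

    # The boot disk must use an arm64 image to match the Axion CPU.
    boot_disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-12-arm64"
        ),
    )
    instance.disks = [boot_disk]
    instance.network_interfaces = [
        compute_v1.NetworkInterface(network="global/networks/default")
    ]

    operation = compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance
    )
    operation.result()  # Block until provisioning completes.

# Example (placeholders): create_axion_vm("my-project", "us-central1-a", "axion-vm")
```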
However, it’s worth noting that for more demanding workloads, such as databases and ERP applications, a more powerful x86 chip may be needed. Independent benchmark comparisons between Google Cloud Axion and x86 instances are available from Phoronix.
Nvidia’s H200 GPU is now available in Google Cloud’s A3 Ultra virtual machines, and Google has developed a direct connection between its hardware infrastructure and Nvidia’s hardware via high-speed networking. The core of this system is Titanium, a hardware interface designed to optimize workload, traffic, and security management.
Google has introduced a new Titanium ML network adapter, which builds on Nvidia’s ConnectX-7 hardware to support Virtual Private Clouds (VPCs), traffic encryption, and virtualization. Lohmeyer noted that while Titanium’s capabilities benefit AI infrastructure, the unique performance needs of AI workloads require special consideration for accelerator-to-accelerator communication.
The Titanium ML adapter creates a virtualization layer that allows Google Cloud to run a virtual private cloud environment while leveraging Nvidia’s hardware for AI workloads. However, it’s still unclear whether the Titanium ML interface will enable customers to switch between Google’s Trillium and Nvidia GPUs within the same AI workloads. Lohmeyer previously mentioned that this could be made possible through containers, though Google has yet to provide further details.
Nvidia’s hardware offers a proven blueprint for GPU-optimized offload systems, and Google has its own system for managing GPU workloads in its cloud. The Hypercomputer interface, for instance, includes a “Calendar” consumption model that reserves capacity for fixed time windows, along with a “Flex Start” model for jobs with more flexible start and completion times.
Lastly, Google announced the Hypercluster, a system that enables customers to deploy predefined AI and HPC workloads with a single API call, automating the network, storage, and compute management that can be complex to handle manually. Google is also adopting the open-source SLURM (Simple Linux Utility for Resource Management) scheduler to give customers more control over their HPC clusters, though further details on its integration into Hypercluster have yet to be revealed.