Nvidia is sharpening its focus on the Top500 list of the world’s fastest supercomputers, but with a twist: rather than simply showcasing the performance of its GPUs, the company is highlighting its networking technologies. Two new supercomputing systems featuring Nvidia’s Spectrum-X Ethernet networking have debuted in the top 50 of the latest Top500 list.
These systems are Nvidia’s own Israel-1, which achieves 42 petaflops, and a 38-petaflop system deployed by GMO Internet Group. Both were constructed by Dell and pair the BlueField-3 data processing unit with the Spectrum-X800 switch. Israel-1 uses 936 H100 GPUs, while the GMO system uses 768 H200 GPUs.
Nvidia’s director of accelerated computing, Dion Harris, emphasized that these systems are just the first of many more to come. For its latest Hopper and Blackwell GPUs, Nvidia backs InfiniBand as the high-bandwidth, low-latency interconnect for AI and high-performance computing (HPC) systems. Spectrum-X, however, offers an Ethernet-based path to scale out beyond the NVLink domain, enabling larger system expansions.
Harris also highlighted xAI’s Colossus system, which uses 100,000 Nvidia H100 GPUs and is built on Spectrum-X. He noted that Spectrum-X’s Ethernet interconnect sustains an impressive 95% of theoretical throughput, compared with roughly 60% for traditional Ethernet, while maintaining zero latency degradation and avoiding packet loss across all three tiers of the network.
On the Top500 list itself, Nvidia is emphasizing the role of networking in driving AI performance. While many top systems, such as Frontier, Aurora, and LUMI, use HPE’s Slingshot interconnect, Nvidia’s InfiniBand appears in systems like Microsoft’s Eagle and others ranked in the top ten. The company is focused on submitting systems that highlight its networking capabilities, aiming to demonstrate how Spectrum-X can support large-scale installations. With the Top500’s systems aging and new submissions declining, Nvidia’s approach is a timely push to refresh the list.
AI is increasingly becoming a part of scientific workloads, and many upcoming discussions at Supercomputing 2024 will focus on mixed-precision benchmarking, a topic of growing importance in AI research.
Nvidia’s Blackwell GPU is progressing well after some initial design challenges earlier in the year. The company is preparing for several partners to announce Blackwell-based servers at SC2024. As a successor to the highly demanded Hopper GPU, Blackwell promises significantly better performance, though it generates more heat.
The company recently launched the GB200 NVL4 server, which houses up to four Nvidia Blackwell GPUs paired with two Grace CPUs and is designed for AI and HPC workloads. The server is more efficient than its predecessor, the GH200 NVL4, delivering a 2.2x speedup in simulations and a 1.8x speedup in inference tasks; Nvidia cites Llama2-7B inference at FP16 precision as one example of the gain.
Additionally, Nvidia released MLPerf benchmarks showing impressive improvements with Blackwell over its predecessor. The Nyx supercomputer, based on DGX B200 systems, demonstrated 2.2x faster LLM fine-tuning and double the LLM pretraining performance compared to the H100 Tensor Core GPU.
In other news, Nvidia introduced new software and microservices aimed at bringing AI into research workflows. Among the new offerings are Nvidia Inference Microservices (NIMs), containerized services designed to run AI inference on GPUs. These include containers for BioNeMo models, which support drug discovery and biological research, as well as a new Earth-2 NIM for CorrDiff, a weather-forecasting model.
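Language-model NIM containers expose an OpenAI-compatible HTTP API once deployed. The sketch below builds such a request; the port, endpoint path, and model identifier are illustrative assumptions rather than details from the article, and the actual HTTP call is omitted so the sketch stays offline.

```python
import json

# Assumed local endpoint for a deployed LLM NIM container; the port,
# path, and model id below are placeholders, not values from the article.
NIM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta/llama-3.1-8b-instruct",  # placeholder model id
    "messages": [
        {"role": "user", "content": "Summarize this protein family."}
    ],
    "max_tokens": 128,
}

# Serialize the OpenAI-style request body; POSTing `body` to NIM_URL
# would return a standard chat-completion JSON response.
body = json.dumps(payload)
print(body)
```

Because the API mirrors the OpenAI schema, existing client code can typically be pointed at a NIM endpoint by changing only the base URL.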
The company also launched cuPyNumeric, a drop-in replacement for NumPy that automatically distributes Python workloads across CPUs and GPUs for faster performance. The tool is designed to scale seamlessly across computing environments, from a single GPU to multi-node systems, and supports all Nvidia GPU generations.
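Because cuPyNumeric mirrors the NumPy API, adopting it is mostly a matter of swapping the import. A minimal sketch, assuming the documented `cupynumeric` module name and falling back to plain NumPy when the library (or a GPU) is unavailable:

```python
# Drop-in usage sketch: the same array code runs on either backend.
try:
    import cupynumeric as np  # distributes work across CPUs/GPUs
except ImportError:
    import numpy as np  # unchanged code path on plain NumPy

a = np.arange(16, dtype=np.float64)
b = np.sqrt(a) * 2.0  # elementwise ops on whichever backend is active

assert b[4] == 4.0  # sqrt(4) * 2
print(b.sum())
```

The appeal of the drop-in design is that existing NumPy scripts need no rewrite: the same source scales from a laptop CPU to a multi-node GPU cluster.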
Finally, Nvidia unveiled an Omniverse reference design for computer-aided design (CAD), which allows engineers to simulate, test, and design products more efficiently. This tool is available through all major cloud providers.