Nvidia’s new DGX SuperPOD can handle trillion-parameter AI models

Nvidia is launching its most powerful systems yet with the new DGX SuperPOD, part of a broad rollout of hardware and software at the Nvidia GTC conference today.

In recent years, the DGX line has become one of Nvidia’s primary server hardware and cloud offerings. The new DGX SuperPOD is powered by Blackwell, Nvidia’s next generation of GPUs for AI acceleration, which is being announced at GTC as the successor to the Hopper GPU. Nvidia is positioning Blackwell to support AI models with a trillion parameters.

The DGX SuperPOD integrates the GB200 superchip version of the Blackwell, which includes both CPU and GPU resources. Nvidia’s previous Grace Hopper generation of superchip is at the core of the prior generation of DGX systems. Existing DGX systems from Nvidia are already widely deployed for numerous use cases including drug discovery, healthcare, fraud detection, financial services, recommender systems and consumer internet.

“It’s a world-class supercomputing platform and it’s turnkey,” Ian Buck, VP of Hyperscale and HPC at Nvidia, said during a press briefing. “It supports Nvidia’s full AI software stack, providing unmatched reliability and scale.”

What’s inside a DGX SuperPOD?

While the term SuperPOD might seem like just a marketing superlative, the actual hardware that Nvidia is packing into its new DGX system is impressive.

A DGX SuperPOD isn’t a single rack server; it’s a combination of multiple DGX GB200 systems. Each DGX GB200 system features 36 Nvidia GB200 Superchips, which together include 36 Nvidia Grace CPUs and 72 Nvidia Blackwell GPUs, connected as a single supercomputer via fifth-generation Nvidia NVLink.

What makes the SuperPOD super is that it can be configured with eight or more DGX GB200 systems and can scale to tens of thousands of GB200 Superchips connected via Nvidia Quantum InfiniBand.

The system can deliver 240 terabytes of memory, which is critical for large language model (LLM) training and generative AI inference at a massive scale. Another impressive figure claimed by Nvidia is that the DGX SuperPOD has 11.5 exaflops of AI supercomputing power.
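The per-system figures above can be rolled up into SuperPOD-level totals with some simple arithmetic. The sketch below assumes that the 240 terabytes and 11.5 exaflops quoted by Nvidia refer to the minimum eight-system configuration, which the article does not state explicitly. The function and constant names are illustrative only.

```python
# Figures quoted in the article for one DGX GB200 system.
SUPERCHIPS_PER_SYSTEM = 36
CPUS_PER_SYSTEM = 36
GPUS_PER_SYSTEM = 72

# Figures quoted for the DGX SuperPOD as a whole (vendor claims).
TOTAL_MEMORY_TB = 240
TOTAL_AI_EXAFLOPS = 11.5

# ASSUMPTION: the totals above describe the minimum eight-system SuperPOD.
ASSUMED_SYSTEMS = 8


def superpod_summary(systems: int = ASSUMED_SYSTEMS) -> dict:
    """Derive aggregate and per-unit figures for a SuperPOD of `systems` systems."""
    gpus = systems * GPUS_PER_SYSTEM
    return {
        "superchips": systems * SUPERCHIPS_PER_SYSTEM,
        "cpus": systems * CPUS_PER_SYSTEM,
        "gpus": gpus,
        "gpus_per_superchip": GPUS_PER_SYSTEM // SUPERCHIPS_PER_SYSTEM,
        "memory_tb_per_system": TOTAL_MEMORY_TB / systems,
        # 1 exaflop = 1,000 petaflops
        "petaflops_per_gpu": TOTAL_AI_EXAFLOPS * 1000 / gpus,
    }


if __name__ == "__main__":
    for name, value in superpod_summary().items():
        print(f"{name}: {value}")
```

Under that assumption, an eight-system SuperPOD holds 576 Blackwell GPUs, and the claimed 11.5 exaflops works out to roughly 20 petaflops of AI compute per GPU.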

Advanced networking and data processing units enable gen AI SuperPOD fabric

A core element of what makes a DGX SuperPOD super is the fact that so many GB200 systems can be connected together with a unified compute fabric.

Powering that fabric is the newly announced Nvidia Quantum-X800 InfiniBand networking technology. This architecture provides up to 1,800 gigabytes per second of bandwidth to each GPU in the platform.

The DGX also integrates Nvidia BlueField-3 data processing units (DPUs) and the fifth-generation Nvidia NVLink interconnect.

Additionally, the new SuperPOD includes fourth-generation Nvidia Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology. According to Nvidia, the new version of SHARP delivers 14.4 teraflops of in-network computing, a 4x increase over the previous generation of the DGX SuperPOD architecture.

Blackwell coming to Nvidia DGX Cloud

The new GB200-based DGX systems are also coming to the Nvidia DGX Cloud service.

The GB200 capabilities will be available first on Amazon Web Services (AWS), Google Cloud and Oracle Cloud.

“DGX Cloud is our cloud that we partnered deeply and co-designed with our cloud partners to provide the best Nvidia technology for Nvidia’s own use for our own AI research and development in our products, but also to make available to our customers,” Buck said.

The new GB200 will also help advance Project Ceiba, the supercomputer that Nvidia has been developing with AWS and first announced in November 2023. Project Ceiba is an effort to use DGX Cloud to create the world’s largest public cloud supercomputing platform.

“I’m pleased to announce that Project Ceiba has skipped ahead, we’ve now upgraded it to be Grace Blackwell supporting 20,000 GPUs,” Buck said. “It will now deliver over 400 exaflops of AI.”
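As a rough sanity check on the Project Ceiba numbers, the quoted total can be divided by the quoted GPU count. This is a back-of-the-envelope sketch: the helper name is hypothetical, the inputs are the vendor-claimed figures from Buck’s quote, and the article does not state the numeric precision behind the exaflops figure.

```python
def petaflops_per_gpu(total_exaflops: float, gpu_count: int) -> float:
    """Convert a cluster-wide exaflops claim into petaflops per GPU."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    # 1 exaflop = 1,000 petaflops
    return total_exaflops * 1000 / gpu_count


# Figures from the quote: "20,000 GPUs" and "over 400 exaflops of AI".
CEIBA_GPUS = 20_000
CEIBA_EXAFLOPS = 400  # stated as a lower bound ("over 400")

if __name__ == "__main__":
    per_gpu = petaflops_per_gpu(CEIBA_EXAFLOPS, CEIBA_GPUS)
    print(f"At least {per_gpu:.1f} petaflops per GPU")
```

The result is at least 20 petaflops per GPU, which suggests the Ceiba claim is based on the same kind of per-GPU throughput figure as the DGX SuperPOD numbers.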

Originally appeared on: TheSpuzz
