Cerebras Systems sets record for largest AI models ever trained on one device

Cerebras Systems said it has set the record for the largest AI models ever trained on a single device, which in this case is a giant silicon wafer with hundreds of thousands of cores.

I could say that this is the record for a single chip, but Cerebras makes one large chip out of an 8.5-inch-wide silicon wafer that would normally be sliced into hundreds of chips. So the word “device” will have to do, as no one else makes such a huge chip with 850,000 cores and 2.6 trillion transistors.

The advantage of a dinner-plate-sized wafer

The Cerebras CS-2 system can train multibillion-parameter natural language processing (NLP) models, including GPT-3XL at 1.3 billion parameters, as well as GPT-J 6B, GPT-3 13B and GPT-NeoX 20B. Cerebras said that for the first time ever, a single CS-2 system with one Cerebras wafer can train models with up to 20 billion parameters, a feat not possible on any other single device. A CS-2 system fits inside a standard datacenter rack and is about 26 inches tall.

By enabling a single CS-2 to train these models, Cerebras reduces the system-engineering time necessary to run large NLP models from months to minutes. It also eliminates one of the most painful aspects of NLP — namely the partitioning of the model across hundreds or thousands of small graphics processing units (GPUs). 

“It takes about 16 keystrokes to set up,” Andrew Feldman, CEO of Cerebras Systems, said in an interview.

The disadvantage of using GPUs with AI models

Feldman explained that bigger models have been shown to be more accurate for NLP. But few companies have the resources and expertise for the painstaking job of breaking up these large models and spreading them across hundreds or thousands of GPUs, the computing rivals to Cerebras’ devices.

“It means every network has to be reorganized, redistributed, and all the work done again, for every cluster,” he said. “If you want to change even one GPU in that cluster, you have to redo all the work. If you want to take the model to a different cluster, you redo the work. If you want to take a new model to this cluster, you have to redo the work.”

Cerebras is democratizing access to some of the biggest models in the AI ecosystem, Feldman said.

“GSK generates extremely large datasets through its genomic and genetic research, and these datasets require new equipment to conduct machine learning,” said Kim Branson, senior vice president of AI and machine learning at GSK, in a statement. “The Cerebras CS-2 is a critical component that allows GSK to train language models using biological datasets at a scale and size previously unattainable. These foundational models form the basis of many of our AI systems and play a vital role in the discovery of transformational medicines.” 

These capabilities are made possible by a combination of the size and computational resources of the Cerebras Wafer Scale Engine-2 (WSE-2) and the Weight Streaming software architecture extensions introduced in release R1.4 of the Cerebras Software Platform, CSoft.

Cerebras’ CS-2 wafer-size chip.

When a model fits on a single processor, AI training is easy, Feldman said. But when a model has either more parameters than can fit in memory, or a layer requires more compute than a single processor can handle, complexity explodes. The model must be broken up and spread across hundreds or thousands of GPUs. This process is painful, often taking months to complete.
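
To see why, consider a rough back-of-the-envelope estimate (standard mixed-precision training arithmetic, not a Cerebras figure): a 20-billion-parameter model needs far more memory for its weights and optimizer state than any single GPU offers, even before activations are counted.

```python
# Rough, illustrative estimate of training-memory footprint for a 20B-parameter
# model with mixed-precision Adam (the commonly cited ~16 bytes per parameter:
# fp16 weights + fp16 gradients + fp32 master weights, momentum, and variance).
# Activation memory is ignored, so the real requirement is even larger.

params = 20e9                          # e.g., GPT-NeoX 20B
bytes_per_param = 2 + 2 + 4 + 4 + 4    # weights, grads, master weights, Adam m, Adam v

total_gb = params * bytes_per_param / 1e9
gpu_memory_gb = 80                     # largest single-GPU memory at the time (A100 80GB)

print(f"~{total_gb:.0f} GB of training state vs. {gpu_memory_gb} GB on one GPU")
print(f"=> at least {total_gb / gpu_memory_gb:.0f}x more memory than a single GPU holds")
```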

“We’ve taken something that currently takes the ML community months to do and we’ve turned it into 16 keystrokes,” Feldman said.

Reducing the need for systems engineers

To make matters worse, the process is unique to each pairing of neural network and compute cluster, so the work is not portable to different compute clusters or across neural networks. It is entirely bespoke, which is why companies publish papers when they pull off this accomplishment, Feldman said. It’s a huge systems-engineering problem, and it’s not something that machine learning experts are trained to do.

“Our announcement brings to any organization access to the largest models by showing they can be trained quickly and easily on a single device,” Feldman said.

He said it is hard to do this on a cluster of GPUs because “spreading a large neural network over a cluster of GPUs is profoundly difficult.”

He added, “It’s a multidimensional Tetris problem, where you have to break up compute and memory and communication and distribute them across hundreds or thousands of graphics processing units.”
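
As a rough illustration of one dimension of that Tetris problem, the sketch below shows generic tensor parallelism, splitting a single layer’s weight matrix across two simulated devices. It is a textbook technique sketched in plain NumPy, not Cerebras’ or any GPU vendor’s actual API.

```python
import numpy as np

# Minimal illustration of tensor (model) parallelism: one layer's weight matrix
# is split column-wise across two simulated devices; each device computes a
# partial output, and the pieces are gathered back together. On a real cluster
# that gather is a collective communication step that must be planned per layer.

x = np.random.randn(4, 512)       # a batch of activations
w = np.random.randn(512, 1024)    # one layer's weights

w_dev0, w_dev1 = np.split(w, 2, axis=1)   # partition the weights across two "devices"

y_dev0 = x @ w_dev0               # each device multiplies the same input
y_dev1 = x @ w_dev1               # by its own shard of the weights

y_full = np.concatenate([y_dev0, y_dev1], axis=1)
assert np.allclose(y_full, x @ w)  # gathering the shards reproduces the full result
```

Every such split has to be renegotiated whenever the model or the cluster changes, which is the portability problem Feldman describes.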

The largest processor ever built

Cerebras has a number of supercomputing customers.

The Cerebras WSE-2 is the largest processor ever built. It is 56 times larger than the largest GPU, has 2.55 trillion more transistors, and has 100 times as many compute cores. The size and computational resources on the WSE-2 enable every layer of even the largest neural networks to fit. The Cerebras Weight Streaming architecture disaggregates memory and compute, allowing memory (which is used to store parameters) to grow separately from compute. Thus a single CS-2 can support models with hundreds of billions, even trillions, of parameters.

“Just by way of reminder, when we say we’re big, we have 123 times more cores and 1,000 times more memory and 12,000 times more memory bandwidth” than a GPU solution, Feldman said. “And we invented a technique called weight streaming, where we could keep memory off chip disaggregated from the wafer.”

Graphics processing units, on the other hand, have a fixed amount of memory per GPU, Feldman said. If the model requires more parameters than fit in memory, one needs to buy more graphics processors and then spread work over multiple GPUs. The result is an explosion of complexity. The Cerebras solution is far simpler and more elegant: by disaggregating compute from memory, the Weight Streaming architecture allows support for models with any number of parameters to run on a single CS-2. 
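
The following is a conceptual sketch of the weight-streaming idea as described above, an illustration rather than Cerebras’ actual software: weights live in external memory and flow through the compute engine one layer at a time, so the device never has to hold the whole model at once.

```python
import numpy as np

# Conceptual sketch of weight streaming (an illustration only, not Cerebras'
# implementation): parameters are kept in external memory and streamed to the
# compute engine one layer at a time, so device memory only ever holds one
# layer's weights plus the activations.

rng = np.random.default_rng(0)

# "External memory": the full model's weights, stored off the compute device.
external_memory = [rng.standard_normal((256, 256)) for _ in range(12)]

def forward_with_weight_streaming(x, layer_store):
    for layer_weights in layer_store:   # stream in one layer's weights at a time
        x = np.tanh(x @ layer_weights)  # compute with only this layer resident
        # this layer's weights can now be evicted; only activations are kept
    return x

activations = forward_with_weight_streaming(rng.standard_normal((8, 256)), external_memory)
print(activations.shape)  # (8, 256)
```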

Revolutionizing setup time and portability

Powered by the computational capacity of the WSE-2 and the architectural elegance of the Weight Streaming architecture, Cerebras is able to support, on a single system, the largest NLP networks, Feldman said. By supporting these networks on a single CS-2, Cerebras reduces setup time to minutes and enables model portability. One can switch between GPT-J and GPT-Neo, for example, with a few keystrokes, a task that would take months of engineering time to achieve on a cluster of hundreds of GPUs. 
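
A hypothetical sketch of what that “few keystrokes” switch could look like in practice follows; the field names and values are illustrative only and do not reflect Cerebras’ actual configuration format.

```python
# Hypothetical illustration of the "few keystrokes" claim: when one device can
# hold any of these models, switching between them is a configuration change
# rather than a re-partitioning project. The schema below is illustrative only,
# not Cerebras' actual configuration format; the parameter counts, layer counts,
# and hidden sizes are the published figures for these open models.

MODEL_CONFIGS = {
    "gpt-j-6b":     {"n_params": 6e9,  "n_layers": 28, "d_model": 4096},
    "gpt-neox-20b": {"n_params": 20e9, "n_layers": 44, "d_model": 6144},
}

def build_training_job(model_name: str) -> dict:
    # On a GPU cluster, changing model_name would also mean redesigning the
    # partitioning strategy; on a single device it is just a lookup.
    return {"model": model_name, **MODEL_CONFIGS[model_name]}

print(build_training_job("gpt-neox-20b"))
```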

Cerebras claims big advantages over GPUs.

“Cerebras’ ability to bring large language models to the masses with cost-efficient, easy access opens up an exciting new era in AI. It gives organizations that can’t spend tens of millions an easy and inexpensive on-ramp to major league NLP,” said Dan Olds, chief research officer at Intersect360 Research, in a statement. “It will be interesting to see the new applications and discoveries CS-2 customers make as they train GPT-3 and GPT-J class models on massive datasets.”

Worldwide adoption

Cerebras has customers in North America, Asia, Europe and the Middle East. It is delivering AI solutions to a growing roster of customers in the enterprise, government and high-performance computing (HPC) segments including GSK, AstraZeneca, TotalEnergies, nference, Argonne National Laboratory, Lawrence Livermore National Laboratory, Pittsburgh Supercomputing Center, Leibniz Supercomputing Centre, National Center for Supercomputing Applications, Edinburgh Parallel Computing Centre (EPCC), National Energy Technology Laboratory, and Tokyo Electron Devices. 

“Not only do we have these customers, but they’re out there saying really nice things about us,” said Feldman. “AstraZeneca said training which used to take two weeks on clusters of GPUs, we accomplished in a few days.”

GSK said Cerebras was able to perform work 10 times faster than 16 GPUs.

“Lots of cool customers are solving interesting problems,” said Feldman. “The amount of compute used in these big language models has been growing exponentially. And these language models have gotten so large that only a tiny portion of the market can train them. We have a change that gives the vast majority of the economy the ability to train these models, giving any organization access to the largest models.”

Originally appeared on: TheSpuzz
