Nvidia Megatron: Not a robot in disguise, but a large language model that’s getting faster

In the fictional Transformers universe, Megatron is an evil robot bent on dominating his rivals. Nvidia’s Megatron has no such insidious ambitions; its more altruistic goal is to enable better, faster large language models (LLMs).

A transformer in the AI world is not a robot that turns into a vehicle; rather, it is a type of deep learning architecture used for natural language processing (NLP). The Nvidia NeMo Megatron framework for LLMs is now being updated to help organizations train models faster than ever before, with updates to the underlying open-source Megatron-LM transformer technology. Nvidia claims the new updates will speed up training by 30% for models as large as 1 trillion parameters.

“Large language models are very interesting to the research community today,” Ujval Kapasi, VP of deep learning software at Nvidia, told VentureBeat. “Once you pretrain a large language model that has enough parameters, and I’m talking about like into the hundreds of billions of parameters, it takes on this property where it can effectively execute multiple types of language tasks, without having to be retrained individually for every single task.”

More power for even larger large language models

Megatron is currently in what Nvidia refers to as “early access,” but it’s already being used to train some of the largest models on the planet.

Megatron was used to help train BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), which was released on July 12 with support for 46 human languages and 13 programming languages.

“People are using it to efficiently train large models of up to a trillion parameters; these large language models run on clusters of GPUs,” Kapasi said. “Our stack is specifically optimized for Nvidia DGX SuperPODs, but the stack also works well on cloud systems.”

As a framework, NeMo Megatron is a “top-to-bottom” stack, according to Kapasi, meaning it includes GPU-accelerated machine learning libraries as well as hardware and networking optimizations for cluster deployments. At the foundational layer, Kapasi explained, NeMo Megatron is built on top of the open-source PyTorch machine learning framework.
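
To make that foundation concrete, here is a minimal, generic example of the kind of PyTorch building block such a stack sits on top of; it is stock PyTorch, not NeMo Megatron code.

```python
# Generic PyTorch transformer building block, shown only to illustrate the
# foundation NeMo Megatron builds on; it is not part of the NeMo codebase.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
tokens = torch.randn(4, 128, 512)   # (batch, sequence, hidden)
out = layer(tokens)                 # same shape: (4, 128, 512)
print(out.shape)
```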

Large language models aren’t just for large research organizations, either; they are also finding a home within enterprises. Kapasi commented that enterprises may want to take a pretrained model and then adapt it for their own use cases. Common enterprise deployments include things like chatbots, as well as question-and-answer services.
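
As a hedged sketch of that adaptation workflow, the example below serves extractive question answering from a pretrained checkpoint using the open-source Hugging Face Transformers library as a stand-in; the model name is illustrative, and this is not how NeMo Megatron itself is invoked.

```python
# Minimal sketch of reusing a pretrained language model for a Q&A service.
# Uses Hugging Face Transformers as a stand-in; NeMo Megatron exposes its own
# APIs for this, which are not shown here.
from transformers import pipeline

# Illustrative checkpoint choice; any extractive-QA model would do.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What does the framework accelerate?",
    context="NeMo Megatron is a framework for training large language models "
            "on clusters of GPUs.",
)
print(result["answer"], result["score"])
```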

It’s not Energon making Megatron faster, it’s math

The fictional Megatron is powered by a substance known as “Energon,” but when it comes to Nvidia’s Megatron, it’s mostly math. That math, along with the way compute, memory and process parallelization are handled, is now being improved in Megatron to make training much faster.

“Basically, the main impact of these new features is that you can train larger models more efficiently and the way they do that is by both reducing the amount of memory required during the training process and reducing the amount of computation required,” Kapasi said.

One of the new features is a technique called selective activation recomputation. Kapasi explained that within an AI transformer, there is a need to maintain process state in memory. Some pieces of that state take up a disproportionately large amount of memory, yet require only a small fraction of the overall compute resources to regenerate. Nvidia has now figured out how to identify those items and recompute them as needed, rather than keeping them in memory continuously, providing better overall efficiency.
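
A minimal sketch of the general idea, in plain PyTorch rather than Nvidia’s actual implementation: cheap-to-recompute but memory-hungry pieces of a layer (here, the attention softmax and dropout) are wrapped in torch.utils.checkpoint, so their activations are discarded after the forward pass and rebuilt during backpropagation, while the expensive matrix multiplications keep their stored outputs.

```python
# Illustrative sketch of selective activation recomputation in plain PyTorch.
# This is NOT Nvidia's NeMo Megatron code; it only demonstrates the principle:
# recompute cheap, memory-hungry ops (softmax/dropout) in the backward pass,
# while keeping the activations of the expensive matrix multiplications.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class SelectiveRecomputeAttention(nn.Module):
    def __init__(self, dim: int, p_drop: float = 0.1):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)   # expensive matmul: keep its output
        self.proj = nn.Linear(dim, dim)      # expensive matmul: keep its output
        self.drop = nn.Dropout(p_drop)
        self.scale = dim ** -0.5

    def _scores(self, q, k, v):
        # Cheap-to-recompute, memory-heavy part: attention probabilities.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.drop(attn) @ v

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # checkpoint() discards the softmax/dropout activations after the
        # forward pass and recomputes them during backward, trading a small
        # amount of compute for a large memory saving.
        out = checkpoint(self._scores, q, k, v, use_reentrant=False)
        return self.proj(out)


x = torch.randn(2, 128, 64, requires_grad=True)  # (batch, seq, dim)
layer = SelectiveRecomputeAttention(64)
layer(x).sum().backward()  # gradients flow through the recomputed block
```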

The other new feature that helps accelerate Megatron is called sequence parallelism. With very large LLMs, the parameters cannot all fit on a single GPU, so they are distributed across multiple GPUs using various parallel processing techniques. Kapasi explained that the new sequence parallelism approach is more efficient than prior approaches, requiring less compute and memory.
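
A toy, single-process sketch of the concept (again, not the Megatron-LM implementation): operations that treat each token independently, such as layer normalization, can run on a per-GPU slice of the sequence, so no single device has to hold activations for the full sequence. The “devices” below are simulated as chunks of one tensor.

```python
# Toy illustration of sequence parallelism (not Nvidia's implementation).
# Token-wise ops such as LayerNorm are independent along the sequence axis,
# so each (simulated) device can hold and normalize only its own slice of
# the sequence, shrinking per-device activation memory.
import torch
import torch.nn as nn

world_size = 4                      # pretend we have 4 GPUs
batch, seq_len, hidden = 2, 1024, 512
x = torch.randn(batch, seq_len, hidden)

norm = nn.LayerNorm(hidden)

# "Scatter" the sequence dimension across the simulated devices.
shards = x.chunk(world_size, dim=1)          # each shard: (2, 256, 512)

# Each device normalizes only its own shard of tokens.
local_outputs = [norm(shard) for shard in shards]

# An all-gather along the sequence dimension reassembles the full activation
# only where a subsequent op (e.g., attention) needs every token.
y_parallel = torch.cat(local_outputs, dim=1)

# Sanity check: identical to running LayerNorm on the whole sequence at once.
assert torch.allclose(y_parallel, norm(x), atol=1e-6)
```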

“These new improvements are not some fancy memory allocation system,” Kapasi said. “It’s more about understanding the math inside the transformer and taking advantage of the properties of the math to more efficiently use the memory and the computation resources we have.”
