Artificial intelligence (AI) and machine learning (ML) are about more than algorithms: The right hardware to turbocharge your AI and ML computations is key.
To speed up job completion, AI and ML training clusters need high bandwidth and dependable transport with predictable, low tail latency (tail latency refers to the slowest 1 or 2% of responses, which trail the rest). A high-performance interconnect can optimize data center and high-performance computing (HPC) workloads across your portfolio of hyperconverged AI and ML training clusters, resulting in lower latency for better model training, increased data packet utilization and lower operational costs.
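To see why the tail matters more than the average, consider a synchronized training step that cannot finish until every flow has arrived. The Python sketch below uses made-up per-flow transfer times (the numbers and the `percentile` helper are illustrative assumptions, not measurements from any real cluster) and compares the median, the 99th percentile and the resulting step time.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical transfer times in milliseconds for one training step:
# 98 flows finish quickly, 2 are delayed by congestion.
flow_times_ms = [10.0] * 98 + [45.0, 60.0]

median_ms = percentile(flow_times_ms, 50)
p99_ms = percentile(flow_times_ms, 99)
# A synchronized step waits for its slowest flow.
step_time_ms = max(flow_times_ms)

print(f"median={median_ms} ms, p99={p99_ms} ms, step time={step_time_ms} ms")
```

In this toy case the typical flow takes 10 ms, yet the step takes 60 ms because two stragglers hold everything up, which is why predictable tail latency translates directly into shorter job completion times.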
As AI and ML training jobs become more prevalent, it’s critical to have higher-radix switches, which reduce latency and power consumption, and higher port speeds for building bigger training clusters with a flat network topology.
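The link between radix and a flat topology comes down to simple arithmetic. As a rough sketch, assuming a non-blocking leaf-spine (Clos) design in which every switch has the same port count and half of each leaf's ports face hosts, the Python below estimates how many endpoints fit in two and three tiers. The radix values are illustrative and are not tied to any specific product.

```python
def two_tier_hosts(radix):
    """Max hosts in a non-blocking two-tier leaf-spine fabric."""
    hosts_per_leaf = radix // 2  # the other half of each leaf's ports go up to spines
    max_leaves = radix           # each spine port connects to a different leaf
    return max_leaves * hosts_per_leaf

def three_tier_hosts(radix):
    """Max hosts in a classic three-tier fat tree built from same-radix switches."""
    return radix ** 3 // 4

for radix in (64, 128, 256):
    print(radix, two_tier_hosts(radix), three_tier_hosts(radix))
```

Under these assumptions, doubling the radix quadruples the number of hosts a two-tier fabric can reach, so a larger cluster can avoid a third switching tier along with the extra hops, latency and power that tier would add.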
Ethernet switching for performance optimization
While network bandwidth requirements in data centers continue to rise dramatically, there is also a strong push to combine general compute and storage infrastructure with optimized AI and ML training processors. As a result, AI and ML training clusters — where you specify multiple machines for training — are driving the demand for fabrics with high-bandwidth connectivity, high radix and faster job completion while operating at high network utilization.
To speed up job completion, it’s critical to have effective load balancing to achieve high network utilization, as well as congestion-control mechanisms to achieve predictable tail latency. Virtualized and efficient data infrastructures, combined with capable hardware, can also improve CPU offloads and assist network accelerators in improving neural network training.
Ethernet-based infrastructures currently offer the best solution for a unified network. They combine low power with high bandwidth and radix, and the fastest serializer and deserializer (SerDes) speeds, with a predictable doubling of bandwidth every 18 to 24 months. With these advantages, as well as its large ecosystem, Ethernet can provide the highest performance interconnect per watt and dollar for AI and ML and cloud-scale infrastructure.
According to IDC, the global Ethernet switch market grew 12.7% year-on-year to $7.6 billion in the first quarter of 2022 (1Q22). Broadcom offers the Tomahawk family of Ethernet switches to enable the next generation of unified networks.
Today, San Jose-based Broadcom announced the StrataXGS Tomahawk 5 switch series, which offers 51.2 Tbps of Ethernet switching capacity in a single, monolithic device – more than double the bandwidth of its contemporaries, the company claims.
“Tomahawk 5 has twice the capacity of Tomahawk 4. As a result, it is one of the world’s fastest-switching chips,” said Ram Velaga, senior vice president and general manager of Broadcom’s core switching group. “The newly added specific features and capabilities to optimize performance for AI and ML networks make [the] Tomahawk 5 twice as fast as the previous version.”
The Tomahawk 5 switch chips are designed to help data centers and HPC environments accelerate AI and ML workloads. The chip combines a Broadcom approach known as cognitive routing with advanced shared packet buffering, programmable in-band telemetry and hardware-based link failover, all built into the silicon.
Cognitive routing optimizes network link utilization by automatically selecting the system’s least heavily loaded links for each flow that passes through the switch. This is especially important for AI and ML workloads, which frequently combine short- and long-lived high-bandwidth flows with low entropy.
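A toy comparison helps show why low-entropy traffic is a problem for static hashing. The Python sketch below is not Broadcom's algorithm: it contrasts a deliberately simple modulo hash (a stand-in for static hash-based path selection) with a greedy least-loaded choice, using four hypothetical 100 Gbps flows whose IDs happen to collide.

```python
def static_hash_placement(flows, n_links):
    """Place each flow by a toy hash (flow_id modulo link count), ignoring load."""
    loads = [0] * n_links
    for flow_id, gbps in flows:
        loads[flow_id % n_links] += gbps
    return loads

def least_loaded_placement(flows, n_links):
    """Place each flow on whichever link currently carries the least traffic."""
    loads = [0] * n_links
    for _, gbps in flows:
        target = loads.index(min(loads))
        loads[target] += gbps
    return loads

# Hypothetical (flow_id, Gbps) pairs: few flows, and their IDs all hash alike.
flows = [(0, 100), (4, 100), (8, 100), (12, 100)]

static_loads = static_hash_placement(flows, 4)
dynamic_loads = least_loaded_placement(flows, 4)
print("static hash:", static_loads)
print("least loaded:", dynamic_loads)
```

With the static hash, all four flows pile onto one link while three sit idle. The load-aware placement spreads them evenly, which is the effect the article attributes to picking the least heavily loaded link for each flow.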
“Cognitive routing is a step beyond adaptive routing,” Velaga said. “When using adaptive routing, you are only aware of data congestion between two points but are unaware of the other ends.”
Cognitive routing, he added, can make the system aware of conditions apart from the next neighbor, rerouting for an optimal path that provides better load balance while avoiding congestion.
Tomahawk 5 includes real-time dynamic load balancing, which monitors the use of all links at the switch and downstream in the network to determine the best path for each flow. It also monitors the status of hardware links and automatically redirects traffic away from failed connections. These features improve network utilization and reduce congestion, resulting in a shorter job completion time.
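The failover behavior can be pictured with a small simulation. This is a simplified sketch under stated assumptions: it tracks only local uplinks (the chip is described as also watching downstream links), it runs in software rather than hardware, and the function name, link names and flow sizes are all hypothetical.

```python
def reroute_on_failure(flows, link_load, failed_link):
    """Move flows off a failed link onto the least-loaded surviving links.

    flows maps flow name -> (link, gbps); link_load maps link -> total gbps.
    Returns new dictionaries and leaves the inputs unchanged.
    """
    new_flows = dict(flows)
    new_load = {link: gbps for link, gbps in link_load.items() if link != failed_link}
    if not new_load:
        raise RuntimeError("no surviving links")
    # Move the largest stranded flows first so they land on the emptiest links.
    stranded = sorted(
        (name for name, (link, _) in flows.items() if link == failed_link),
        key=lambda name: flows[name][1],
        reverse=True,
    )
    for name in stranded:
        gbps = flows[name][1]
        target = min(new_load, key=new_load.get)
        new_load[target] += gbps
        new_flows[name] = (target, gbps)
    return new_flows, new_load

flows = {
    "f1": ("uplink0", 300),
    "f2": ("uplink1", 100),
    "f3": ("uplink2", 200),
    "f4": ("uplink2", 50),
}
link_load = {"uplink0": 300, "uplink1": 100, "uplink2": 250}

new_flows, new_load = reroute_on_failure(flows, link_load, "uplink2")
print(new_flows)
print(new_load)
```

When uplink2 fails in this example, its two flows are redistributed across the remaining links rather than being dropped or dumped onto a single path, so the surviving links end up carrying comparable loads.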
The future of Ethernet for AI and ML infrastructures
Ethernet has the characteristics required for high-performance AI and ML training clusters: high bandwidth, end-to-end congestion management, load balancing and fabric management, at a lower cost than alternatives such as InfiniBand.
Ethernet is a robust ecosystem that continues to innovate at a rapid pace. “Ethernet is relentless, and I would expect it to continue encroaching on areas like AI/ML,” Craig Matsumoto, senior research analyst at 451 Research, told VentureBeat. “The reward is homogeneity – if I can run every workload on Ethernet, assuming the performance is good enough, I can have one homogenous network that all workloads can share. It’s simpler, and it buys me more redundant paths for forwarding traffic.”
Broadcom has shown that it will continue to improve its Ethernet switches to keep up with the pace of innovation happening in the AI and ML industry, and remain part of the HPC infrastructure into the future.