Graphcore claims its IPU-POD outperforms Nvidia A100 in model training

Bristol-headquartered Graphcore, a startup developing chips and systems to accelerate AI workloads, appears to be taking on category-leader Nvidia with significant improvements in performance and efficiency.

In the latest MLPerf results, Graphcore said its IPU-POD16 server outperformed Nvidia’s DGX-A100 640GB server. Specifically, when the systems were tested on training the computer vision model ResNet-50, Graphcore’s unit finished almost a minute faster: the IPU-POD16 took 28.3 minutes to train the model, while the DGX-A100 took 29.1 minutes.
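To put the margin in perspective, the gap can be worked out from the two times quoted above. The following is a minimal sketch, not part of Graphcore's or MLPerf's tooling; the function name `margin` and the constant names are made up for illustration.

```python
from typing import Tuple

# ResNet-50 time-to-train figures (minutes), as quoted in the article.
IPU_POD16_MINUTES = 28.3
DGX_A100_MINUTES = 29.1


def margin(faster: float, slower: float) -> Tuple[float, float]:
    """Return (absolute gap in minutes, relative speedup in percent).

    Hypothetical helper for illustration only.
    """
    gap = slower - faster
    speedup_pct = (slower / faster - 1.0) * 100.0
    return gap, speedup_pct


gap_minutes, speedup_pct = margin(IPU_POD16_MINUTES, DGX_A100_MINUTES)
print(f"Gap: {gap_minutes:.1f} min ({gap_minutes * 60:.0f} s)")
print(f"Relative speedup: {speedup_pct:.1f}%")
```

By this arithmetic, the difference is 0.8 minutes, or about 48 seconds, which works out to a speedup of just under 3%.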

Significant time-to-train improvement

The IPU-POD16 result, Graphcore said, represents a 24% improvement over its previous MLPerf results and can be attributed directly to software optimization. For the IPU-POD64, the performance gain was 41%, with the system training ResNet-50 in 8.50 minutes. Meanwhile, the IPU-POD128 and IPU-POD256, Graphcore’s flagship scale-up systems, took 5.67 minutes and 3.79 minutes, respectively, to train ResNet-50.
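The ResNet-50 times quoted for the four systems also give a rough picture of how training time falls as the system grows. The sketch below is a back-of-the-envelope calculation rather than an official MLPerf metric. It assumes the number in each product name is the relative system size and takes the IPU-POD16's 28.3 minutes as the baseline; the names `RESNET50_MINUTES` and `scaling_table` are hypothetical.

```python
# ResNet-50 time-to-train (minutes) per system, as quoted in the article.
# Keys are the sizes implied by the product names (an assumption).
RESNET50_MINUTES = {16: 28.3, 64: 8.50, 128: 5.67, 256: 3.79}


def scaling_table(times, baseline=16):
    """Return (size, speedup vs. baseline, fraction of ideal linear speedup)."""
    base_time = times[baseline]
    rows = []
    for size in sorted(times):
        speedup = base_time / times[size]   # how many times faster than baseline
        ideal = size / baseline             # speedup if scaling were perfectly linear
        rows.append((size, speedup, speedup / ideal))
    return rows


rows = scaling_table(RESNET50_MINUTES)
for size, speedup, efficiency in rows:
    print(f"IPU-POD{size}: {speedup:.2f}x faster, {efficiency:.0%} of linear scaling")
```

On these figures, the IPU-POD64 is about 3.3 times faster than the IPU-POD16 with four times the hardware, and the IPU-POD256 is about 7.5 times faster with 16 times the hardware. Time-to-train includes fixed costs that do not shrink as the system grows, so a sub-linear result at this scale is expected.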

The MLPerf benchmark is maintained by the MLCommons Association, a consortium backed by Alibaba, Facebook AI, Google, Intel, Nvidia, and others that acts as an independent steward.

The results also detailed the Graphcore systems’ ability to handle natural language processing (NLP) workloads. In the test on the NLP model BERT, the IPU-POD16’s time-to-train was 26.05 minutes in MLPerf’s open category, which allows flexibility in model implementation, while the IPU-POD64 and IPU-POD128 took 8.25 and 5.88 minutes, respectively.

Compared with the previous MLPerf benchmarks, however, the performance gains on BERT were smaller than those seen with ResNet-50.

Graphcore also tested its systems on other workloads to demonstrate how they would handle the newer models that customers are exploring beyond ResNet and BERT. Part of this was an experiment with EfficientNet B4, a computer vision model that trained in 1.8 hours on the company’s IPU-POD256. On the IPU-POD16, the same model trained in 20.7 hours, more than three times faster than on Nvidia’s DGX-A100.

The development positions Graphcore as a major rival to Nvidia, which is already shipping machines to accelerate AI workloads and has a large footprint in the segment. Other players in the space include Google and Cerebras Systems. Google’s systems have also outperformed Nvidia’s servers in MLPerf tests, although those were preview machines that were not readily available in the market.

Graphcore has raised over $700 million so far and was valued at $2.77 billion following its latest fundraising.

Originally appeared on: TheSpuzz