Published on November 9, 2023, 7:08 am
Nvidia Maintains Lead Over Intel in MLPerf 3.1 Benchmark; Announces New Supercomputer
Nvidia has kept its lead over Intel in the latest round of the MLPerf training benchmark, version 3.1. The newly released results show that Nvidia's H100 GPU remains at the forefront of AI training in both performance and versatility.
However, Intel's Gaudi 2 AI chip made a significant leap in performance over the previous round. It outperformed Nvidia's A100 and came considerably closer to the H100, particularly when training large language models. This has led analysts to speculate that Intel could catch up with Nvidia in some areas in 2024, when Gaudi 3 is expected to launch.
Nvidia also showcased its ability to build large, highly scalable systems. Its new Eos AI supercomputer, which combines 10,752 H100 GPUs with Nvidia's Quantum-2 InfiniBand networking, trained a 175-billion-parameter GPT-3 model on 1 billion tokens in just 3.9 minutes. That broke the record Nvidia itself had set less than six months earlier with just under 3,600 H100 GPUs.
The results also show that Nvidia's systems scale well without losing much efficiency. Tripling the number of GPUs yielded a 2.8x speedup, a scaling efficiency of 93%. This is a significant improvement over last year, owed in part to software optimizations.
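The 93% figure is simply the achieved speedup divided by the ideal, linear speedup that tripling the GPUs would deliver. The minimal Python sketch below reproduces that arithmetic from the article's rounded numbers; the function name `scaling_efficiency` is our own illustration, not part of any MLPerf tooling.

```python
def scaling_efficiency(scale_factor: float, speedup: float) -> float:
    """Fraction of the ideal (linear) speedup actually achieved.

    scale_factor: how many times more GPUs were used (e.g. 3.0)
    speedup: how many times faster the run finished (e.g. 2.8)
    """
    return speedup / scale_factor


# Figures quoted in the article: three times the GPUs, 2.8x faster.
eff = scaling_efficiency(3.0, 2.8)
print(f"{eff:.0%}")  # prints "93%"
```

A perfectly linear system would score 100%; communication overhead between GPUs is what typically pulls the number below that.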
Microsoft also took part in the benchmark, submitting results on Azure ND H100 v5 virtual machines for a system with 10,752 H100 GPUs. It completed GPT-3 training in about four minutes.
According to Nvidia's projections, Eos would need only eight days for a full training run of a modern GPT-3 model with 175 billion parameters on an optimal amount of data, 3.7 trillion tokens. Such a run would produce a model comparable to GPT-3.5, the original model behind ChatGPT.
It remains unclear how much data OpenAI used to train GPT-3.5. GPT-3 is known to have been trained on 300 billion tokens drawn from a dataset of roughly 500 billion, while rumors suggest GPT-4 may have been trained on nearly 13 trillion tokens. GPT-3.5 likely falls somewhere in between, although with GPT-3.5-turbo OpenAI appears to have opted for a smaller model.
For the first time, Stable Diffusion training was included in the MLPerf benchmark. It took 2.5 minutes on 1,024 Nvidia H100 GPUs and about ten minutes on 64 H100s. In other words, 16 times as many GPUs delivered only about a fourfold speedup, which suggests that diffusion model training scales less efficiently than large language model training. Intel's Gaudi 2 completed the same task in just under twenty minutes with 64 accelerators, roughly half the speed of the same number of H100s.
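The same efficiency arithmetic can be applied to the Stable Diffusion times above: about ten minutes on 64 H100s and 2.5 minutes on 1,024. This is a minimal Python sketch; the function name is our own, and because the 64-GPU time is approximate, the result is only a rough indication.

```python
def scaling_efficiency(gpus_small: int, minutes_small: float,
                       gpus_large: int, minutes_large: float) -> float:
    """Achieved speedup divided by the ideal (linear) speedup."""
    ideal_speedup = gpus_large / gpus_small         # 1024 / 64 = 16x
    actual_speedup = minutes_small / minutes_large  # 10 / 2.5 = 4x
    return actual_speedup / ideal_speedup


# Stable Diffusion figures quoted in the article (64-GPU time is approximate).
eff = scaling_efficiency(64, 10.0, 1024, 2.5)
print(f"16x the GPUs, {10.0 / 2.5:.0f}x faster -> {eff:.0%} efficiency")
# prints "16x the GPUs, 4x faster -> 25% efficiency"
```

Compared with the 93% measured for GPT-3 training, roughly 25% makes the gap in scaling behavior concrete.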
The MLPerf benchmarks receive support from prominent organizations such as Amazon, Arm, Baidu, Google, Harvard University, Hewlett Packard Enterprise (HPE), Intel, Lenovo, Meta (formerly Facebook), Microsoft, Nvidia, Stanford University, and the University of Toronto. These transparent and objective tests provide users with reliable results for making informed purchasing decisions.
In conclusion, Nvidia's H100 GPU remains the performance leader in the latest MLPerf benchmark, while Intel's progress with Gaudi 2 points to growing competition in the coming years. Nvidia's Eos supercomputer, meanwhile, shows how efficiently large language model training can now be scaled across thousands of GPUs, paving the way for further advances in AI.