Nvidia's Delayed Blackwell AI Chips Are Overheating in Servers

Nvidia is trying to redesign its server racks for the Blackwell chips, which may cause further delays.

Nvidia's upcoming Blackwell GPUs for AI computing may face further delays because they're prone to overheating when connected to each other on server racks, The Information reports.

The issue has reportedly been traced to the server rack Nvidia designed for Blackwell—which can connect up to 72 GPUs at a time. Nvidia has repeatedly redesigned the racks, which could delay GPU server shipments and the opening of new data centers for Google, Microsoft, or Meta.

In August, a previous report suggested that a "design flaw" had caused the Blackwell GPUs' launch to be delayed by months. It's unclear whether this flaw is the server rack design issue. Nvidia announced Blackwell in March and initially said the GPUs could ship as soon as Q2 2024 before it encountered challenges.

Nvidia indirectly addressed the server rack problem in a statement to Reuters. "Nvidia is working with leading cloud service providers as an integral part of our engineering team and process. The engineering iterations are normal and expected," a company spokesperson said, suggesting a new server design could be on the horizon.

Overheating is a leading cause of performance problems for GPUs, which draw a lot of power. Crypto mining, like AI, also consumes large amounts of energy, produces a lot of heat, and relies on large numbers of interconnected GPUs or mining rigs. Some crypto miners use immersion cooling, in which the rigs are submerged in liquid, to prevent overheating.

And the more powerful a GPU, the more heat it can produce. While technological advances sometimes improve energy efficiency, the gains typically aren't enough to offset the overall increase in energy needs. The Blackwell AI chips can be 30 times faster than previous GPUs, according to Nvidia.
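
To see why efficiency gains may not offset rising demand, here is a rough back-of-the-envelope sketch. It assumes power draw is simply work done per second divided by work done per joule. The function name and every figure are hypothetical, chosen only to make the arithmetic easy to follow; none are Nvidia specifications.

```python
def power_draw_watts(ops_per_second: float, ops_per_joule: float) -> float:
    """Power (watts) = work rate (ops/s) divided by efficiency (ops/J)."""
    return ops_per_second / ops_per_joule


# Hypothetical older chip: illustrative numbers only, not real specs.
old_power = power_draw_watts(ops_per_second=1.0e15, ops_per_joule=2.0e12)

# Hypothetical newer chip: twice as efficient, but doing five times the work.
new_power = power_draw_watts(ops_per_second=5.0e15, ops_per_joule=4.0e12)

# Efficiency doubled, yet total power (and so heat to remove) still went up.
power_ratio = new_power / old_power

print(f"old: {old_power:.0f} W, new: {new_power:.0f} W, ratio: {power_ratio:.1f}x")
```

In this made-up case, doubling efficiency while raising throughput fivefold still leaves the newer chip drawing 2.5 times the power, nearly all of which ends up as heat the rack has to remove.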

Training and running generative AI models at scale requires a lot of energy, too, as well as water to cool these systems. This has led some experts to predict that AI data centers may face power shortages as soon as next year. AI firms aren't able to add new power sources to the grid as quickly as they can add data centers, and they aren't necessarily willing to wait, either.

Meta, Microsoft, and Google have recently turned to nuclear power to meet their rising energy needs. However, "power purchase agreements" don't necessarily solve AI's energy problems.

Nvidia has seen its stock soar over 180% in the past year due to the AI surge and resulting spike in chip demand, while rival AMD recently began mass layoffs.
