When it comes to training large language models, the usual refrain is that the problem is a GPU shortage. Nvidia’s dominant chips, of course, are the ones all the various AI outfits beat each other up to acquire.
But everyone’s favourite billionaire and tech prognosticator sees a different problem: we might not have enough power. Elon Musk says Grok 3, the next generation of the AI model from his startup xAI, is going to need around 100,000 of Nvidia’s H100 GPUs to train.
For sure, getting hold of 100,000 H100s isn’t going to be easy. Or cheap. But here’s the thing. Each H100 eats up a peak of 700W of power. So, that’s a peak of 70 megawatts for 100,000 of the things. OK, you’re probably not going to have all 100,000 running at 100% load at the same time. But then there’s more to an AI setup than just the GPUs. There’s all kinds of supporting hardware and infrastructure involved.
So, with 100,000 H100s, you’re looking at in excess of 100 megawatts or about the same as a small city. Or for another data point, in 2022 the whole of Paris had 500 megawatts worth of data centres.
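For anyone who wants to check the sums, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above; the variable and function names are made up for illustration, and the "implied overhead" is simply what you get by dividing the 100 megawatt estimate by the GPU-only peak, not a measured figure for any real data centre.

```python
# Figures quoted in the article; everything else here is illustrative.
H100_PEAK_WATTS = 700        # peak draw per H100
GPU_COUNT = 100_000          # Musk's estimate for training Grok 3
TOTAL_ESTIMATE_MW = 100      # "in excess of 100 megawatts", treated as a floor
PARIS_DATA_CENTRE_MW = 500   # Paris data centre capacity in 2022


def gpu_peak_megawatts(gpu_count, watts_per_gpu):
    """Peak power of the GPUs alone, converted from watts to megawatts."""
    return gpu_count * watts_per_gpu / 1_000_000


gpu_mw = gpu_peak_megawatts(GPU_COUNT, H100_PEAK_WATTS)

# Assumption: the gap between the GPU-only peak and the 100 MW estimate
# is supporting hardware and infrastructure (cooling, networking and so on).
implied_overhead = TOTAL_ESTIMATE_MW / gpu_mw

share_of_paris = TOTAL_ESTIMATE_MW / PARIS_DATA_CENTRE_MW

print(f"GPU-only peak: {gpu_mw:.0f} MW")
print(f"Implied overhead multiplier: {implied_overhead:.2f}x")
print(f"Share of Paris's 2022 data centre capacity: {share_of_paris:.0%}")
```

In other words, the GPUs alone account for 70 of those megawatts, and a single training cluster would be roughly a fifth of the data centre capacity Paris had in 2022.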
So, yeah, 100 megawatts for just one LLM is a bit of a problem. And so in an interview with Nicolai Tangen, CEO of Norway’s wealth fund, on X Spaces (via Reuters), Musk stressed that while the availability of GPUs has been and will continue to be a major constraint on the development of AI models, access to sufficient electricity will increasingly become a limiting factor.
Oh, and Musk also predicted that AGI or artificial general intelligence will outpace human intelligence within two years. “If you define AGI (artificial general intelligence) as smarter than the smartest human, I think it’s probably next year, within two years,” Musk said.
But then he also predicted back in 2017 that self-driving cars reliable enough that you could “go to sleep” in them were two years away. Still waiting on that one. And he predicted on March 19th 2020 that the US would have “close to zero new cases” of Covid-19 by the end of April. Whoops!
Anyway, Musk’s slightly patchy techno-prognostications are not exactly news. But he probably has a pretty solid idea of how many GPUs his next-gen LLM is going to need to get trained up. So that city-sized power budget thing is likely for real, and a bit of a concern.
Moreover, Grok 2, xAI’s current model, apparently only needed 20,000 H100s. So that’s a five-fold increase in GPUs from one AI model to the next. That’s the kind of scaling that doesn’t feel terribly sustainable, whether it’s GPU count or power consumption.
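To put a number on that jump, here is a minimal Python sketch comparing the two generations. It assumes the 700W peak figure applies to every H100 in both clusters and counts GPU power only, so treat the output as a rough lower bound rather than a real power bill. The names are purely illustrative.

```python
# GPU counts and peak wattage as quoted in the article.
GROK_2_GPUS = 20_000
GROK_3_GPUS = 100_000
H100_PEAK_WATTS = 700

# Generation-on-generation growth in GPU count.
scale = GROK_3_GPUS / GROK_2_GPUS

# GPU-only peak power in megawatts (assumption: no supporting hardware counted).
grok_2_peak_mw = GROK_2_GPUS * H100_PEAK_WATTS / 1_000_000
grok_3_peak_mw = GROK_3_GPUS * H100_PEAK_WATTS / 1_000_000

print(f"GPU count grows {scale:.0f}x")
print(f"GPU-only peak power: {grok_2_peak_mw:.0f} MW -> {grok_3_peak_mw:.0f} MW")
```

Because both clusters use the same chip, peak GPU power scales in lockstep with GPU count, from 14 megawatts to 70.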