
With better reasoning ability comes even more of the wrong kind of robot dreams.
Remember when we reported a month or so ago that Anthropic had discovered that what’s happening inside AI models is very different from how the models themselves describe their “thought” processes? Well, to that mystery surrounding the latest large language models (LLMs), along with countless others, you can now add ever-worsening hallucination. And that’s according to the testing of the leading name in chatbots, OpenAI.
The New York Times reports that an OpenAI investigation into its latest GPT o3 and GPT o4-mini LLMs found they are substantially more prone to hallucinating, or making up false information, than the previous GPT o1 model.
“The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent,” the Times says.
“When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time.”
OpenAI has said that more research is required to understand why the latest models are more prone to hallucination. But according to some industry observers, so-called “reasoning” models are the prime suspect.
“The newest and most powerful technologies — so-called reasoning systems from companies like OpenAI, Google and the Chinese start-up DeepSeek — are generating more errors, not fewer,” the Times claims.
In simple terms, reasoning models are a type of LLM designed to perform complex tasks. Instead of merely spitting out text based on statistical probabilities, reasoning models break questions or tasks down into individual steps, akin to a human thought process.
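To make that distinction a little more concrete, here’s a minimal sketch using the OpenAI Python SDK. It’s an illustration of the idea rather than how o3 works under the hood: the same question is asked twice, once as a plain one-shot prompt and once with an explicit request to work through intermediate steps, which is roughly the kind of step-by-step decomposition reasoning models perform internally. The model names here are just stand-ins.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "A train leaves at 14:35 and arrives at 17:10. How long is the journey?"

# Plain prompt: the model answers in a single pass.
direct = client.chat.completions.create(
    model="gpt-4o-mini",  # stand-in for a conventional, non-reasoning model
    messages=[{"role": "user", "content": question}],
)

# Step-by-step prompt: we ask the model to lay out intermediate steps before
# answering, a rough approximation of what reasoning models like o3 do
# internally (and at far greater length) before responding.
step_by_step = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": question + " Work through it step by step, then state the answer.",
    }],
)

print(direct.choices[0].message.content)
print(step_by_step.choices[0].message.content)
```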
OpenAI’s first reasoning model, o1, came out last year and was claimed to match the performance of PhD students in physics, chemistry, and biology, and to beat them in math and coding, thanks to the use of reinforcement learning techniques.
“Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem,” OpenAI said when o1 was released.
However, OpenAI has pushed back against the narrative that reasoning models suffer from increased rates of hallucination. “Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini,” OpenAI’s Gaby Raila told the Times.
Whatever the truth, one thing is for sure. AI models need to largely cut out the nonsense and lies if they are to be anywhere near as useful as their proponents currently envisage. As it stands, it’s hard to trust the output of any LLM. Pretty much everything has to be carefully double-checked.
That’s fine for some tasks. But where the main benefit is saving time or labour, the need to meticulously proof and fact-check AI output does rather defeat the object of using it. It remains to be seen whether OpenAI and the rest of the LLM industry can get a handle on all those unwanted robot dreams.