Key Takeaways
• OpenAI’s internal tests show its o4-mini model hallucinating nearly 80% of the time on certain factual tasks.
• Newer AI models, designed for more complex reasoning, are producing more false information than older versions.
• OpenAI has not identified the cause of increased hallucination rates in its most advanced models.
• The issue raises new concerns about the trustworthiness and utility of AI tools in professional and everyday use.
OpenAI’s most advanced artificial intelligence models, meant to usher in a new era of human-like reasoning, are also the most prone to generating false information.
According to OpenAI’s internal tests, hallucination rates in its newer reasoning models, o3 and o4-mini, have increased significantly compared to earlier systems.
This troubling trend is emerging just as these models are being adopted more widely in education, customer support, research, and even coding.
While OpenAI’s goal was to improve reasoning and reduce errors, the results suggest the opposite is happening.
Hallucination Rates Are Climbing, Not Falling
OpenAI tested its models using two benchmarks: PersonQA, which involves answering questions about public figures, and SimpleQA, a general knowledge test. The results raise red flags about factual accuracy.
• o3 hallucinated 33% of the time on PersonQA; o4-mini reached 48%
• On SimpleQA, o3 hallucinated 51% of the time, while o4-mini scored 79%
• The earlier o1 model, by comparison, hallucinated 44% on SimpleQA
The term “hallucination” in AI refers to the generation of plausible-sounding but factually incorrect or entirely fabricated information. This makes it difficult for users to know when to trust the system’s output—especially when it delivers responses with confidence.
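To make the benchmark percentages above concrete, here is a minimal sketch of how a hallucination rate could be tallied: grade each benchmark answer against a reference and report the share of incorrect responses. The questions, answers, and grading below are illustrative assumptions, not OpenAI’s evaluation code or data.

graded_answers = [  # hypothetical graded benchmark items, not real PersonQA/SimpleQA data
    {"question": "Where was Marie Curie born?", "model_answer": "Warsaw", "correct": True},
    {"question": "What year did Apollo 11 land on the Moon?", "model_answer": "1968", "correct": False},  # hallucination: the landing was in 1969
    {"question": "Who wrote the novel Dune?", "model_answer": "Frank Herbert", "correct": True},
]

# Hallucination rate = share of answers graded as incorrect
hallucination_rate = sum(not item["correct"] for item in graded_answers) / len(graded_answers)
print(f"Hallucination rate: {hallucination_rate:.0%}")  # prints "Hallucination rate: 33%" for this toy set

The published PersonQA and SimpleQA percentages are, in essence, this kind of ratio computed over far larger question sets.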
Designed to Reason—But Struggling with Reality
These newer ChatGPT models are part of OpenAI’s initiative to build what it calls “reasoning systems.” Unlike earlier models that focused on statistical pattern recognition, reasoning models attempt to solve problems by breaking them into logical steps.
“Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem.”
— OpenAI, on o1
While that approach should, in theory, lead to more reliable outcomes, the reality appears far more complex. The increase in hallucination rates suggests that more powerful reasoning may be entangled with a higher risk of error.
OpenAI Responds to the Findings
OpenAI has acknowledged the hallucination issue but cautions against drawing a direct link between model complexity and factual inaccuracy. The company emphasizes that the problem is being actively investigated.
“Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini.”
— Gaby Raila, OpenAI
OpenAI has not yet provided a technical explanation for why the hallucination problem is worsening. The company continues to refine its training processes and benchmarks in hopes of improving output quality.
Why It Matters
The hallucination issue affects both user trust and the practical utility of AI tools in professional settings. If users must double-check everything the model says, the time-saving benefit of using AI tools becomes negligible.
• Misleading or incorrect information can harm decisions in education, law, health, and business
• Increased hallucination rates undermine the model’s reliability as a research or writing assistant
• Without significant improvements, adoption of AI for high-stakes tasks may stall or backfire
The current environment calls for caution. While large language models continue to evolve in capability, their reliability remains inconsistent—and unpredictably so.
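In practice, that caution can be as simple as checking a model’s factual claims against a trusted source before acting on them. Below is a minimal sketch of such a check using the openai Python client; the question, reference table, and model name are illustrative assumptions rather than an OpenAI-endorsed workflow.

from openai import OpenAI

# Small trusted reference used to spot-check the model's factual answer (illustrative only)
TRUSTED_FACTS = {
    "What year was the Hubble Space Telescope launched?": "1990",
}

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def verified_answer(question: str) -> tuple[str, bool]:
    # Ask the model, then compare its reply against the trusted reference
    reply = client.chat.completions.create(
        model="o4-mini",  # model named in the article; availability is an assumption
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content or ""
    expected = TRUSTED_FACTS.get(question)
    verified = expected is not None and expected in reply
    return reply, verified

answer, ok = verified_answer("What year was the Hubble Space Telescope launched?")
print(f"Model answer: {answer!r} (matches trusted reference: {ok})")

The check only catches errors on questions you already have references for, which is exactly the point the article makes: without independent verification, there is no way to know whether a confident answer is accurate.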
OpenAI’s most capable AI models are also its least accurate when it comes to factual consistency. This paradox exposes a key challenge for the future of generative AI: balancing intelligence with trust.
Hallucinations are not just random bugs; they are built into the way these models work.
As the technology becomes more embedded in everyday tools and workflows, solving the hallucination problem isn’t optional; it’s essential. For now, users must remain skeptical and verify outputs carefully, regardless of how intelligent the answers may seem.