
ChatGPT Hallucinations Are Getting Worse—and Even OpenAI Doesn’t Know Why

  • September 8, 2025 (Updated)

Key Takeaways

• OpenAI’s internal tests show o4-mini hallucinates nearly 80% of the time on certain factual tasks.

• Newer AI models, designed for more complex reasoning, are producing more false information than older versions.

• OpenAI has not identified the cause of increased hallucination rates in its most advanced models.

• The issue raises new concerns about the trustworthiness and utility of AI tools in professional and everyday use.


OpenAI’s most advanced artificial intelligence models, meant to usher in a new era of human-like reasoning, are also the most prone to generating false information.

According to OpenAI’s internal tests, hallucination rates in newer reasoning models like o3 and o4-mini have increased significantly compared to earlier systems.

This troubling trend is emerging just as these models are being adopted more widely in education, customer support, research, and even coding.

While OpenAI’s goal was to improve reasoning and reduce errors, the results suggest the opposite is happening.


Hallucination Rates Are Climbing, Not Falling

OpenAI tested its models using two benchmarks: PersonQA, which involves answering questions about public figures, and SimpleQA, a general knowledge test. The results raise red flags about factual accuracy.


• o3 hallucinated 33% of the time on PersonQA; o4-mini reached 48%
• On SimpleQA, o3 hallucinated 51% of the time, while o4-mini scored 79%
• The earlier o1 model, by comparison, hallucinated 44% on SimpleQA

The term “LLM hallucination” refers to the generation of plausible-sounding but factually incorrect or entirely fabricated information. This makes it difficult for users to know when to trust the system’s output—especially when it delivers responses with confidence.
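Benchmarks like PersonQA and SimpleQA boil down to grading model answers against known references and reporting the fraction that fail. The sketch below shows that scoring loop in miniature; the tiny dataset and exact-match grader are illustrative stand-ins, not OpenAI's actual evaluation data or grading logic.

```python
# Sketch of how a hallucination rate is computed on a QA benchmark.
# The dataset and exact-match grader are hypothetical stand-ins for
# OpenAI's PersonQA/SimpleQA data and grading logic.

def is_hallucination(model_answer: str, reference: str) -> bool:
    """Count an answer as hallucinated if it doesn't match the reference.
    Real benchmark graders are more forgiving (aliases, paraphrases)."""
    return model_answer.strip().lower() != reference.strip().lower()

def hallucination_rate(results: list[tuple[str, str]]) -> float:
    """results: list of (model_answer, reference_answer) pairs."""
    misses = sum(is_hallucination(ans, ref) for ans, ref in results)
    return misses / len(results)

# Hypothetical graded outputs: 2 of 4 answers are wrong -> 50% rate.
graded = [
    ("Paris", "Paris"),
    ("1969", "1969"),
    ("Jupiter", "Saturn"),            # fabricated fact
    ("Marie Curie", "Lise Meitner"),  # confident but wrong
]
print(f"hallucination rate: {hallucination_rate(graded):.0%}")  # 50%
```

Note that a rate like this says nothing about *which* answers are wrong, which is exactly why confident-sounding output is hard for users to filter.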


Designed to Reason—But Struggling with Reality

These newer ChatGPT AI models are part of OpenAI’s initiative to build what it calls “reasoning systems.” Unlike earlier models that focused on statistical pattern recognition, reasoning models attempt to solve problems by breaking them into logical steps.


“Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem.”
— OpenAI, on o1

While that approach should, in theory, lead to more reliable outcomes, the reality appears far more complex. The increase in hallucination rates suggests that more powerful reasoning may be entangled with a higher risk of error.
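The “chain of thought” OpenAI describes amounts to eliciting intermediate steps before a final answer. A minimal sketch of that prompting pattern, as it appears in the research literature; the wording is a common illustrative template, not OpenAI's internal prompt:

```python
# Illustrative chain-of-thought prompt template. This is a generic
# pattern from the prompting literature, not OpenAI's internal prompt
# or the mechanism inside o1/o3 themselves.

def chain_of_thought_prompt(question: str) -> str:
    """Wrap a question so the model reasons step by step before answering."""
    return (
        "Answer the question below. First reason step by step, "
        "then give the final answer on its own line prefixed 'Answer:'.\n\n"
        f"Question: {question}"
    )

prompt = chain_of_thought_prompt("In which year did Apollo 11 land on the Moon?")
print(prompt)
```

Each intermediate step is itself generated text, so an early factual slip can propagate through the chain, which is one hedged reading of why more reasoning could coincide with more hallucination.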


OpenAI Responds to the Findings

OpenAI has acknowledged the hallucination issue but cautions against drawing a direct link between model complexity and factual inaccuracy. The company emphasizes that the problem is being actively investigated.


“Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini.”
— Gaby Raila, OpenAI

OpenAI has not yet provided a technical explanation for why the hallucination problem is worsening. The company continues to refine its training processes and benchmarks in hopes of improving output quality.


Why It Matters

The hallucination issue affects both user trust and the practical utility of AI tools in professional settings. If users must double-check everything the model says, the time-saving benefit of using AI tools becomes negligible.


• Misleading or incorrect information can harm decisions in education, law, health, and business
• Increased hallucination rates undermine the model’s reliability as a research or writing assistant
• Without significant improvements, adoption of AI for high-stakes tasks may stall or backfire

The current environment calls for caution. While large language models continue to evolve in capability, their reliability remains inconsistent—and unpredictably so.

OpenAI’s most capable AI models are also its least accurate when it comes to factual consistency. This paradox exposes a key challenge for the future of generative AI: balancing intelligence with trust.


“Hallucinations aren’t just random bugs—they’re built right into the way these models work.”

As the technology becomes more embedded in everyday tools and workflows, solving the hallucination problem isn’t optional; it’s essential. For now, users must remain skeptical and verify outputs carefully, regardless of how intelligent the answers may seem.
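In practice, “verify outputs” can mean refusing to accept a model’s answer until it is cross-checked against an independent source. A minimal sketch of that habit in code, assuming you maintain your own lookup of verified facts (the `trusted_facts` dictionary here is a hypothetical stand-in for a real knowledge base or search step):

```python
# Minimal "trust but verify" wrapper: only accept a model's answer when
# it can be confirmed against an independent, trusted source. The
# trusted_facts dict is a hypothetical stand-in for a real knowledge base.

trusted_facts = {
    "capital of France": "Paris",
    "first Moon landing year": "1969",
}

def verified_answer(question: str, model_answer: str) -> str:
    """Accept, reject, or flag a model answer based on a trusted lookup."""
    reference = trusted_facts.get(question)
    if reference is None:
        return f"UNVERIFIED: {model_answer} (check manually)"
    if model_answer.strip().lower() == reference.lower():
        return model_answer
    return f"REJECTED: model said {model_answer!r}, source says {reference!r}"

print(verified_answer("capital of France", "Paris"))  # accepted as-is
print(verified_answer("capital of France", "Lyon"))   # rejected with the source value
```

The key design choice is the default: anything the source cannot confirm is flagged rather than passed through, which trades convenience for the trust the article argues is currently missing.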




Khurram Hanif

Reporter, AI News


