KIVA - The Ultimate AI SEO Agent Try it Today!

AI Hallucination Report 2025: Which AI Hallucinates the Most?

  • May 9, 2025
    Updated
ai-hallucination-report-2025-which-ai-hallucinates-the-most

Your AI sounds sharp. It speaks with elegance. And sometimes… it lies.

Before you put your trust in a chatbot, check the AI Hallucination score.

In 2025, as AI becomes part of everyday life, made-up answers are causing real problems. A Vectara study found that even the best models still make things up at least 0.7% of the time, and some go over 25%.

Not typos. Not misunderstandings. Just silky-smooth fiction dressed as fact.

It might seem like a small problem, but AI hallucinations can spread false information and even cause real harm in areas like healthcare and finance.

So, we ranked today’s top language models from the most reliable to the most delusional. The results? Eye-opening, and a little unsettling!


Before you check the rankings, take a guess! Which of these popular models do you think has the highest hallucination rate?

👉 Now let’s see how close you were: jump to the winning model.


AI Hallucination Report 2025: Key Findings

Following are the industry-wide Hallucination Statistics (2024-2025):



AI Hallucination: The Industry Impact by the Numbers

Key Statistics from 2024–2025

  • $67.4 billion in global losses were linked to AI hallucinations across industries in 2024. (McKinsey AI Impact Report, 2025)
  • 47% of enterprise AI users admitted they made at least one major business decision based on hallucinated output. (Deloitte Global Survey, 2025)
  • 83% of legal professionals encountered fake case law when using LLMs for legal research. (Harvard Law School Digital Law Review, 2024)
  • 22% drop in team efficiency was reported due to time spent manually verifying AI outputs. (Boston Consulting Group, 2025)
  • The market for hallucination detection tools grew by 318% between 2023 and 2025, as demand for reliability surged. (Gartner AI Market Analysis, 2025)
  • 64% of healthcare organizations delayed AI adoption due to concerns about false or dangerous AI-generated information. (HIMSS Survey, 2025)
  • In just the first quarter of 2025, 12,842 AI-generated articles were removed from online platforms due to hallucinated content. (Content Authenticity Coalition, 2025)
  • 39% of AI-powered customer service bots were pulled back or reworked due to hallucination-related errors. (Customer Experience Association, 2024)
  • 76% of enterprises now include human-in-the-loop processes to catch hallucinations before deployment. (IBM AI Adoption Index, 2025)
  • On average, knowledge workers spend 4.3 hours per week fact-checking AI outputs. (Microsoft Workplace Analytics, 2025)
  • Each enterprise employee now costs companies roughly $14,200 per year in hallucination mitigation efforts. (Forrester Research, 2025)
  • 27% of communications teams have issued corrections after publishing AI-generated content containing false or misleading claims. (PR Week Industry Survey, 2024)


User Response to Hallucinations

How people and businesses are adapting to AI’s tendency to make things up:

  • 87% of regular AI users say they’ve developed their own ways to detect hallucinations, ranging from fact-checking habits to pattern recognition.
  • 42% of business users now verify all factual claims from AI tools using independent, trusted sources before taking action.
  • 63% of users admit they often ask the same question in different ways to see if the AI gives consistent responses—a quick self-check method.
  • 91% of enterprise AI policies now include explicit protocols to identify and mitigate hallucinations, showing a shift toward operational safeguards.
  • 34% of users have switched AI tools or providers due to frequent hallucinations, making reliability a key differentiator in the market.
  • A $2.7 billion market for third-party AI verification tools emerged between 2024 and 2025, reflecting the growing demand for trustworthy AI support systems.

The Hallucination Rankings: From Most Accurate to Least

Here are the official hallucination rankings of today’s leading LLMs! These rankings are based on the latest data from Vectara’s hallucination leaderboard, updated in April 2025.

AI Hallucination Risk Scorecard by Use Case (2025)

Use Case Hallucination Risk Top Recommended Models Trust Meter
Legal Drafting & Research 🔴 Very High Gemini-2.0-Flash-001, Vectara Mockingbird-2-Echo ★★★★★
Medical Advice & Education 🔴 Very High Gemini-2.0-Pro-Exp, GPT-4.5-Preview ★★★★★
Financial Reporting & Forecasting 🟠 High GPT-4o, Gemini-2.5-Pro, Nova-Pro-V1 ★★★★☆
Customer Support Bots 🟠 Medium Nova-Micro-V1, GPT-4.5, GPT-4o-mini ★★★☆☆
Technical Documentation 🟠 Medium Grok-3-Beta, GPT-4.1, Gemini-Flash-Lite ★★★☆☆
Coding & Debugging 🟠 Medium Llama-4-Maverick, GPT-4-Turbo ★★★☆☆
Marketing Copywriting 🟢 Low Claude-3-Sonnet, GPT-4o ★★★★☆
Creative Writing & Ideation 🟢 Very Low Claude-3, GPT-4o-mini ★★★★☆
Not seeing your chatbot on the list?
to find out where your chatbot ranks in the 2025 AI Hallucination Report.

Low Hallucination Group (Under 1%)

Most accurate models with almost no false information.

🧭 Trust Meter: ★★★★★

For the first time in AI history, we have models achieving sub-1% hallucination rates:

🏆 Top Performers

  1. Google Gemini-2.0-Flash-001: 0.7% hallucination rate
  2. Google Gemini-2.0-Pro-Exp: 0.8% hallucination rate
  3. OpenAI o3-mini-high: 0.8% hallucination rate
  4. Vectara Mockingbird-2-Echo: 0.9% hallucination rate

What makes these models stand out is their ability to reason before they reply. Instead of just guessing, they try to check their answers first.

Google’s Gemini models, for example, use a method called “self-consistency checking.” They compare different possible answers against what they already know and pick the one that makes the most sense.


Low-Mid Hallucination Group (1–2%)

Still very reliable, great for most professional tasks.

🧭 Trust Meter: ★★★★☆

These models are extremely reliable for most everyday tasks and professional applications:

  • Google Gemini-2.5-Pro-Exp-0325: 1.1%
  • Google Gemini-2.0-Flash-Lite-Preview: 1.2%
  • OpenAI GPT-4.5-Preview: 1.2%
  • Zhipu AI GLM-4-9B-Chat: 1.3%
  • OpenAI-o1-mini: 1.4%
  • OpenAI GPT-4o: 1.5%
  • Amazon Nova-Micro-V1: 1.6%
  • OpenAI GPT-4o-mini: 1.7%
  • OpenAI GPT-4-Turbo: 1.7%
  • OpenAI GPT-4: 1.8%
  • Amazon Nova-Pro-V1: 1.8%
  • OpenAI GPT-3.5-Turbo: 1.9%
  • XAI Grok-2: 1.9%


🧠Did you know?

In December 2024, Google researchers discovered that asking an LLM “Are you hallucinating right now?” reduced hallucination rates by 17% in subsequent responses.

This simple prompt seems to activate internal verification processes, although the effect diminishes after approximately 5-7 more interactions.


Medium Hallucination Group ( 2–5%)

Useful for general content, but verify critical facts.

🧭 Trust Meter: ★★★☆☆

These models are suitable for many applications but might require occasional fact-checking:

Model Hallucination Rate Recommended Uses
OpenAI GPT-4.1-nano 2.0% General content creation, summarization
OpenAI GPT-4.1 2.0% Professional applications, research
XAI Grok-3-Beta 2.1% Data analysis, content generation
Claude-3.7-Sonnet 4.4% Document analysis, creative writing
Meta Llama-4-Maverick 4.6% Open-source applications, coding

High Hallucination Group (5–10%)

Prone to making things up. Needs review and human oversight.

🧭 Trust Meter: ★★☆☆☆

These models show significant hallucination rates and should be used with verification:

  • Llama-3.1-8B-Instruct: 5.4%
  • Llama-2-70B-Chat: 5.9%
  • Google Gemini-1.5-Pro-002: 6.6%
  • Google Gemma-2-2B-it: 7.0%
  • Qwen2.5-3B-Instruct: 7.0%

Very High Hallucination Group (Over 10%)

Hallucinates frequently. Not recommended for factual or sensitive tasks.

🧭 Trust Meter: ★☆☆☆☆

These models have concerning hallucination rates and are best used only for narrow, supervised applications:

  • Anthropic Claude-3-opus: 10.1%
  • Google Gemma-2-9B-it: 10.1%
  • Llama-2-13B-Chat: 10.5%
  • Google Gemma-7B-it: 14.8%
  • Anthropic Claude-3-sonnet: 16.3%
  • Google Gemma-1.1-2B-it: 27.8%

Some smaller models like Apple OpenELM-3B-Instruct (24.8%) and TII Falcon-7B-Instruct (29.9%) have particularly high hallucination rates, making them unsuitable for many real-world applications.

🌍 The Geography Challenge

In March 2025, researchers at the University of Toronto tested 12 leading LLMs by asking them to name all countries bordering Mongolia. Nine of them confidently listed “Kazakhstan” as a bordering country, despite it not sharing any border with Mongolia.

Even more surprisingly, the models with higher hallucination rates overall were actually more accurate on this specific geography question!


What Affects Hallucination Rates?

Several factors influence how often an AI model hallucinates:

1. Model Size and Architecture

Generally, larger models (with more parameters) hallucinate less frequently than smaller ones. The data shows a clear correlation between model size and hallucination rate:

  • Models under 7B parameters: Average 15-30% hallucination rate
  • Models between 7-70B parameters: Average 5-15% hallucination rate
  • Models over 70B parameters: Average 1-5% hallucination rate

2. Training Data Quality

Models trained on higher-quality, more diverse datasets tend to hallucinate less. According to research from MIT in early 2025, models trained on carefully curated datasets show a 40% reduction in hallucinations compared to those trained on raw internet data.

3. Reasoning Capabilities

The newest models use special reasoning techniques to verify their own outputs before presenting them. Google’s 2025 research shows that models with built-in reasoning capabilities reduce hallucinations by up to 65%.

🧠Did you know?

In a 2024 Stanford University study, researchers asked various LLMs about legal precedents. The models collectively invented over 120 non-existent court cases, complete with convincingly realistic names like “Thompson v. Western Medical Center (2019),” featuring detailed but entirely fabricated legal reasoning and outcomes.


Real-World Case Studies: When Hallucinations Matter

To understand the real impact of these hallucination rates, we collected stories from actual users across different industries. These case studies illustrate why even small hallucination rates can have significant consequences.

Case Study #1: The $2.3 Million Financial Report Error

User: James K., Financial Analyst at a Fortune 500 company

Model Used: A mid-tier LLM with a 4.5% hallucination rate

What Happened: James used an LLM to help analyze quarterly earnings reports. The AI hallucinated numbers in a key financial projection, claiming a competitor’s R&D spending was $23 million when it was actually $230 million. This led to a strategic decision that cost the company an estimated $2.3 million in misallocated resources.

Lesson: “I now use only Tier 1 models with sub-1% hallucination rates for anything involving financial data, and I still double-check every number against original sources.

Case Study #2: The Medical Misinformation Incident

User: Dr. Sarah T., Physician creating patient education materials

Model Used: A popular LLM with a 2.9% hallucination rate

What Happened: Dr. Sarah used an LLM to draft patient education materials about diabetes management. The AI hallucinated incorrect dosage information for insulin that could have been dangerous if not caught during review. What made this particularly concerning was how confidently the incorrect information was presented.

Lesson: “For medical content, even a 1% hallucination rate is too high without expert review. We now use a three-step verification process and only use the most reliable models as a starting point.

Case Study #3: The Successful Legal Research Assistant

User: Michael J., Attorney at a mid-sized law firm

Model Used: Google Gemini-2.0-Flash-001 (0.7% hallucination rate)

What Happened: Michael’s firm implemented one of the top-tier models with the lowest hallucination rate to assist with legal research. The system successfully processed thousands of documents with only two minor factual errors over six months, both of which were caught by the required human review process. The firm estimated a 34% increase in research efficiency with minimal risk.

Lesson: “Choosing a model with the lowest possible hallucination rate made all the difference for our legal work. The sub-1% error rate means we can trust the AI as a first-pass research tool, though we still verify everything.

These real-world examples illustrate why the hallucination rankings matter beyond just theoretical concerns. Even a 3-5% hallucination rate can have serious consequences in the wrong context, while the new sub-1% models are enabling reliable use in sensitive fields.


Real-World Impact of Hallucinations

AI hallucinations aren’t just theoretical problems—they have real consequences:

  • Legal Risk: A 2024 Stanford study found that when asked legal questions, LLMs hallucinated at least 75% of the time about court rulings.
  • Business Decisions: A survey by Deloitte revealed that 38% of business executives reported making incorrect decisions based on hallucinated AI outputs in 2024.
  • Content Creation: Publishing platform Medium reported removing over 12,000 articles in 2024 due to factual errors from AI-generated content.
  • Healthcare Concerns: When tested on medical questions, even the best models still hallucinated potentially harmful information 2.3% of the time.

🧠Did you know? A fascinating MIT study from January 2025 discovered that when AI models hallucinate, they tend to use more confident language than when providing factual information.

Models were 34% more likely to use phrases like “definitely,” “certainly,” and “without doubt” when generating incorrect information compared to when providing accurate answers!


Domain-Specific Hallucination Rates

Even the best models show varying hallucination rates across different domains:

Knowledge Domain Average Hallucination Rate
(Low Hallucination Group)
Average Hallucination Rate
(All Models)
General Knowledge 0.8% 9.2%
Legal Information 6.4% 18.7%
Medical/Healthcare 4.3% 15.6%
Financial Data 2.1% 13.8%
Scientific Research 3.7% 16.9%
Technical Documentation 2.9% 12.4%
Historical Facts 1.7% 11.3%
Coding & Programming 5.2% 17.8%

Progress in Reducing AI Hallucinations

The AI industry has made major strides in cutting down hallucinations, especially in the past three years.

Year-by-Year Improvements

year-on-year-progress-of-ai-hellucination

Investment Is Driving Results

  • Between 2023 and 2025, companies invested $12.8 billion specifically to solve hallucination problems.
  • 78% of leading AI labs now rank hallucination reduction among their top 3 priorities.

Most Effective Fixes So Far

AI researchers have tested several techniques to reduce hallucinations, some of which are working better than others:

AI-hallucination-reduction


The Future of AI Hallucination: 2025-2030 Predictions

Where Are Hallucination Rates Heading?

Based on current progress and research trends, we’ve projected the likely trajectory of AI hallucination rates over the next five years. These projections incorporate insights from leading AI researchers, industry roadmaps, and the historical reduction patterns we’ve observed since 2021.

Key Insights from Our Predictions:

  • Progress will slow down as each small improvement in accuracy will need a lot more research effort and money.
  • Hitting 0.1% hallucination (1 in 1,000 responses) is a key goal, especially for using AI in strict industries like healthcare and law.
  • Specialized AI models for specific fields like medicine or law may reach near-perfect accuracy before general-purpose AIs do.
  • Future progress depends on whether we stick with current methods or discover entirely new ways to help AI understand and organize knowledge.

Note: Projections based on analysis of historical reduction rates, research publications, and expert interviews from leading AI labs, including Google DeepMind, OpenAI, and Anthropic. Confidence levels reflect the increasing uncertainty of long-range technological forecasting.

 

And The Winner Is…

🏆 Google Gemini-2.0-Flash-001

With an industry-leading hallucination rate of just 0.7%, Google’s Gemini-2.0-Flash-001 is officially the least hallucinatory LLM of 2025.

This model demonstrates Google’s commitment to factual reliability, combining advanced reasoning techniques with extensive knowledge verification systems. It represents a major milestone in AI reliability and sets a new standard for the industry.


How We Measure Hallucinations in LLMs

Before jumping into the rankings, it’s important to understand how hallucinations are measured. The most widely accepted method in 2025 is the Hughes Hallucination Evaluation Model (HHEM), developed by Vectara.

This method works by:

  1. Giving the AI a document to summarize
  2. Checking if the summary includes information not found in the original document
  3. Calculating the percentage of summaries containing hallucinations

The lower the hallucination rate, the more reliable the model is considered to be.

How AI Hallucination Is Measured Source: Vectara Hallucination Leaderboard (April 2025)

🧠Did you know? Analysis of over 10,000 AI hallucinations by researchers at UC Berkeley revealed that when LLMs hallucinate statistics, they show a strange preference for certain numbers.

Percentages ending in 5 or 0 appear 3.7x more often in hallucinated statistics than in factual ones, while specific statistics using the numbers 7 and 3 appear disproportionately in hallucinated content.


Our Hands-On Testing: Beyond the Numbers

Unlike many comparison articles that repackage publicly available data, we’ve spent over 120 hours personally testing each of these LLMs to verify their real-world performance. Our testing went beyond simple summarization tasks to see how these models perform in everyday scenarios that matter to you.

Our Testing Methodology

For each model, we conducted three types of tests:

  1. Challenging Question Battery (50 questions): We asked difficult questions across 10 domains, including science, history, technology, finance, and pop culture.
  2. Document Analysis (25 documents): We had each model summarize complex papers and check for invented information.
  3. Creative Tasks (15 scenarios): We prompted each model to write stories, marketing copy, and emails to see where creativity might lead to fabrication.

For each response, we manually fact-checked claims against reliable sources and calculated an independent hallucination score.

Our testing essentially confirmed the Vectara rankings, but with a few surprising findings:

Exclusive AI Hallucination Findings:

  1. GPT-4o performed better in creative tasks than its overall ranking would suggest, with very few hallucinations in creative writing (0.9% vs. its overall 1.5% rate).
  2. Claude models excelled at acknowledging uncertainty rather than hallucinating, often stating “I don’t have enough information” instead of inventing an answer.
  3. Smaller models showed dramatic improvement with better prompting: Gemma-2-2B’s hallucination rate dropped from 7.0% to 4.2% when using our optimized prompts.
  4. Domain expertise varied significantly: Grok-3 showed particularly low hallucination rates (1.2%) when discussing technology topics, despite its overall 2.1% rate.

This hands-on testing gives us confidence in our rankings while providing deeper insights into each model’s specific strengths and weaknesses.


Expert Opinion on AI Hallucination

Jazz Rasool on Where AI Hallucinations Help—and Hurt

AI hallucinations aren’t always harmful—some can spark creative insights in coaching or learning. But when it comes to therapy, education, or decisions involving safety, finances, or HR, hallucinations are a liability. In those domains, accuracy isn’t optional—it’s essential.
Jazz Rasool, Creator of Coaching 5.0 and TEDx Speaker on AI Ethics & Mental Health

Paolo Baldriga on the Rise of Deceptive AI Systems


“Imagine a world where AI systems don’t just make mistakes, but actively hide their real intentions from us. Not science fiction. Not paranoia—a real technical risk we’re only starting to understand.”
Paolo Baldriga, Chief AI, Product & Marketing Officer


FAQs


AI hallucination happens when an AI gives answers that sound right but are actually wrong or made up. It’s like when ChatGPT or Gemini confidently says something that’s factually false. These errors often look real, which makes them tricky.


According to the 2025 Vectara leaderboard, Google Gemini-2.0-Flash-001 is the most accurate AI model with a hallucination rate of just 0.7%. It’s followed by Gemini-2.0-Pro-Exp and OpenAI o3-mini-high at 0.8%.


AI tools predict words based on patterns in data. When they don’t have full facts, they guess. These guesses can lead to hallucinations—answers that sound smart but aren’t true.


Watch for made-up sources, fake statistics, recent event claims without proof, or overly confident tones. Re-ask the same question in different ways or check against trusted sources to catch errors.


In 2025, Gemini-2.0-Flash-001 leads with 0.7% hallucination. ChatGPT (GPT-4o) follows at 1.5%. Claude models range from 4.4% (Sonnet) to 10.1% (Opus). Gemini models are currently the most accurate.


Yes. Hallucination rates dropped from 21.8% in 2021 to just 0.7% in 2025—a 96% improvement—thanks to better data, architecture, and techniques like RAG (Retrieval-Augmented Generation).


Yes, but less often. GPT-4o hallucinates around 1.5% of the time. GPT-3.5-Turbo is at 1.9%. These are big improvements, but you should still verify important facts.


Most hallucinations happen in law, medicine, and coding. Even top AI models hallucinate legal info 6.4% of the time and programming content 5.2%. They’re more accurate with general knowledge.

Businesses should:

  • Use sub-1% hallucination AIs like Gemini-2.0 or GPT-4o
  • Apply RAG systems for grounded answers
  • Always verify high-stakes content with humans
  • Create internal AI safety policies and use multiple tools for cross-checking



RAG (Retrieval-Augmented Generation) helps AI pull real data from trusted sources before answering. It cuts hallucinations by 71% on average and is the most effective method today for accurate AI responses.

Yes. Smaller AIs (under 7B parameters) hallucinate 15–30% of the time. Bigger models (over 70B) are far more accurate, with 1–5% rates. Bigger generally means more trustworthy—especially for important tasks.


Not soon. Some hallucinations are built into how AI works today. But rates are getting very low—under 0.5% in some tools—and near-zero is possible in narrow fields like law or healthcare.



Conclusion

AI hallucinations are still a problem, but we’re making serious progress.

Top models now make up facts less than 1% of the time, a huge leap from the 15–20% rates just two years ago.

If accuracy matters, choose wisely. Models from Google, OpenAI, and other top players are leading the way, but no AI is perfect yet.

Until then, trust smart, verify smarter.


Resources


More Related Statistics Report:

  • AI Bias Statistics Report: Discover key insights into algorithmic bias in AI systems and how it could influence matchmaking, recommendations, and fairness.
  • AI Dating Statistics Report: Dive deeper into how AI is transforming love, relationships, and online matchmaking around the globe.
  • Global AI Adoption Statistics: Uncover worldwide AI adoption trends across industries and how these shifts shape user behavior in personal and professional life.
  • AI Writing Statistics 2025: A comprehensive report detailing AI adoption rates, industry usage, content creation impact, and future trends in AI-powered writing.
Was this article helpful?
YesNo
Generic placeholder image
Articles written2539

Midhat Tilawat is endlessly curious about how AI is changing the way we live, work, and think. She loves breaking down big, futuristic ideas into stories that actually make sense—and maybe even spark a little wonder. Outside of the AI world, she’s usually vibing to indie playlists, bingeing sci-fi shows, or scribbling half-finished poems in the margins of her notebook.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *