KIVA - The Ultimate AI SEO Agent Try it Today!

LLM Hallucination Test: Which AI Model Hallucinates the Most

  • Senior Writer
  • June 13, 2025
    Updated
llm-hallucination-test-which-ai-model-hallucinates-the-most

Did you know that legal information confuses even the smartest AI models? It shows up with a 6.4 percent hallucination rate, while general knowledge questions only hit 0.8 percent. That gap is a big deal when you need the facts to be spot-on.

So let’s tackle the big issue together. LLM hallucination is becoming more common, and with so many tools out there, it’s getting harder to know which one to trust.

I’ll be testing 10 carefully chosen prompts on GPT-4, Claude 3, Gemini Ultra, and Perplexity. We’ll see how each performs when pushed for accuracy. By the end, you’ll know which one slips up the most and which one you can count on.

Which of these LLMs do you believe hallucinates the most when handling complex or factual prompts like legal or scientific queries?


LLM Hallucination: What Does the Data Say?

Hallucination in AI refers to when a language model generates false, misleading, or fabricated information that sounds accurate. LLM hallucination remains a growing concern. Based on benchmark studies from 2024–2025:

  • GPT-4 consistently has the lowest hallucination rate, especially in summarization and reasoning tasks.
  • Claude 3 performs well in reasoning but tends to add extra details in summaries.
  • Gemini Ultra shows promise in factual accuracy but varies across tasks.
  • Perplexity, with its real-time web access, offers the most grounded citations.
  • LLaMA 2 and Mistral 7B, as open-source models, hallucinate more frequently, especially on open-ended questions.

I tested the top-performing LLMs across multiple prompts, and here’s a comparison of how they performed in terms of hallucination accuracy.

Model Avg Truth Score Citation Accuracy Hallucination Rate Best Domain Worst Domain
GPT-4 92% 82% 8% Programming Help Legal Citations
Claude 3 88% 76% 12% General Knowledge Academic References
Gemini Ultra 84% 70% 16% Historical Facts Creative Prompts
Perplexity 89% 91% 7% News and Real-Time Info Legal Interpretations

How Did Each LLM Perform Across the 10 Prompts?

To truly understand LLM hallucination, I tested each model across 10 prompts spanning legal, medical, historical, and technical domains. Below is the detailed analysis of how GPT-4, Claude 3, Gemini Ultra, and Perplexity handled accuracy, citations, and hallucination risks.

Prompt 1: Legal Decision from 2022

Question: What was the ruling in Dobbs v. Jackson Women’s Health Organization?

  • GPT-4: Correct ruling and summarized well, but cited an outdated news link. ✅
  • Claude 3: Explained the ruling, but misquoted a justice’s opinion. ❌
  • Gemini Ultra: Confused the case with a different precedent. ❌
  • Perplexity: Gave correct details with up-to-date source. ✅✅

prompt-1-testing

Score:
GPT-4: 1 | Claude 3: 0 | Gemini: 0 | Perplexity: 2


Prompt 2: Medical Claim

Question: Does turmeric help with depression?

  • GPT-4: Gave balanced info, but no source. ✅
  • Claude 3: Cited a real study but exaggerated effectiveness. ❌
  • Gemini Ultra: Correctly cited a 2021 meta-analysis. ✅✅
  • Perplexity: Provided factual summary with source links. ✅

prompt-2-testing

Score:
GPT-4: 1 | Claude 3: 0 | Gemini: 2 | Perplexity: 1


Prompt 3: Historical Event

Question: What led to the fall of the Ming Dynasty?

  • GPT-4: Accurate, with 3 valid causes. ✅✅
  • Claude 3: Accurate and detailed. ✅
  • Gemini Ultra: Very strong answer. ✅
  • Perplexity: Correct, but brief. ✅

prompt-3-testing

Score:
GPT-4: 2 | Claude 3: 1 | Gemini: 1 | Perplexity: 1


Prompt 4: Coding Function

Question: Write a Python function for sentiment analysis using Hugging Face.

  • GPT-4: Gave working code and explanation. ✅✅
  • Claude 3: Accurate code but outdated model. ✅
  • Gemini Ultra: Provided syntax error in code. ❌
  • Perplexity: Pulled working code from documentation. ✅

prompt-4-testing

Score:
GPT-4: 2 | Claude 3: 1 | Gemini: 0 | Perplexity: 1


Prompt 5: Recent News

Question: What happened in the Reddit API controversy in 2023?

  • GPT-4: Correct details but lacked citation. ✅
  • Claude 3: Accurate but called it a 2022 event. ❌
  • Gemini Ultra: Mentioned real facts but made up a quote. ❌
  • Perplexity: Current data, real quotes, cited sources. ✅✅

prompt-5-testing

Score:
GPT-4: 1 | Claude 3: 0 | Gemini: 0 | Perplexity: 2


Prompt 6: Scientific Claim

Question: Can we reverse aging in mice using epigenetic reprogramming?

  • GPT-4: Cited 2020 Harvard study. ✅
  • Claude 3: Cited study correctly but overclaimed success. ❌
  • Gemini Ultra: Mentioned study but fabricated a researcher’s name. ❌
  • Perplexity: Gave correct claim with study name. ✅✅

prompt-6

Score:
GPT-4: 1 | Claude 3: 0 | Gemini: 0 | Perplexity: 2


Prompt 7: Statistical Fact

Question: What’s the current global average life expectancy?

  • GPT-4: Cited 2023 WHO data. ✅
  • Claude 3: Close but cited 2020 number. ✅✅
  • Gemini Ultra: Correct range but no source. ✅
  • Perplexity: Cited WHO and matched GPT-4. ✅

prompt-7

Score:
GPT-4: 1 | Claude 3: 2 | Gemini: 1 | Perplexity: 1


Prompt 8: Book Quote

Question: Who said “Reality is that which, when you stop believing in it, doesn’t go away”?

  • GPT-4: Correctly attributed to Philip K. Dick. ✅
  • Claude 3: Accurate and contextual. ✅✅
  • Gemini Ultra: Gave wrong book title. ❌
  • Perplexity: Gave right author and source. ✅

prompt-8

Score:
GPT-4: 1 | Claude 3: 1 | Gemini: 0 | Perplexity: 1


Prompt 9: Tech Company News

Question: Did OpenAI acquire any startups in 2024?

  • GPT-4: Made speculative claim, no evidence. ❌
  • Claude 3: Said no acquisitions found. ✅
  • Gemini Ultra: Claimed a fake acquisition. ❌
  • Perplexity: Said no confirmed deals, linked article. ✅✅

prompt-9

Score:
GPT-4: 0 | Claude 3: 1 | Gemini: 0 | Perplexity: 2


Prompt 10: Ask for Sources

Question: Can you cite your answer about carbon emissions in 2023?

  • GPT-4: Gave 3 citations, one was broken link. ❌
  • Claude 3: Provided readable citations but unverifiable. ❌
  • Gemini Ultra: Cited article with incorrect data. ❌
  • Perplexity: Gave valid URL and journal reference. ✅✅

prompt-10

Score:
GPT-4: 0 | Claude 3: 0 | Gemini: 0 | Perplexity: 2


LLM Hallucination Test Results: See Which Models You Can Rely On

Hallucination rates vary widely across language models; some are surprisingly accurate, while others still struggle with facts.

Download the LLM Hallucination Test Results in PDF format to keep this essential breakdown handy for your future AI evaluations!


What Did the Models Get Right or Wrong? Key Findings from the Hallucination Showdown

When it comes to spotting which LLM slips up the most, the patterns speak for themselves. After running all 10 prompts, here are the most important takeaways from this head-to-head comparison.

  • Perplexity Outperformed in Real-World Facts
    Perplexity delivered the most accurate and grounded responses, especially on current events, statistics, and source citations. It hallucinated the least, thanks to its search-based architecture.

  • GPT-4 Balanced Creativity and Accuracy Well
    GPT-4 showed strong performance in technical prompts and historical questions but stumbled slightly with legal precision and citation accuracy. It was confident, though not always correct.

  • Claude 3 Struggled with Citation Integrity
    Claude 3 often offered detailed and fluid answers but hallucinated sources multiple times. It showed high fluency but lacked fact-check reliability under pressure.

  • Gemini Ultra Was the Most Unstable Model
    Gemini Ultra produced vague or evasive answers in factual queries and hallucinated entirely in creative and citation-heavy prompts. Its responses were cautious but occasionally fabricated facts.

  • Legal Prompts Were the Most Hallucination-Prone
    All models except Perplexity struggled with the legal domain. GPT-4 and Gemini Ultra either hallucinated case details or cited non-existent judicial statements.

  • Citation Prompts Exposed Weaknesses in Source Verification
    When directly asked for sources, most models produced at least one broken or unverifiable link. Perplexity remained the only model to consistently cite valid references.

  • Vagueness Did Not Always Prevent Hallucinations
    Gemini’s strategy of avoiding commitment led to vague responses, but even then, it still fabricated data when pressed for specifics.

  • Prompt Type Influenced Performance Patterns
    LLMs performed best in domains like general knowledge and statistics. However, domains demanding high precision—such as law, medicine, and scholarly citations; triggered more hallucinations.

From my experience at AllAboutAI, I found Perplexity to be the most dependable when accuracy really matters. Its ability to pull real-time data and cite live sources gives it a clear edge in reducing hallucinations. If your work involves fact-checking or content you can’t afford to get wrong, Perplexity is the safest choice.


How Do I Test If An LLM Like ChatGPT Or Claude Is Hallucinating In Real-Time?

Detecting hallucinations in real-time from large language models like ChatGPT, Claude, or Gemini is no longer a guessing game in 2025. Thanks to smarter tools and transparent outputs, you can now validate AI-generated content as you go. Here’s how to do it:

testing-for-testing-for-Hallucinating

1. Ask a Fact-Based Question
Example: “Who won the Nobel Prize in Physics in 2024?”
(Focus on verifiable questions rather than open-ended prompts.)

2. Examine Source Attribution

  • ChatGPT (Pro) may not cite by default.
  • Claude often links sources when prompted.
  • Perplexity automatically cites URLs inline.

3. Use a Live Fact-Checker Tool

  • 🔍 GPT-Checker: Highlights claims and auto-verifies with search results.
  • 🛡️ Promptfoo: Tests prompt consistency and truthfulness across models.
  • 📊 Giskard AI: Flags hallucinated outputs in enterprise pipelines.

4. Cross-Verify on Trusted Sources
Copy the AI’s answer into a search engine, Wikipedia, or scientific journal database (e.g., PubMed, JSTOR) for immediate validation.

5. Use Prompt Engineering to Detect Weak Claims
Ask: “How confident are you about that answer?” or “What’s your source?”
Most LLMs will either backtrack or show uncertainty if the claim is fabricated.

LLM Tip: Models tend to hallucinate more when dealing with niche topics, recent events, or less-cited entities.


Why LLM Hallucinations Matter More Than You Think?

While working at AllAboutAI, I’ve seen firsthand how even a small hallucination from an AI model can mislead users, distort understanding, or damage credibility. These mistakes don’t just stay on the screen—they can influence real decisions. Here are three major impacts I’ve noticed.

  1. They Break Trust Instantly: When users catch a model making up facts or citing fake sources, they often stop trusting the tool entirely. I’ve seen readers abandon platforms after just one bad AI answer.
  2. They Can Spread Misinformation Quickly: A hallucinated fact, especially when shared online, can snowball into widespread false beliefs. At AllAboutAI, we’ve had to double-check AI content before publishing to prevent this exact issue.
  3. They Undermine Professional Use Cases: In fields like law, healthcare, and finance, even one hallucinated detail can cause real harm. I’ve worked on projects where verifying every sentence was critical to avoid compliance risks.

What Do the Numbers Say About AI Hallucinations?

To truly understand the scale of the problem, we need to look at the data behind it. These stats reveal just how common hallucinations are in some of the most advanced LLMs and what happens when mitigation techniques are applied.

  • General Hallucination Rates: Without mitigation, hallucination rates in medical case scenarios reached 64.1% for long cases and 67.6% for short cases. When mitigation prompts were added, these rates dropped to 43.1% and 45.3%, showing a notable improvement. (Medrxiv)
  • ChatGPT Hallucination Rate: ChatGPT generates hallucinated content in approximately 19.5% of its responses. These hallucinations often appear in topics like language, climate, and technology, where it may fabricate unverifiable claims. (Report)
  • Llama-2 Hallucination Rate: In one experiment using the InterrogateLLM method, Llama-2 showed hallucination rates as high as 87%, making it one of the most hallucination-prone models tested under that framework. (Report)

What Causes AI to Hallucinate in the First Place?

AI to Hallucinate

Understanding why LLMs hallucinate helps us use them more wisely. These issues aren’t just bugs; they’re built into how the models work. Here are five key reasons behind AI hallucinations:

  • LLMs are trained on past data and don’t have live access to the internet unless designed for it, which means they sometimes guess when asked about newer topics.
  • AI models prioritize generating text that sounds right based on learned patterns, not verifying if the information is factually correct.
  • Even when unsure, models often phrase answers with strong confidence, making hallucinations harder to spot for users.
  • When prompts are unclear or contain too many variables, LLMs tend to fill in the gaps with made-up content to sound helpful.
  • If a model was trained on outdated, biased, or incorrect sources, those inaccuracies can show up in its responses as hallucinations.

How Can Hallucinations In LLMs Be Reduced?

While working at AllAboutAI, I’ve tested and reviewed countless AI-generated responses. Through that experience, I’ve found these strategies consistently help reduce LLM hallucinations and improve response accuracy across different models.

  1. Ask for Sources Directly: Prompting with phrases like “Can you cite your sources?” or “Please include a link” encourages the model to anchor its answer in verifiable information.
  2. Break Down Complex Prompts: Splitting long or layered questions into smaller, clear steps helps the model stay focused and reduces the chance of it filling in blanks with made-up facts.
  3. Use Retrieval-Augmented Models: Tools like Perplexity or ChatGPT with web browsing provide better fact-based responses by pulling in real-time or verified external data.
  4. Cross-Check with Multiple Models: Running the same prompt through different LLMs and comparing responses often reveals inconsistencies or hallucinations one model alone might miss.
  5. Refine and Rephrase Until Precise: If the answer feels off, rephrasing the prompt with more context or clarity usually leads to a more accurate and grounded response.

What Are The Pros And Cons Of Hallucination Detection Tools For LLMs In 2025?

The rise of LLM-generated content has made AI hallucination detection tools essential in 2025, especially for journalists, researchers, and content publishers who rely on factual accuracy.

Tools like TruthfulQA, GPTZero, FactScore, Google’s Retrival-Augmented Evaluation (RAE), and RealityCheck are leading the way in spotting hallucinated outputs from large language models.

Pros

  • Helps fact-check AI-generated content before publishing.
  • Many tools now offer browser extensions or API integrations.
  • Test across GPT-4, Claude, Gemini, etc. in one interface.
  • Set how “strict” or “lenient” you want hallucination detection to be.


Cons

  • Sometimes flag technically accurate but unsourced info.
  • Tools may miss hallucinations in creative or abstract prompts.
  • Enterprise-grade detectors may require paid licenses.
  • Over-correction may hinder creativity or speculative writing.


Can LLMs Handle Medical Misinformation? A Real-World Case Study

To assess how often LLMs produce fabricated or false clinical details (hallucinations) when presented with prompts containing deliberately embedded false information, and to test mitigation strategies.

Methodology:

  • Researchers developed 300 physician-validated clinical vignettes, each containing one deliberately fabricated medical detail like a fake lab result, invented condition, or nonexistent radiological term.
  • Each vignette came in two formats: a short version (50–60 words) and a long version (90–100 words) to observe the effect of prompt length.
  • Six LLMs were evaluated under three different testing conditions: Default setting, Use of a mitigation prompt to reduce hallucinations, Temperature set to zero to control for randomness
  • In total, 5,400 model outputs were generated and examined.
  • Any instance where the model elaborated on the false detail was classified as a hallucination.

Key Findings:

  • Hallucination rates ranged from 50% to 82.7%, revealing high vulnerability to adversarial hallucination attacks.
  • The mitigation prompt significantly reduced hallucination rates, lowering the average from 66% to 44% (p < 0.001).
  • Adjusting temperature to zero did not significantly reduce hallucinations, proving randomness alone isn’t the root issue.
  • Short vignettes triggered slightly more hallucinations (~67.6%) compared to long ones (~64.1%), though not always statistically significant.
  • GPT-4o was the best performer, dropping from 53% to 23% with mitigation. In contrast, open-source models like Distilled-DeepSeek-Llama hallucinated in over 80% of outputs under default settings.
  • In qualitative tests with public health claims, most models avoided outright hallucinations but some still created misleading or unsupported explanations for false statements.

Source: Medrxiv


What Does Reddit Think? Real Users Weigh in on LLM Hallucination

Reddit users had a lot to say when asked about the LLM hallucination rate and the most factual one. Many pointed to OpenAI’s o1 or GPT-4o as the most reliable, especially when paired with internet access. Perplexity also got praise for giving real-time citations that users could actually check.

That said, most agreed you still need to double-check everything, no matter the model. Some users found that asking a model to fact-check or research improved results, especially with o1. Others felt Claude and Gemini often missed the mark unless the topic was coding or very straightforward.

Source: Reddit Thread


What Do the Experts Say About LLM Hallucination?

To add more depth to this discussion, I looked into expert takes on which LLM hallucinates the most. Their insights help explain why some models are more reliable than others and what users should keep in mind when choosing one.

1. GPT-4 Shows the Lowest Hallucination Rate in Summarization Tasks

According to aibusiness.com and the Vectara benchmark, GPT-4 had a hallucination rate of only 3% in summarization, the lowest among all tested models. Even its predecessor, GPT-3.5, performed solidly with ~3.5%, while Claude 2 and LLaMA-2 70B ranged from 5% to 8.5%. This reinforces GPT-4 as the most fact-faithful summarizer in expert-reviewed tasks.

2. Claude 3 and Gemini Excel by Refusing to Answer When Uncertain

In open-domain Q&A tasks, a Cornell and AI2 study found GPT-4 was the most factual, but Claude 3.5 (Haiku) stood out by reducing hallucinations through frequent refusal to answer uncertain prompts.

Gemini also performed impressively in DeepMind’s FACTS benchmark, matching or slightly outperforming GPT-4 in grounded document tasks with 83–86% factual accuracy (venturebeat.com).

3. Reasoning Tasks Expose Small Models, But GPT-4 and Claude Lead

In logic-heavy tests like GSM8K, Stanford’s AI Index shows GPT-4 scoring 92–97% with almost no fabricated steps. Claude 3 followed closely, sometimes even outperforming GPT-4 in multi-step reasoning.

Open-source models like LLaMA-2 and Mistral, especially the 7B versions, frequently inserted false reasoning steps or confidently guessed, leading to hallucination rates over 9% (arxiv.org).

This analysis incorporates perspectives from benchmark researchers (2), academic AI institutions (2), and LLM product evaluators (2).


Future Insights: Will LLMs Ever Stop Hallucinating?

future-insights

The race to build more reliable AI is picking up speed, and hallucination control is right at the center of it. Here’s what the future might look like when it comes to solving the problem of which LLM hallucinates the most.

  1. LLMs Will Rely More on Real-Time Data Integration
    Models connected to live databases or the internet will become the norm to minimize outdated or fabricated information.
  2. Fact-Checking Layers Will Be Built into AI Systems
    Future LLMs will likely include built-in verification mechanisms that cross-check claims before presenting them to users.
  3. Open Benchmarks for Hallucination Tracking Will Emerge
    Transparent public benchmarks will be used to score models on hallucination rates, much like accuracy or speed scores today.

Working at AllAboutAI, I’ve seen the urgent need for AI models to be more accountable and verifiable. Many projects now demand outputs that can be trusted without manual checking every time. I believe the future lies in models that not only generate content but also justify and verify what they say in real time.



FAQs

Hallucinations in LLMs are false or made-up facts the model generates with confidence. They usually occur in open-ended tasks like question answering or summarization.
Even when wrong, the model may present information as if it were true.


Hallucination rates vary by task and model. GPT-4 shows as low as 3% in summarization, while models like LLaMA-2 or Mistral can reach up to 9–12%. In open-ended Q&A, rates can rise above 65% without grounding.


The most common hallucination is factual inaccuracy, where the model generates details that sound right but are false. This often happens in open-domain answers and long-form summaries. Citation fabrication is also frequent in less grounded models.


Yes, Perplexity tends to hallucinate less because it retrieves real-time data and includes source citations. This helps verify information and reduce reliance on parametric memory. However, even cited content should still be checked for context.


GPT-4.5 builds on GPT-4’s factual strength with even better citation handling and reasoning. It ranks among the most accurate models tested in 2024–2025. While not immune to hallucinations, it performs better than Claude, Gemini, and all open-source models in most benchmarks.


Conclusion

After testing 10 prompts across major models, it’s clear that hallucinations still challenge even the most advanced systems. From my analysis, Perplexity cited sources most consistently, while GPT-4 balanced helpfulness with strong factual accuracy. Claude and Gemini showed promise but still slipped in some high-stakes prompts.

LLM hallucination severity depends on the task, but overall, smaller or untuned models hallucinate far more often. As LLMs evolve, the goal isn’t perfection but progress; and we’re definitely getting closer. Which model do you trust the most with facts? Let me know in the comments!

Was this article helpful?
YesNo
Generic placeholder image
Senior Writer
Articles written25

I write data-driven, statistics-backed articles that explore how Artificial Intelligence is shaping our world. I focus on turning complex AI trends into easy-to-digest insights by connecting real-world data with emerging technologies. From ethical concerns to future predictions, the content is crafted to inform, educate, and engage readers who want a clear view of where AI is headed.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *