Hallucination

Definition

A hallucination is output from a large language model (LLM) that is fluent and confident-sounding but factually incorrect — invented citations, fabricated numbers, plausible-but-wrong claims, or misattributed quotes.

Hallucinations are a structural feature of how LLMs work, not a bug to be patched. Models predict the next plausible token; plausibility doesn't track truth. The model that confidently cites a non-existent court case isn't broken — it's working exactly as designed, generating fluent text in the shape of a citation. Improvements in training, prompting, and tool-use reduce hallucination but don't eliminate it.

The mitigation playbook is well understood. Retrieval-augmented generation (RAG) grounds answers in real documents. Tool use lets the model query authoritative sources. Confidence thresholds and citation requirements make the model signal uncertainty and show where each claim comes from. Production AI systems combine all three, and still need human review for high-stakes output.
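
One way to picture how a confidence threshold, a citation requirement, and human review fit together is a small routing step that runs before any answer reaches a user. The sketch below is illustrative only: the `Draft` fields, the `route` function, and the 0.8 threshold are assumptions, and the confidence score is assumed to come from some separate scoring step that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    text: str
    confidence: float     # hypothetical score in [0, 1], produced elsewhere
    citations: list[str]  # ids of the sources attached to the draft
    high_stakes: bool     # e.g. legal, medical, or financial content


def route(draft: Draft, threshold: float = 0.8) -> str:
    """Decide what happens to a draft answer before it reaches a user."""
    if not draft.citations or draft.confidence < threshold:
        return "abstain"       # reply "I don't know" instead of guessing
    if draft.high_stakes:
        return "human_review"  # grounded and confident, but still reviewed
    return "send"
```

The ordering is a design choice: an uncited or low-confidence draft is never sent for review as if it were a finished answer, and high-stakes content is reviewed even when it passes both checks.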

Origin

The term has been used in NLP research since the 2010s; mainstream usage exploded after ChatGPT's launch in November 2022. Andrej Karpathy and others have argued the term is misleading (it implies a malfunction; the behaviour is actually working-as-designed) but the language has stuck.

How it works

Architect for grounding: use RAG to give the model real source documents instead of relying on training data.
Add tool use: let the model query a database or API rather than recall facts from memory.
Force citations: require every factual claim to come with a source.
Use confidence-aware prompting: 'if you don't know, say so'.
Verify high-stakes output with humans or independent checks.
Monitor for hallucinations in production via spot-checks and user-feedback loops.
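
The grounding, citation, and "say so if you don't know" steps above can be sketched together. This is a minimal illustration rather than a production design: the keyword-overlap retriever stands in for a real search index, the names `Doc`, `retrieve`, `build_grounded_prompt`, and `uncited_sentences` are hypothetical, and no model is called. The code only builds the prompt and checks a response for citation markers.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[Doc], k: int = 2) -> list[Doc]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = _tokens(query)
    scored = [(len(q & _tokens(d.text)), d) for d in docs]
    scored = [(score, d) for score, d in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_grounded_prompt(question: str, sources: list[Doc]) -> str:
    """Put the retrieved sources in the prompt and require citations."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in sources)
    return (
        "Answer using ONLY the sources below. "
        "Cite a source id in square brackets in every factual sentence. "
        "If the sources do not contain the answer, reply 'I don't know'.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


def uncited_sentences(answer: str, sources: list[Doc]) -> list[str]:
    """Return sentences with no citation, or with a citation to an unknown id."""
    valid_ids = {d.doc_id for d in sources}
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for sentence in sentences:
        cited = set(re.findall(r"\[([^\]]+)\]", sentence))
        if not cited or not cited <= valid_ids:
            flagged.append(sentence)
    return flagged
```

A response such as "The battery lasts 10 hours [spec-1]. It also charges in 5 minutes." would have its second sentence flagged, which is the signal to reject the answer or send it to review. The check confirms only that a citation is present and points at a supplied source, not that the source actually supports the claim.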

When to use it

Use when

Any LLM-powered product where factual accuracy matters (customer support, legal, medical, financial).
Internal tools where users may take output at face value.
LLM integrations into existing knowledge-work workflows.

Skip when

Pure creative or brainstorming uses where invention is the goal.
Demos and prototypes where output isn't user-facing.

Key metrics

Hallucination rate (% of factual claims that are wrong).
Citation accuracy (% of cited sources that actually support the claim).
User-flagged inaccuracy rate.
Refusal rate (% of responses where the model says 'I don't know'; a small non-zero rate is a healthy floor).
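
Once a sample of responses has been reviewed and labelled, the four metrics above reduce to simple ratios. The sketch below assumes a hypothetical review record; the `ReviewedResponse` fields and the `key_metrics` function are illustrative names, not a standard schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReviewedResponse:
    factual_claims: int        # claims a reviewer checked
    wrong_claims: int          # claims the reviewer found incorrect
    citations: int             # sources the response cited
    supporting_citations: int  # cited sources that actually support the claim
    user_flagged: bool         # a user reported an inaccuracy
    refused: bool              # the model answered "I don't know"


def _ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def key_metrics(rows: list[ReviewedResponse]) -> dict[str, float]:
    """Aggregate the four metrics over a batch of reviewed responses."""
    n = len(rows)
    return {
        "hallucination_rate": _ratio(
            sum(r.wrong_claims for r in rows),
            sum(r.factual_claims for r in rows),
        ),
        "citation_accuracy": _ratio(
            sum(r.supporting_citations for r in rows),
            sum(r.citations for r in rows),
        ),
        "user_flagged_rate": _ratio(sum(r.user_flagged for r in rows), n),
        "refusal_rate": _ratio(sum(r.refused for r in rows), n),
    }
```

Hallucination rate and citation accuracy are computed per claim and per citation, while the user-flagged and refusal rates are computed per response, matching the definitions in the list above.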

Examples

The model fabricated three legal citations. Real-world consequence: a sanctioned lawyer.
Hallucinations don't go away with bigger models; they get more confident.
RAG cut our hallucination rate from 8% to under 1% — at the cost of some latency.

In practice at Makreate

Makreate's AI work, both for clients building AI products and for our own internal tooling, treats hallucinations as a design constraint, not an open question. We architect with grounding (RAG) by default for any factual use case, build citation requirements into prompts, and never ship LLM output to end users without a verification layer. A recent client was building an AI customer-support tool that hallucinated product specs 6% of the time. We refactored it to use RAG against their actual product database and forced citation links to source docs; the hallucination rate dropped below 0.5%, and the support team trusted the output enough to actually use it.

Common mistakes

Treating LLMs as databases. They're text generators that approximate databases — not reliable enough on their own.
Skipping verification on high-stakes output.
Assuming bigger models solve the problem. They tend to hallucinate less often, but they sound more confident when they do.
Not logging and reviewing hallucinations in production. Without logs, you can't improve.

Frequently asked

Can hallucinations be eliminated?

No. They're a structural feature of how LLMs generate text. They can be dramatically reduced (RAG, tool use, citations) but not eliminated.

Do bigger models hallucinate less?

On average, yes — but more confidently when they do. Mitigation depends on architecture (grounding, tool use), not just model size.

How do I detect hallucinations in production?

Combine user-feedback loops, spot-checks, and verification against ground-truth sources. No fully automated detection method is reliable on its own yet.
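
Two of these checks are easy to sketch. The first picks a stable sample of responses for human spot-checks; the second is a deliberately crude ground-truth check that flags numbers in an answer that do not appear in the source text. Both function names and the 5% default rate are assumptions for illustration, and the number check is a heuristic that will miss many errors (for example, a correct number attached to the wrong product).

```python
from __future__ import annotations

import hashlib
import re


def select_for_review(response_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of responses for human spot-checks."""
    digest = hashlib.sha256(response_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def unsupported_numbers(answer: str, source_text: str) -> list[str]:
    """Return numbers that appear in the answer but nowhere in the source."""
    number = r"\b\d+(?:\.\d+)?\b"
    in_source = set(re.findall(number, source_text))
    return [n for n in re.findall(number, answer) if n not in in_source]
```

Hashing the response id, rather than drawing a random number, means the same response is always either in or out of the sample, which keeps the review queue reproducible across reruns.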