RAG (Retrieval-Augmented Generation) — Definition & Examples

Definition

Retrieval-Augmented Generation (RAG) is an AI architecture that retrieves relevant documents from a knowledge base at query time, injects them into the language model's context, and generates a response grounded in those documents — reducing hallucination and enabling the model to answer questions about content it wasn't trained on.

RAG is the dominant pattern for production AI applications that need to answer questions about a specific organisation's content — internal docs, product knowledge, customer data, regulatory filings. Without RAG, the model would have to be fine-tuned on that content (expensive, slow to update). With RAG, the same off-the-shelf model can answer questions about content that didn't exist when the model was trained.

A useful RAG system has three core components: an embedding pipeline (converts documents to vector representations and stores them), a retrieval system (semantic search finds the most relevant documents for each query), and a generation step (the LLM produces an answer grounded in the retrieved documents). Each component has its own quality dimensions, and a weak link in any of them produces bad answers.

Origin

RAG was formalised in a 2020 Facebook AI Research paper, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al.). The pattern became mainstream in 2023-2024 as LangChain, LlamaIndex and similar frameworks made RAG implementation accessible.

How it works

Ingest documents into the knowledge base.
Chunk documents into retrievable pieces (typically 200-1000 tokens).
Embed each chunk using an embedding model, such as OpenAI's embedding models or an open-source equivalent.
Store embeddings in a vector database (Pinecone, Qdrant, pgvector, etc.).
At query time, embed the query, find the top-K most similar chunks, and inject them into the LLM prompt as context.
Generate the answer; cite retrieved sources in the response.
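
The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production design: the embed function is a toy word-count vector standing in for a real embedding model, an in-memory list stands in for the vector database, chunks are measured in words rather than tokens, and the final LLM call is left out. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into word-bounded chunks (a real system would count tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. Stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(docs: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Ingest, chunk, embed and store: returns (source, chunk_text, vector) rows."""
    return [(source, c, embed(c)) for source, text in docs.items() for c in chunk(text)]

def retrieve(index: list[tuple[str, str, Counter]], query: str, k: int = 2):
    """Embed the query and return the k most similar rows."""
    q = embed(query)
    return sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)[:k]

def build_prompt(query: str, hits) -> str:
    """Inject retrieved chunks as numbered sources so the answer can cite them."""
    context = "\n".join(f"[{i + 1}] ({src}) {text}" for i, (src, text, _) in enumerate(hits))
    return (
        "Answer using only the numbered sources below and cite them like [1].\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

# Hypothetical knowledge base.
docs = {
    "refunds.md": "Refunds are issued within 14 days of a return request.",
    "shipping.md": "Standard shipping takes 3 to 5 business days.",
    "privacy.md": "We never sell customer data to third parties.",
}
index = build_index(docs)
question = "How long do refunds take?"
hits = retrieve(index, question, k=2)
prompt = build_prompt(question, hits)
print(prompt)  # this string would be sent to the LLM for the generation step
```

Numbering the sources in the prompt is one simple way to make the final citation step possible: the model can refer to [1] or [2], and the application can map those back to file names.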

When to use it

Use when

The AI needs to answer questions about content not in the model's training data.
You are building internal-docs Q&A, customer-support chatbots, or product knowledge assistants.
Hallucinations would be costly.

Skip when

The use case is simple chat where the model's general knowledge is sufficient.
The document set is small enough to fit in the context window directly.
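
Whether a document set fits in the context window can be checked roughly before committing to a RAG build. The sketch below assumes about four characters per token, a rule of thumb for English text that varies by tokenizer and language; the function names and window sizes are hypothetical, and a real check would use the model's own tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1

def fits_in_context(documents: list[str], context_window: int, reserved: int) -> bool:
    """True if all documents fit alongside tokens reserved for the question and answer."""
    budget = context_window - reserved
    return sum(estimate_tokens(d) for d in documents) <= budget

# Hypothetical document sets and window size.
small_docs = ["x" * 4000, "y" * 4000]
large_docs = small_docs * 10

print(fits_in_context(small_docs, context_window=8000, reserved=2000))
print(fits_in_context(large_docs, context_window=8000, reserved=2000))
```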

Key metrics

Retrieval precision (are the retrieved chunks actually relevant?)
Answer accuracy (factually correct, source-grounded)
Citation accuracy (do the cited sources actually say what the answer claims?)
End-to-end latency
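
Retrieval precision is the easiest of these metrics to compute once a small set of queries has been labelled by hand. A minimal sketch, assuming each evaluation record holds the chunk IDs the retriever returned (in rank order) and the set of IDs a reviewer marked relevant; the record layout and IDs are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labelled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)

# Hypothetical hand-labelled evaluation set.
eval_set = [
    {"retrieved": ["c1", "c7", "c3", "c9"], "relevant": {"c1", "c3"}},
    {"retrieved": ["c4", "c2", "c8", "c5"], "relevant": {"c2"}},
]

scores = [precision_at_k(e["retrieved"], e["relevant"], k=4) for e in eval_set]
mean_precision = sum(scores) / len(scores)
print(mean_precision)
```

Because this score depends only on the retriever, it can be tracked separately from answer accuracy, which also depends on the generation step.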

Examples

We replaced our customer-support chatbot's static FAQ with a RAG system over our help docs; ticket resolution went up 35%.
Bad chunking sank the RAG quality — chunks too small lost context; chunks too large made retrieval imprecise.
Adding source citations to the RAG output instantly increased user trust.

In practice at Makreate

Makreate's AI Web App and Mobile App engagements have shipped multiple production RAG systems, including internal docs Q&A, customer support assistants, and product-knowledge chatbots. The pattern we default to: chunked Markdown ingestion, OpenAI text-embedding-3-small for embeddings, pgvector for storage (most clients already have Postgres), and a re-ranking step before LLM generation to demote irrelevant chunks that first-pass retrieval lets through. The result is answer quality our clients trust enough to deploy in customer-facing products.


Common mistakes

Chunking documents poorly — too small or too large.
Not evaluating retrieval quality separately from generation quality.
Skipping re-ranking — first-pass retrieval is often imprecise.
Not handling document updates — stale RAG systems answer with old information.
Not citing sources — answers are less trustworthy without them.
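
Stale answers are usually the cheapest of these mistakes to prevent. One common approach, sketched below under the assumption that a content hash was stored for each document at indexing time, is to compare current hashes against stored ones and re-embed only what changed. The function names and document IDs are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare current documents with the hashes recorded at the last indexing run.

    Returns (to_embed, to_delete): IDs that are new or changed, and IDs whose
    chunks should be removed because the document no longer exists.
    """
    to_embed = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_embed, to_delete

# Hypothetical state: what was indexed last time versus what exists now.
indexed_hashes = {
    "a.md": content_hash("old pricing"),
    "b.md": content_hash("unchanged policy"),
    "c.md": content_hash("retired page"),
}
current_docs = {
    "a.md": "new pricing",
    "b.md": "unchanged policy",
    "d.md": "brand new page",
}

to_embed, to_delete = plan_reindex(current_docs, indexed_hashes)
print(to_embed, to_delete)
```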

Frequently asked

Is RAG better than fine-tuning?

For most use cases involving organisation-specific content, yes — RAG is cheaper, faster to update, and produces verifiable answers. Fine-tuning is better for changing model behaviour or style.

What vector database should I use?

For small projects, pgvector (a Postgres extension) is fine. At larger scale, consider Pinecone, Qdrant or Weaviate. Choose based on existing infrastructure and team familiarity.

Do I need to re-rank retrieved chunks?

Yes — first-pass retrieval often returns relevant-looking chunks that aren't actually useful. A cross-encoder reranker materially improves quality.
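
The shape of a two-stage retrieve-then-rerank step can be sketched as follows. The scorer here is a deliberately simple placeholder (the share of query terms found in the chunk), not a cross-encoder; in a real system a cross-encoder model that scores each query and chunk pair would be passed in its place. All names and sample chunks are hypothetical.

```python
import re
from typing import Callable

def tokens(text: str) -> set[str]:
    """Lowercased word set used by the placeholder scorer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_relevance(query: str, chunk: str) -> float:
    """Placeholder for a cross-encoder: share of query terms found in the chunk."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

def rerank(
    query: str,
    candidates: list[str],
    scorer: Callable[[str, str], float],
    keep: int,
) -> list[str]:
    """Re-score first-pass candidates against the query and keep the best few."""
    return sorted(candidates, key=lambda c: scorer(query, c), reverse=True)[:keep]

# Hypothetical first-pass results, in the order the vector search returned them.
candidates = [
    "Our password policy requires twelve characters.",
    "To reset your password, open Settings and choose Reset password.",
    "Settings also lets you change your avatar.",
]

top = rerank("how do I reset my password", candidates, toy_relevance, keep=1)
print(top[0])
```

Passing the scorer in as a function keeps the pipeline unchanged when the placeholder is swapped for a real reranking model.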