AI · noun

RAG (Retrieval-Augmented Generation)

/ræɡ/

An AI technique that looks up real documents to ground its answers, instead of relying only on training data.

Definition

Retrieval-Augmented Generation (RAG) is an AI architecture that retrieves relevant documents from a knowledge base at query time, injects them into the language model's context, and generates a response grounded in those documents. This reduces hallucination and lets the model answer questions about content it wasn't trained on.

RAG is the dominant pattern for production AI applications that need to answer questions about a specific organisation's content: internal docs, product knowledge, customer data, regulatory filings. Without RAG, the main alternative for a large body of content is fine-tuning the model on it, which is expensive and slow to update. With RAG, the same off-the-shelf model can answer questions about content that didn't exist when the model was trained.

A useful RAG system has three core components: an embedding pipeline (converts documents to vector representations and stores them), a retrieval system (semantic search finds the most relevant documents for each query), and a generation step (the LLM produces an answer grounded in the retrieved documents). Each component has its own quality dimensions, and a weak link in any of them produces bad answers.

Origin

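RAG was formalised in a 2020 paper from Facebook AI Research and collaborators, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al.). The pattern became mainstream in 2023-2024 as LangChain, LlamaIndex and similar frameworks made RAG implementation accessible.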

How it works

  1. Ingest documents into the knowledge base.
  2. Chunk documents into retrievable pieces (typically 200-1000 tokens).
  3. Embed each chunk using an embedding model, such as OpenAI's embedding models or an open-source equivalent.
  4. Store embeddings in a vector database (Pinecone, Qdrant, pgvector, etc.).
  5. At query time, embed the query, find the top-K most similar chunks, and inject them into the LLM prompt as context.
  6. Generate the answer; cite retrieved sources in the response.
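
The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration under loud assumptions, not a production design: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, words stand in for tokens when chunking, a plain in-memory list stands in for the vector database, and the file names and document text are made up. The sketch stops at building the prompt; the call to an LLM is left out.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (words stand in for tokens)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Steps 1-4: ingest, chunk, embed, store as (source, chunk, vector)."""
    index = []
    for source, text in documents.items():
        for chunk in chunk_text(text):
            index.append((source, chunk, embed(chunk)))
    return index


def retrieve(index, query, k=2):
    """Step 5a: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(source, chunk) for source, chunk, _ in ranked[:k]]


def build_prompt(query, hits):
    """Step 5b: inject retrieved chunks as numbered, citable sources."""
    context = "\n".join(
        f"[{i + 1}] ({source}) {chunk}" for i, (source, chunk) in enumerate(hits)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical documents, for illustration only.
docs = {
    "refunds.md": "Refunds are issued within 14 days of purchase when the customer provides a receipt.",
    "shipping.md": "Standard shipping takes five business days. Express shipping takes two business days.",
}
index = build_index(docs)
hits = retrieve(index, "How long do refunds take?", k=1)
prompt = build_prompt("How long do refunds take?", hits)
```

Numbering the sources in the prompt is one simple way to support step 6: the model can be asked to cite `[1]`, `[2]` and so on, and each number maps back to a source name.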

When to use it

Use when

  • The AI needs to answer questions about content not in the model's training data.
  • You are building internal-docs Q&A, customer-support chatbots, or product knowledge assistants.
  • Hallucinations would be costly.

Skip when

  • The use case is simple chat where the model's general knowledge is sufficient.
  • The document set is small enough to fit in the context window directly.
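
A rough way to check the second "skip" condition is to estimate the token count of the whole document set and compare it with the model's context window. The sketch below assumes about four characters per token, a common rule of thumb for English text rather than an exact figure, and the function names and the reserved answer budget are hypothetical. A real check would use the tokenizer of the target model.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; rounds up. The 4-chars-per-token ratio is an assumption."""
    return -(-len(text) // chars_per_token)


def fits_in_context(documents, context_window, reserved_for_answer=1000):
    """Return True if all documents fit in the prompt with room left for the answer."""
    budget = context_window - reserved_for_answer
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= budget


small_set = ["a" * 4000]    # roughly 1,000 tokens
large_set = ["a" * 40000]   # roughly 10,000 tokens
small_fits = fits_in_context(small_set, context_window=8000)
large_fits = fits_in_context(large_set, context_window=8000)
```

If the check passes with a comfortable margin, placing the documents directly in the prompt may be simpler than building a retrieval pipeline.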

Examples

In practice at Makreate

Makreate's AI Web App and Mobile App engagements have shipped multiple production RAG systems, including internal docs Q&A, customer support assistants, and product-knowledge chatbots. The pattern we default to: chunked Markdown ingestion, OpenAI text-embedding-3-small for embeddings, pgvector for storage (most clients already have Postgres), and a re-ranking step before LLM generation to filter out retrieved chunks that look relevant but are not. The result is answer quality our clients trust enough to deploy in customer-facing products.
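
As a rough illustration of the pgvector part of such a stack, the sketch below builds the SQL a Python service might send to Postgres. It is an assumption-laden example, not Makreate's actual schema: the `doc_chunks` table and its columns are hypothetical, the vector size of 1536 matches the default output of text-embedding-3-small, and `<=>` is pgvector's cosine distance operator. The block only constructs the statements and the vector literal; executing them through a driver such as psycopg is left out.

```python
EMBEDDING_DIM = 1536  # default output size of text-embedding-3-small

# Hypothetical schema: one row per chunk, with its source and embedding.
CREATE_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS doc_chunks (
    id bigserial PRIMARY KEY,
    source text NOT NULL,
    content text NOT NULL,
    embedding vector({EMBEDDING_DIM})
);
"""

# Nearest-neighbour search by cosine distance (smaller means more similar).
# The %s placeholders are meant for a driver's parameter binding.
SEARCH_SQL = """
SELECT source, content, embedding <=> %s::vector AS distance
FROM doc_chunks
ORDER BY distance
LIMIT %s;
"""


def to_vector_literal(values):
    """Format a list of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


query_param = to_vector_literal([0.1, 0.2, 3])
```

With a driver, the query embedding would be passed as `to_vector_literal(embedding)` for the first placeholder and the desired top-K as the second.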


Frequently asked

Is RAG better than fine-tuning?

For most use cases involving organisation-specific content, yes. RAG is cheaper, faster to update, and produces answers that can be checked against the retrieved sources. Fine-tuning is better for changing model behaviour or style.

What vector database should I use?

For small projects, pgvector (a Postgres extension) is fine. For larger deployments, consider a dedicated vector database such as Pinecone, Qdrant or Weaviate. Choose based on existing infrastructure and team familiarity.

Do I need to re-rank retrieved chunks?

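Usually yes. First-pass retrieval often returns chunks that look relevant but aren't actually useful for the question. A cross-encoder reranker, which scores each query-chunk pair jointly rather than comparing precomputed embeddings, typically improves the quality of what reaches the LLM.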
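
The shape of a re-ranking step can be sketched as follows. The scorer here is a deliberately simple stand-in (the fraction of query terms found in the chunk), not a cross-encoder; in a real system `score_fn` would wrap a cross-encoder model that scores each query-chunk pair. The function names, parameters and sample chunks are hypothetical.

```python
import re


def lexical_overlap_score(query, chunk):
    """Stand-in scorer: fraction of query terms that appear in the chunk."""
    q_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    c_terms = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, candidates, score_fn=lexical_overlap_score, keep=3, min_score=0.0):
    """Re-score first-pass candidates, drop weak ones, return the best `keep`."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored = [(score, chunk) for score, chunk in scored if score >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]


# Hypothetical first-pass results from the vector search.
candidates = [
    "Shipping takes five days.",
    "Refunds take 14 days with a receipt.",
    "Our office is closed on Sundays.",
]
top = rerank("how many days do refunds take", candidates, keep=2)
filtered = rerank("how many days do refunds take", candidates, min_score=0.1)
```

A common arrangement is to retrieve more candidates than needed in the first pass, then keep only the few highest-scoring chunks after re-ranking, with `min_score` dropping chunks that score too low to be worth including.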
