Definition
Retrieval-Augmented Generation (RAG) is an AI architecture that retrieves relevant documents from a knowledge base at query time, injects them into the language model's context, and generates a response grounded in those documents — reducing hallucination and enabling the model to answer questions about content it wasn't trained on.
RAG is the dominant pattern for production AI applications that need to answer questions about a specific organisation's content — internal docs, product knowledge, customer data, regulatory filings. Without RAG, the model would have to be fine-tuned on that content (expensive, slow to update). With RAG, the same off-the-shelf model can answer questions about content that didn't exist when the model was trained.
A useful RAG system has three core components: an embedding pipeline (converts documents to vector representations and stores them), a retrieval system (semantic search finds the most relevant documents for each query), and a generation step (the LLM produces an answer grounded in the retrieved documents). Each component has its own quality dimensions, and a weak link in any of them produces bad answers.
Origin
RAG was formalised in a 2020 paper led by Facebook AI Research (Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"). The pattern became mainstream in 2023-2024 as frameworks such as LangChain and LlamaIndex made RAG implementation accessible.
How it works
- Ingest documents into the knowledge base.
- Chunk documents into retrievable pieces (typically 200-1000 tokens).
- Embed each chunk using an embedding model (OpenAI's embedding models or an open-source equivalent).
- Store embeddings in a vector database (Pinecone, Qdrant, pgvector, etc.).
- At query time, embed the query, find the top-K most similar chunks, and inject them into the LLM prompt as context.
- Generate the answer; cite retrieved sources in the response.
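The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is an in-memory list, chunk sizes are counted in words rather than tokens, and every function name and sample document here is hypothetical. The generation call itself is left out; the sketch stops at the prompt that would be sent to the LLM.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs):
    """docs maps a source name to its text; returns (source, chunk, vector) rows."""
    return [(source, chunk, embed(chunk))
            for source, text in docs.items()
            for chunk in chunk_words(text)]


def retrieve(index, query, k=2):
    """Return the k rows whose vectors are most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)[:k]


def build_prompt(query, hits):
    """Number each retrieved chunk so the model can cite it."""
    context = "\n".join(
        f"[{i}] ({source}) {chunk}" for i, (source, chunk, _) in enumerate(hits, 1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


# Made-up sample documents, for illustration only.
docs = {
    "refunds.md": "Refunds are issued within 14 days of purchase when the order is unused.",
    "shipping.md": "Standard shipping takes 3 to 5 business days within the country.",
}
index = build_index(docs)
hits = retrieve(index, "How long does shipping take?", k=1)
prompt = build_prompt("How long does shipping take?", hits)
print(prompt)
```

In a real system, `embed` would call an embedding model and `build_index` would write to a vector store; the shape of the flow stays the same.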
When to use it
Use when
- The AI needs to answer questions about content not in the model's training data.
- The use case is internal-docs Q&A, a customer-support chatbot, or a product knowledge assistant.
- Hallucinations would be costly.
Skip when
- The use case is simple chat where the model's general knowledge is sufficient.
- The document set is small enough to fit in the context window directly.
Key metrics
- Retrieval precision (are the retrieved chunks actually relevant?)
- Answer accuracy (factually correct, source-grounded)
- Citation accuracy (do the cited sources actually say what the answer claims?)
- End-to-end latency
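As an illustration of the retrieval metric, precision can be computed per query once a set of chunks has been hand-labelled as relevant. The sketch below assumes such labels exist; the chunk ids and function names are hypothetical. Recall is included alongside precision because the two are usually read together, though it is not one of the metrics listed above.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk ids that are labelled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in relevant) / len(top)


def recall_at_k(retrieved, relevant, k):
    """Fraction of the labelled-relevant chunk ids that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for cid in set(retrieved[:k]) if cid in relevant) / len(relevant)


# Made-up example: ranked retrieval output and hand-labelled relevant chunks.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}

p3 = precision_at_k(retrieved, relevant, k=3)  # only c2 is relevant among the top 3
r4 = recall_at_k(retrieved, relevant, k=4)     # c2 and c4 found, c5 missed
print(p3, r4)
```

Averaging these values over a fixed set of test queries gives a retrieval score that can be tracked independently of generation quality.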
Examples
- We replaced our customer-support chatbot's static FAQ with a RAG system over our help docs; ticket resolution went up 35%.
- Bad chunking sank the RAG quality — chunks too small lost context; chunks too large made retrieval imprecise.
- Adding source citations to the RAG output instantly increased user trust.
In practice at Makreate
Makreate's AI Web App and Mobile App engagements have shipped multiple production RAG systems — including internal docs Q&A, customer support assistants, and product-knowledge chatbots. The pattern we default to: chunked Markdown ingestion, OpenAI text-embedding-3-small for embeddings, pgvector for storage (most clients already have Postgres), and a re-ranking step before LLM generation to catch retrieval misses. The result is answer quality our clients trust enough to deploy customer-facing.
Common mistakes
- Chunking documents poorly — too small or too large.
- Not evaluating retrieval quality separately from generation quality.
- Skipping re-ranking — first-pass retrieval is often imprecise.
- Not handling document updates — stale RAG systems answer with old information.
- Not citing sources — answers are less trustworthy without them.
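For the document-update mistake, one common approach (an assumption here, not something the list prescribes) is to store a content fingerprint with each ingested document and compare it on every sync, so only new or changed documents are re-embedded and removed ones are deleted from the index. The function names and sample data below are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(indexed, current):
    """Compare stored fingerprints with the current documents.

    indexed: {doc_id: fingerprint stored at last ingest}
    current: {doc_id: current text}
    Returns (to_embed, to_delete) as sorted lists of doc ids.
    """
    to_embed = sorted(
        doc_id for doc_id, text in current.items()
        if indexed.get(doc_id) != fingerprint(text)  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in current)
    return to_embed, to_delete


# Made-up state: one document changed, one was removed, one is new.
indexed = {
    "pricing.md": fingerprint("Plan A costs 10."),
    "old.md": fingerprint("Retired page."),
}
current = {"pricing.md": "Plan A costs 12.", "faq.md": "New page."}

to_embed, to_delete = plan_sync(indexed, current)
print(to_embed, to_delete)
```

When a document is re-embedded, its old chunks need to be removed from the store first; otherwise the stale and fresh versions can both be retrieved.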
Frequently asked
Is RAG better than fine-tuning?
For most use cases involving organisation-specific content, yes — RAG is cheaper, faster to update, and produces verifiable answers. Fine-tuning is better for changing model behaviour or style.
What vector database should I use?
For small projects, pgvector (a Postgres extension) is fine. At larger scale, consider Pinecone, Qdrant or Weaviate. Choose based on existing infrastructure and team familiarity.
Do I need to re-rank retrieved chunks?
Yes — first-pass retrieval often returns relevant-looking chunks that aren't actually useful. A cross-encoder reranker materially improves quality.
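The two-stage shape can be sketched as follows. This is only an outline of where a reranker sits: `overlap_score` is a toy term-overlap scorer standing in for a cross-encoder model, and the candidate chunks and function names are made up. In practice `score_fn` would be replaced by a call to a real reranking model that scores each (query, chunk) pair.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric terms in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: share of query terms found in the chunk."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0


def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Re-score first-pass candidates against the query and keep the best top_n."""
    scored = sorted(candidates, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return scored[:top_n]


# Candidates in the order a first-pass vector search might return them (made up).
candidates = [
    "Our password policy requires twelve characters.",
    "To reset your password, open Settings and choose Reset password.",
    "Contact support if you cannot log in.",
]
best = rerank("how do I reset my password", candidates, top_n=1)
print(best[0])
```

A typical setup retrieves a larger candidate set in the first pass and passes only the reranked top few to the LLM, trading a little latency for more relevant context.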