Definition
Fine-tuning is the process of taking a pre-trained AI model and continuing to train it on a smaller, domain-specific dataset — adapting its behaviour, style, or knowledge to a particular use case without training a model from scratch.
Fine-tuning was the default customisation approach in the GPT-3 era. With the rise of much larger models and retrieval-augmented architectures (RAG), most production teams now reach for prompting and RAG before fine-tuning — they're cheaper, faster to iterate, and produce results that are easier to inspect and update.
Fine-tuning still has legitimate uses: enforcing very specific output formats, adapting model style to a brand voice, training on proprietary task data where prompt engineering hits a ceiling. But it's no longer the first move; it's a specialised tool.
Origin
Fine-tuning as a transfer-learning technique predates LLMs — it's been standard practice in deep learning since ~2014. The OpenAI fine-tuning API (2021) brought it to mainstream LLM use; the technique remains essential in computer vision and speech.
How it works
- Determine whether fine-tuning is actually needed (try prompting and RAG first).
- Build a high-quality dataset (typically 50-1,000 examples for instruction fine-tuning).
- Choose the base model (GPT-4o-mini, Llama, Claude, etc. — vendor support varies).
- Run the fine-tuning job (most major vendors offer managed fine-tuning).
- Evaluate the fine-tuned model against the base model on a held-out test set.
- Deploy with monitoring; budget for ongoing retraining as data drifts.
When to use it
Use when
- When prompting and RAG can't achieve the required output quality or style.
- For domain-specific task formats with proprietary data.
- When latency or cost demands a smaller fine-tuned model.
Skip when
- For general-purpose tasks — the base models are very strong.
- When the dataset is small (under 50 examples).
- Before exhausting prompt engineering and RAG.
Key metrics
- Task accuracy on held-out test set
- Latency vs. base model
- Cost per request vs. base model
- Time-to-update when behaviour needs to change
Examples
- We fine-tuned a small model on customer-support ticket categorisation and cut cost 95% with no accuracy loss.
- The fine-tune didn't beat the prompt — we'd over-indexed on the technique.
- Fine-tuning on brand voice produced cleaner copy than any prompt we'd written.
In practice at Makreate
Makreate AI engagements treat fine-tuning as a specialised tool, not a default. We typically exhaust prompting and RAG before recommending fine-tuning, because the iteration speed of prompting is materially faster and the quality gap has narrowed substantially with modern models. When fine-tuning is genuinely the right tool — for very specific output formats or significant cost optimisation — we build the eval framework first and the fine-tune second.
AI Web App Development →Common mistakes
- Fine-tuning before exhausting prompting and RAG.
- Fine-tuning on too small a dataset.
- Not measuring against base-model performance.
- Forgetting that fine-tuned models age — retrain as data drifts.
- Locking yourself into a vendor's fine-tuning format.
Frequently asked
How many examples do I need to fine-tune?
Highly variable. Instruction fine-tuning often works with 50-500 examples. Classification tasks may need 1,000+. Style adaptation often needs surprisingly few (50-200).
Fine-tune or use RAG?
RAG for current/proprietary knowledge; fine-tuning for output style or format. They can also be combined.
Should I fine-tune the latest model or a smaller one?
For cost-sensitive workloads, fine-tune the smallest model that meets quality bars. For quality-sensitive workloads, prompt the largest available model first.