Definition
A Large Language Model (LLM) is an AI system — typically built on the transformer architecture — trained on very large text corpora to predict and generate natural-language text, capable of tasks ranging from writing and summarisation to reasoning, classification and code generation.
LLMs (ChatGPT, Claude, Gemini, etc.) became the default substrate of modern AI applications between 2022 and 2025. The shift was less about a new capability and more about a usability threshold being crossed — models became accurate enough on enough tasks that building real products on them became economically viable.
Practical LLM use in production is less about raw model capability and more about engineering discipline: clear prompting, robust evaluation, fallback handling, latency budgets, cost management, and the right architecture (often RAG, agents, or both) for the specific use case. Teams that focus on the model and neglect the engineering around it consistently underperform.
Origin
The transformer architecture was introduced in Google's 2017 paper 'Attention Is All You Need'. OpenAI's GPT-3 (2020) showed that scaling up model size and training data produces strong few-shot performance, the approach behind contemporary LLMs. ChatGPT (Nov 2022) made the capability mainstream.
How it works
- Pick a model based on capability, latency, cost (Claude, GPT, Gemini, open-source).
- Engineer the prompt — clear instructions, examples, structured output format.
- Wrap with retrieval (RAG) if the model needs current or org-specific knowledge.
- Add evaluation — how do you know the model is doing the right thing?
- Add fallback handling for failure modes (refusals, hallucinations, malformed output).
- Monitor in production — latency, cost, accuracy, drift over time.
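The prompt, structured-output and fallback steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `call_model` is a hypothetical callable standing in for whichever vendor SDK you use, and the label set, prompt template and `classify_ticket` helper are illustrative names, not part of any real API.

```python
import json
from typing import Callable, Optional, Tuple

LABELS = {"billing", "bug", "other"}  # hypothetical label set

# Clear instruction, an example, and a structured output format.
PROMPT_TEMPLATE = (
    "Classify the support ticket into one of: billing, bug, other.\n"
    'Reply with JSON only, e.g. {{"label": "bug"}}.\n'
    "Ticket: {ticket}"
)


def parse_label(raw: str) -> Optional[str]:
    """Return a valid label, or None if the model output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    if isinstance(label, str) and label in LABELS:
        return label
    return None


def classify_ticket(
    ticket: str,
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    max_attempts: int = 2,
    fallback: str = "other",
) -> Tuple[str, bool]:
    """Return (label, used_fallback). Retries on malformed output."""
    prompt = PROMPT_TEMPLATE.format(ticket=ticket)
    for _ in range(max_attempts):
        label = parse_label(call_model(prompt))
        if label is not None:
            return label, False
    # Every attempt produced unusable output: use the safe default.
    return fallback, True
```

Returning the `used_fallback` flag alongside the label lets the caller log how often the fallback path fires, which feeds directly into production monitoring.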
When to use it
Use when
- The task centres on natural-language understanding or generation.
- The work is summarisation, classification, extraction, drafting, or coding assistance.
- The alternative is human work at scale.
Skip when
- Determinism is critical and any LLM error is unacceptable.
- The input data is fundamentally not language-shaped.
Key metrics
- Task accuracy (against an eval set)
- Cost per request
- Latency per request
- Production error rate
- Fallback usage rate
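As an illustration, the metrics above could be computed from per-request records roughly as follows. This is a sketch, not a prescribed schema: the `RequestLog` fields and the `summarise` function are hypothetical, and accuracy is computed only over records that carry a ground-truth label (for example, items replayed from an eval set).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RequestLog:
    predicted: str
    expected: Optional[str]  # ground truth; None for unlabelled traffic
    cost_usd: float
    latency_ms: float
    errored: bool
    used_fallback: bool


def summarise(logs: List[RequestLog]) -> Dict[str, Optional[float]]:
    """Aggregate the five key metrics over a batch of request logs."""
    if not logs:
        raise ValueError("no logs to summarise")
    n = len(logs)
    labelled = [r for r in logs if r.expected is not None]
    accuracy = (
        sum(r.predicted == r.expected for r in labelled) / len(labelled)
        if labelled
        else None  # accuracy is undefined without labelled records
    )
    return {
        "accuracy": accuracy,
        "cost_per_request_usd": sum(r.cost_usd for r in logs) / n,
        "latency_per_request_ms": sum(r.latency_ms for r in logs) / n,
        "error_rate": sum(r.errored for r in logs) / n,
        "fallback_rate": sum(r.used_fallback for r in logs) / n,
    }
```

In practice, latency is usually tracked as percentiles (p50, p95) rather than a mean; the mean is used here only to keep the sketch short.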
Examples
- We replaced 60 hours/month of manual ticket triage with an LLM classifier; accuracy hit 94% in production.
- The model worked great in eval and terribly in production — the eval set wasn't representative.
- Switching from GPT-4 to Claude 3.5 Sonnet for our workload cut cost 70% with no quality drop.
In practice at Makreate
Makreate AI engineering engagements use multi-model architectures by default — different models for different workloads, with cost and latency optimisation built in from day one. A typical production system might use Claude for high-quality generation, GPT-4o for vision tasks, and a smaller model for classification. We treat model selection as an ongoing optimisation, not a one-time choice, because the frontier moves fast.
Common mistakes
- Choosing the most powerful model when a smaller one would do the job.
- Not building an evaluation set — without eval, you can't optimise.
- Ignoring fallback handling — production LLMs fail in ways eval doesn't catch.
- Hardcoding model choice — the frontier moves, your stack should too.
- Underestimating prompt engineering — small prompt changes have outsized impact.
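One way to avoid the first few mistakes is to let an eval set, rather than habit, choose the model. The sketch below picks the cheapest candidate that clears an accuracy bar. The candidate tuples, the `predict` callables and the cost figures are all hypothetical placeholders for your own models and pricing.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (model name, cost per request, predict function) -- all hypothetical
Candidate = Tuple[str, float, Callable[[str], str]]


def accuracy(
    predict: Callable[[str], str],
    eval_set: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of (input, expected) pairs the model gets right."""
    if not eval_set:
        raise ValueError("eval set is empty")
    correct = sum(predict(text) == expected for text, expected in eval_set)
    return correct / len(eval_set)


def cheapest_passing(
    candidates: List[Candidate],
    eval_set: Sequence[Tuple[str, str]],
    min_accuracy: float,
) -> Optional[str]:
    """Return the cheapest model meeting the accuracy bar, else None."""
    for name, _cost, predict in sorted(candidates, key=lambda c: c[1]):
        if accuracy(predict, eval_set) >= min_accuracy:
            return name
    return None
```

Because the choice is the output of a function over a candidate list, re-running it when a new model ships is a configuration change rather than a code rewrite.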
Frequently asked
Which LLM should I use?
It depends on the workload. As a rough guide: Claude for reasoning and long context; GPT-4o for vision and tool use; Gemini for multimodal input; open-source models (Llama, Mistral) when you need on-prem deployment or full control.
Is fine-tuning necessary?
Usually no — prompt engineering and RAG cover most use cases. Fine-tuning makes sense for very narrow style adaptation or proprietary task formats.
How do I control LLM cost in production?
Route by difficulty (a small model for easy queries, a large one for hard ones), cache aggressively, set max-token limits, and monitor and alert on cost spikes.
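A minimal sketch of routing, caching and a token cap together, under loud assumptions: the model names are placeholders, `call_model` is a hypothetical callable wrapping your SDK, and word count is a deliberately crude stand-in for a real difficulty signal such as a classifier or a confidence score.

```python
from functools import lru_cache
from typing import Callable

SMALL_MODEL = "small-model"  # hypothetical model identifiers
LARGE_MODEL = "large-model"
MAX_OUTPUT_TOKENS = 256  # hard cap on response length


def pick_model(query: str, word_threshold: int = 50) -> str:
    """Route short queries to the small model, long ones to the large one."""
    if len(query.split()) <= word_threshold:
        return SMALL_MODEL
    return LARGE_MODEL


def make_cached_client(
    call_model: Callable[[str, str, int], str],  # (model, query, max_tokens)
) -> Callable[[str], str]:
    """Wrap call_model so identical queries are answered from memory."""

    @lru_cache(maxsize=1024)
    def cached(query: str) -> str:
        return call_model(pick_model(query), query, MAX_OUTPUT_TOKENS)

    return cached
```

An exact-match cache only helps when queries repeat verbatim; normalising whitespace and case before lookup is a common way to raise the hit rate.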