Prompt Engineering

Definition

Prompt engineering is the discipline of designing the inputs to a Large Language Model — instructions, context, examples, format requirements, constraints — so the model consistently produces outputs that meet production requirements for accuracy, format, and tone.

Prompt engineering is to LLM applications what query optimisation is to databases or compiler flags are to native code. The same model produces wildly different outputs depending on how it's prompted. Vague prompts get vague answers; structured prompts with clear instructions, relevant context, and worked examples (few-shot) get production-quality outputs.

The craft is rapidly maturing. The early chaos of "prompt magic" is giving way to structured patterns: chain-of-thought reasoning, critique-and-revise prompting against written principles (sometimes called constitutional prompting), output schemas (JSON mode, structured generation), and prompt templating frameworks. Production prompts are versioned, evaluated, and tested like any other piece of code.

Origin

The discipline emerged with GPT-3's release (2020), when developers discovered that the model's behaviour was highly sensitive to phrasing. Few-shot prompting was described in the GPT-3 paper itself; chain-of-thought prompting was named in academic papers in 2022. The broader practice matured into a recognisable discipline through 2023–2024.

How it works

Define the task precisely — what output, what format, what constraints.
Write a structured prompt: role, instruction, context, format spec, examples.
Include 2–5 worked examples (few-shot) when the task is complex or format-sensitive.
Use chain-of-thought ("think step by step") for reasoning tasks.
Constrain output (JSON schema, structured generation, regex) where format matters.
Build an evaluation set; iterate prompt variations against it; ship the winner.
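
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: `PromptSpec`, `accuracy`, and `fake_model` are hypothetical names, the prompt layout is one possible convention, and `fake_model` is an offline stand-in that you would replace with a call to your model provider.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PromptSpec:
    """Hypothetical container for the parts of a structured prompt."""
    name: str
    role: str
    instruction: str
    format_spec: str
    examples: list[tuple[str, str]] = field(default_factory=list)  # (input, output) pairs

    def render(self, user_input: str, context: str = "") -> str:
        """Assemble role, instruction, format spec, context, and examples into one string."""
        parts = [
            f"Role: {self.role}",
            f"Instruction: {self.instruction}",
            f"Output format: {self.format_spec}",
        ]
        if context:
            parts.append(f"Context: {context}")
        for example_input, example_output in self.examples:
            parts.append(f"Example input: {example_input}\nExample output: {example_output}")
        parts.append(f"Input: {user_input}\nOutput:")
        return "\n\n".join(parts)


def accuracy(spec: PromptSpec, model: Callable[[str], str], eval_set: list[tuple[str, str]]) -> float:
    """Fraction of eval cases where the model's reply exactly matches the expected label."""
    hits = sum(model(spec.render(text)).strip() == expected for text, expected in eval_set)
    return hits / len(eval_set)


def fake_model(prompt: str) -> str:
    """Offline stand-in for a real LLM call (assumption: swap in your provider's client)."""
    user_input = prompt.rsplit("Input: ", 1)[1].split("\n", 1)[0]
    label = "positive" if "great" in user_input.lower() else "negative"
    # Pretend the model only sticks to the bare label once it has seen examples.
    return label if "Example input:" in prompt else f"I think this is {label}."


EVAL_SET = [("The service was great", "positive"), ("Terrible experience", "negative")]

zero_shot = PromptSpec(
    name="zero-shot",
    role="You are a sentiment classifier.",
    instruction="Classify the input as positive or negative.",
    format_spec="A single lowercase word.",
)
few_shot = PromptSpec(
    name="few-shot",
    role=zero_shot.role,
    instruction=zero_shot.instruction,
    format_spec=zero_shot.format_spec,
    examples=[("Loved it", "positive"), ("Never again", "negative")],
)

# Score every variant against the same eval set and keep the best one.
scores = {spec.name: accuracy(spec, fake_model, EVAL_SET) for spec in (zero_shot, few_shot)}
winner = max(scores, key=scores.get)
```

Exact-match scoring is the simplest possible grader; tasks with free-form answers usually need a looser comparison or a rubric.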

When to use it

Use when

On every LLM-powered feature beyond throwaway demos.
When prompt outputs are inconsistent or low-quality.
When migrating between models — prompts often need adjustment per model.

Skip when

On problems where deterministic logic would do. Prompt engineering can't fix a fundamentally non-LLM problem.

Key metrics

Task accuracy on the evaluation set.
Output format consistency (JSON validity, schema adherence).
Cost (tokens used per request).
Latency (longer prompts and longer outputs mean slower responses).
Hallucination rate by category.
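
As a rough sketch, most of these metrics can be computed from logged eval runs. The `EvalRecord` fields and the `{"answer": ...}` output schema below are assumptions made for illustration. Hallucination rate is left out because it needs human or model-graded labels that this sketch does not have.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRecord:
    """One logged eval run; the field names here are illustrative, not a standard."""
    raw_output: str
    expected_answer: str
    total_tokens: int
    latency_seconds: float


REQUIRED_KEYS = {"answer"}  # hypothetical output schema: {"answer": "..."}


def parse_output(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with the required keys, else None."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return None
    return parsed


def summarise(records: list[EvalRecord]) -> dict[str, float]:
    """Aggregate accuracy, format validity, token usage, and latency over an eval run."""
    parsed = [parse_output(record.raw_output) for record in records]
    n_valid = sum(p is not None for p in parsed)
    n_correct = sum(
        p is not None and p["answer"] == record.expected_answer
        for p, record in zip(parsed, records)
    )
    total = len(records)
    return {
        "accuracy": n_correct / total,
        "format_valid_rate": n_valid / total,
        "mean_tokens": sum(r.total_tokens for r in records) / total,
        "mean_latency_seconds": sum(r.latency_seconds for r in records) / total,
    }
```

In this sketch an output that fails to parse counts as both a format failure and a wrong answer, so the two rates are worth reading together.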

Examples

Prompt engineering cut our LLM cost in half by replacing one expensive call with three cheaper structured ones.
A well-engineered prompt is the difference between a demo and production.
We added few-shot examples and accuracy on the eval set jumped from 67% to 91%.

In practice at Makreate

Every Makreate AI build invests in prompt engineering as a first-class discipline — versioned, evaluated, and tested like code. On a recent client engagement we built a customer-support copilot. Initial prompts hit 72% accuracy on our eval set. Six iterations later — adding role definition, structured output schema, four worked examples, and chain-of-thought reasoning — we hit 94%. Same model, same data, three weeks of prompt work.

Common mistakes

Treating prompts as throwaway strings. Prompts are product surface; version them.
No evaluation set. Without one, every change is guessing.
Stuffing prompts with too much context. Long prompts cost more, run slower, and degrade attention to key instructions.
Forgetting that prompts are model-specific. A great GPT-4 prompt may need re-engineering for Claude or Gemini.
Skipping format constraints. Free-form output is hard to consume programmatically.
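
To illustrate the last point, here is a minimal sketch of enforcing a format constraint on the consuming side: validate the reply, and retry with the error fed back when it does not match. The schema in `EXPECTED_FIELDS`, the function names, and the retry wording are all hypothetical. Where a provider offers native structured output, that is usually preferable to retrying.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type after JSON parsing.
EXPECTED_FIELDS = {"category": str, "confidence": float}


def validate(raw: str) -> dict:
    """Parse raw model output and raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for name, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} must be of type {expected_type.__name__}")
    return data


def generate_structured(model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model until its reply validates, feeding the error back on each retry."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = model(attempt_prompt)
        try:
            return validate(raw)
        except ValueError as error:
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was rejected ({error}). "
                "Reply with JSON only."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

Here `model` is any callable that takes a prompt string and returns the reply text. The type check is strict: a whole-number confidence such as `1` would be rejected because JSON parsing yields an `int`.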

Frequently asked

Few-shot or zero-shot prompting?

Few-shot when format or domain is non-obvious — examples teach the model what "correct" looks like. Zero-shot when the task is well-known to the model and brevity matters.

Should I write prompts in English?

Yes, in production, almost always. English-language training data dominates major LLMs, so English prompts tend to produce more reliable behaviour. Specialised use cases may differ, so confirm the choice against your evaluation set.

How do I version prompts?

Same as code: git-tracked, reviewed in PRs, with an eval suite that runs in CI. PromptLayer, LangSmith, and similar tools add observability.
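
One way this might look in practice, as a sketch with hypothetical helper names: fingerprint the exact prompt text so logs can be tied back to a version, and have the CI job fail when eval accuracy regresses. Neither function comes from the tools named above; both use only the Python standard library.

```python
import hashlib


def prompt_fingerprint(prompt_text: str) -> str:
    """Short, stable identifier for a prompt's exact text, suitable for request logs."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def check_regression(candidate_accuracy: float, baseline_accuracy: float, tolerance: float = 0.01) -> None:
    """Raise AssertionError if the candidate scores below the baseline by more than the tolerance."""
    if candidate_accuracy < baseline_accuracy - tolerance:
        raise AssertionError(
            f"eval accuracy dropped from {baseline_accuracy:.2%} to {candidate_accuracy:.2%}"
        )
```

A CI job would read the prompt file from the repository, run the eval suite, and call `check_regression` with the stored baseline; how the baseline is stored is left open here.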