Definition
Heuristic evaluation is a usability inspection method in which expert evaluators assess an interface against a fixed set of recognised usability principles (heuristics) — most commonly Nielsen's 10 — and document violations with severity ratings.
Heuristic evaluation is among the cheapest and fastest usability methods available. Three to five expert evaluators independently review the interface against the heuristic checklist, document every issue, then merge findings into a prioritised list. There is no user recruitment, no scheduling, and no NDAs, so a useful audit fits in a week.
The limitation is the inverse of the strength: experts are not users. Heuristic evaluation surfaces problems any reasonable evaluator can predict — broken back buttons, inconsistent labels, missing error messages — but it cannot reveal what real users actually misunderstand or skip. The right way to use heuristic evaluation is as a first-pass audit that clears obvious issues before the expensive observed-session work begins.
Origin
Developed by Jakob Nielsen and Rolf Molich in 1990. Nielsen's 10 heuristics, refined in 1994 and lightly updated since, remain the dominant framework. Other frameworks (Shneiderman's 8 Golden Rules, Bastien & Scapin's ergonomic criteria) cover overlapping ground.
How it works
- Pick a heuristic set — Nielsen's 10 is the default.
- Brief 3–5 evaluators on the product, scope, and target user.
- Each evaluator independently walks the product, noting heuristic violations and rating severity (0–4).
- Aggregate findings: merge duplicates and near-duplicates, and keep the unique issues.
- Prioritise by severity × frequency; produce a fix list.
- Optionally re-evaluate after fixes ship to confirm resolution.
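The aggregation and prioritisation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Finding` record, the `issue_id` key (assumed to be assigned by hand when duplicates are merged), and the sample data are all hypothetical. It also assumes that "frequency" means the number of evaluators who reported an issue; if frequency is instead taken as how often users would hit the problem, that number would be substituted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    evaluator: str  # who reported the issue
    issue_id: str   # hypothetical key assigned when duplicates are merged
    heuristic: str  # the heuristic the issue violates
    severity: int   # that evaluator's 0-4 rating


def prioritise(findings):
    """Group findings by issue and rank by mean severity x evaluator count."""
    by_issue = defaultdict(list)
    for finding in findings:
        by_issue[finding.issue_id].append(finding)

    ranked = []
    for issue_id, group in by_issue.items():
        # Assumption: frequency = number of distinct evaluators reporting it
        frequency = len({f.evaluator for f in group})
        mean_severity = sum(f.severity for f in group) / len(group)
        ranked.append({
            "issue_id": issue_id,
            "heuristic": group[0].heuristic,
            "mean_severity": mean_severity,
            "frequency": frequency,
            "score": mean_severity * frequency,
        })

    # Highest score first; ties broken alphabetically for stable output
    ranked.sort(key=lambda row: (-row["score"], row["issue_id"]))
    return ranked


# Made-up sample data from three evaluators
findings = [
    Finding("A", "no-undo-on-delete", "User control and freedom", 4),
    Finding("B", "no-undo-on-delete", "User control and freedom", 3),
    Finding("C", "no-undo-on-delete", "User control and freedom", 4),
    Finding("A", "mixed-button-labels", "Consistency and standards", 2),
    Finding("B", "mixed-button-labels", "Consistency and standards", 2),
    Finding("C", "tiny-footer-link", "Aesthetic and minimalist design", 1),
]

fix_list = prioritise(findings)
for row in fix_list:
    print(f'{row["score"]:5.1f}  {row["issue_id"]}  ({row["heuristic"]})')
```

Averaging severity across evaluators is one design choice among several; taking the highest rating instead would push rarely spotted but serious issues further up the list.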
When to use it
Use when
- Before user testing — to surface obvious problems first and save observed-session time for harder questions.
- On a fast turnaround when there's no time to recruit users.
- Periodically (quarterly) on mature products to catch regressions and drift.
- On competitors' products to learn from their mistakes for free.
Skip when
- You need a substitute for user testing. Heuristics catch obvious issues; only users surface comprehension and motivation issues.
- Only one evaluator is available. A single expert finds about 35% of issues; 3–5 evaluators find 60–80%.
Key metrics
- Number of issues found per evaluator (typical: 30–60 on a mid-complexity product).
- Severity distribution (catastrophic / major / minor / cosmetic).
- Issue overlap between evaluators (high overlap = high confidence).
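As a rough sketch, the three metrics above could be computed from per-evaluator reports as follows. The `reports` structure and its data are hypothetical. The severity labels assume the common reading of the 0–4 scale, with 0 meaning "not a problem". Mean pairwise Jaccard similarity is one possible way to quantify overlap, chosen here for illustration rather than taken from the method itself.

```python
from collections import Counter
from itertools import combinations

# Assumed mapping of the 0-4 severity scale to labels
SEVERITY_LABELS = {
    0: "not a problem",
    1: "cosmetic",
    2: "minor",
    3: "major",
    4: "catastrophic",
}

# Hypothetical data: evaluator -> {issue_id: severity rating}
reports = {
    "A": {"no-undo-on-delete": 4, "mixed-button-labels": 2, "silent-form-error": 3},
    "B": {"no-undo-on-delete": 3, "mixed-button-labels": 2},
    "C": {"no-undo-on-delete": 4, "tiny-footer-link": 1},
}

# Metric 1: number of issues found per evaluator
issues_per_evaluator = {name: len(issues) for name, issues in reports.items()}

# Metric 2: severity distribution over merged issues,
# using the highest rating any evaluator gave each issue
worst_rating = {}
for issues in reports.values():
    for issue_id, severity in issues.items():
        worst_rating[issue_id] = max(worst_rating.get(issue_id, 0), severity)
severity_distribution = Counter(SEVERITY_LABELS[s] for s in worst_rating.values())


def jaccard(a, b):
    """Shared issues divided by all issues found by either evaluator."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Metric 3: overlap between evaluators, as mean pairwise Jaccard similarity
pairwise_overlap = {
    (x, y): jaccard(set(reports[x]), set(reports[y]))
    for x, y in combinations(sorted(reports), 2)
}
mean_overlap = sum(pairwise_overlap.values()) / len(pairwise_overlap)

print(issues_per_evaluator)
print(dict(severity_distribution))
print(round(mean_overlap, 2))
```

A mean overlap close to 1 would mean the evaluators largely agree on what is broken; a low value suggests either a broad problem space or inconsistent evaluation, and is worth a closer look before the fix list is trusted.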
Examples
- The heuristic evaluation flagged 40 issues in two days — 12 were critical.
- We run a heuristic evaluation before booking user tests so the tests focus on real behaviour, not obvious bugs.
- Five evaluators surfaced 78% of the issues a 20-user usability study later confirmed.
In practice at Makreate
Makreate's UX audits combine heuristic evaluation with analytics review and a sample of session recordings — we catch the cheap problems first and save observed-session budget for the deep ones. A recent fintech client got a 47-issue heuristic audit in week one; fixing the top 12 issues lifted task completion 14% before a single user was interviewed.
Common mistakes
- Treating it as a substitute for observed user testing.
- Using one evaluator. Reliability comes from multiple independent passes.
- Listing issues without severity ratings — every fix list becomes a noise cloud without prioritisation.
- Walking the happy path only. Most heuristic violations live in error states, edge cases, and rarely-used screens.
- Evaluators who don't know the heuristics. "Expert" means trained on the framework, not just senior.
Frequently asked
How many evaluators?
3–5 is the standard recommendation. One evaluator finds ~35% of issues; three find ~60%; five find ~75%, with diminishing returns after that.
Which heuristics?
Nielsen's 10 for general interface evaluation. For mobile, add Bastien & Scapin's criteria. For accessibility, layer in WCAG 2.1 — but that's a separate audit, not a substitute.
Can the product team do this on their own product?
Yes, but cross-team evaluation surfaces more — internal evaluators have blind spots from too much context.