Retrieval Evaluation: A Practical Guide (2026)

You cannot tune what you cannot measure. "We tweaked the hybrid weights and shipped" produces silent regressions because nobody had a measurement that showed what broke. Retrieval eval is the discipline that makes every other decision -- model choice, chunking, hybrid weights, reranker -- safe to make.

This guide covers the metrics worth knowing, the golden-set discipline that keeps them honest, click-stream bias, LLM-as-judge with bias controls, and the holdout-vs-replay decision. The full chapter is Chapter 8 of Semantic Search in Production.

The metrics that matter

Recall@k: of the truly relevant docs, what fraction landed in the top k? Use when "did we surface the right thing at all" is the question. Best primary metric for retrieval (as distinct from ranking).
MRR (Mean Reciprocal Rank): 1 / position of the first relevant doc, averaged over all queries. Useful when only the top result matters (question-answering).
NDCG@k: rewards relevant docs more when they're higher in the list. Each doc's gain is discounted by the log of its rank, and the total is normalized against the ideal ordering so scores fall between 0 and 1. Best primary metric for ranking quality (as distinct from retrieval).
Precision@k: of the top k results, what fraction are relevant? Useful for fixed-position UIs (the top 5 are always shown).
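
To make the four definitions concrete, here is a minimal Python sketch. The function names, the doc ids, and the choice of a log2(rank + 1) discount for NDCG are assumptions made for illustration; the discount is a common convention, not something this guide prescribes. The sketch also assumes a ranked list contains no duplicate doc ids.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the k displayed slots that hold a relevant doc."""
    return len(set(ranked[:k]) & relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant doc, or 0.0 if none was returned."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """gains maps doc id -> graded relevance; docs not listed count as 0."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    actual = dcg([gains.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical single query: three relevant docs, two of them retrieved.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(recall_at_k(ranked, relevant, 5))
print(precision_at_k(ranked, relevant, 5))   # 0.4
print(reciprocal_rank(ranked, relevant))     # 0.5
print(ndcg_at_k(ranked, {d: 1 for d in relevant}, 5))
```

MRR is then the mean of reciprocal_rank over every query in the eval set.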

None of these directly measures user satisfaction. They measure aspects of it. The mapping from "NDCG went up" to "users are happier" is not automatic; you need correlation studies (or click-stream evidence) to know the mapping is real.

The golden set: the load-bearing artifact

A golden set is a curated list of (query, relevant docs) pairs you score against. It is the single most load-bearing artifact in a retrieval system -- and the most commonly stale.

What makes a good golden set:

Reflects current user intent. The query distribution matches what users are actually searching for, not what they searched for at launch.
Stratified by query class. Literal-phrase queries, conceptual queries, brand-and-size queries, navigational queries -- each gets a portion. Otherwise a metric improvement on the dominant class hides a regression on a minority class.
Has multiple correct answers per query. Single-answer golden sets penalize systems that found a different-but-also-correct doc.
Updates on a schedule. Quarterly is a minimum; monthly is better for fast-moving corpora.
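
One way to keep the stratification visible is to store the query class on every golden entry and report the metric per class instead of as one global number. The sketch below assumes a hypothetical `search(query)` callable that returns ranked doc ids; the class names and record layout are illustrative, not a fixed schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenQuery:
    query: str
    query_class: str      # e.g. "literal", "conceptual", "navigational"
    relevant: frozenset   # several acceptable doc ids, not just one

def recall_by_class(golden, search, k=10):
    """Mean recall@k per query class, so a minority-class regression stays visible."""
    per_class = defaultdict(list)
    for entry in golden:
        top_k = set(search(entry.query)[:k])
        per_class[entry.query_class].append(
            len(top_k & entry.relevant) / len(entry.relevant)
        )
    return {cls: sum(scores) / len(scores) for cls, scores in per_class.items()}

# Hypothetical golden set and a stubbed search function.
golden = [
    GoldenQuery("return policy", "navigational", frozenset({"doc-returns"})),
    GoldenQuery("shoes for wide feet", "conceptual", frozenset({"doc-a", "doc-b"})),
]
fake_results = {
    "return policy": ["doc-returns", "doc-faq"],
    "shoes for wide feet": ["doc-a", "doc-z"],
}
print(recall_by_class(golden, lambda q: fake_results[q], k=2))
```

Here the navigational class scores 1.0 while the conceptual class scores 0.5, a gap that a single averaged number would blur.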

The 40-query golden set trap

Most teams build a 40-query golden set at launch. It serves them well for three months. By month nine the corpus has shifted, user behavior has shifted, the queries that drive revenue are not in the set, and the eval is measuring something nobody cares about. The metric stays flat; quality rots.

The fix is to treat the golden set as a living artifact: sample real query logs monthly, label new queries (or have an LLM label them with human spot-checks), retire queries that no longer reflect real traffic. This is not glamorous work; it is the work that keeps everything else honest.
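
A refresh cycle along these lines can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name, the idea that queries are already normalized strings, and the `min_hits` retirement threshold. Labeling the sampled queries (by humans or an LLM with spot-checks) happens outside this sketch.

```python
import random
from collections import Counter

def refresh_plan(golden_queries, recent_log, n_new=50, min_hits=1, seed=0):
    """Return (to_retire, to_label) for one refresh cycle.

    golden_queries: set of normalized query strings currently in the golden set
    recent_log:     list of normalized query strings from the last period,
                    one entry per search, so frequent queries repeat
    """
    traffic = Counter(recent_log)
    # Retire golden queries that no longer appear in real traffic.
    to_retire = sorted(q for q in golden_queries if traffic[q] < min_hits)
    # Shuffle the raw log, then dedupe: frequent queries are more likely
    # to land near the front and therefore more likely to be sampled.
    unseen = [q for q in recent_log if q not in golden_queries]
    random.Random(seed).shuffle(unseen)
    to_label = list(dict.fromkeys(unseen))[:n_new]
    return to_retire, to_label

# Hypothetical month of traffic.
golden_queries = {"red shoes", "old promo code"}
recent_log = ["red shoes", "blue hat", "blue hat", "green scarf"]
to_retire, to_label = refresh_plan(golden_queries, recent_log, n_new=10)
print(to_retire)   # ['old promo code']
print(to_label)
```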

Click-stream bias

Using user clicks as ground truth feels free but is a trap. Users click on what's shown, not what's best. A system that shows mediocre results in slot 1 will get clicks on slot 1 -- not because the result is good but because it's there. Click-stream data is correlated with relevance, not equivalent to it.

Mitigations:

Position-aware weighting: clicks lower in the list are stronger relevance signals than clicks at the top, because the user had to skip past the higher slots to make them.
Pair click data with explicit feedback (thumbs, satisfaction surveys) so you can correlate.
Run interleaving experiments (mix results from system A and system B in the same ranked list; see which the user picks).
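
Two of these mitigations are easy to sketch. The first function weights a click by the inverse of an examination probability per slot; the probabilities shown are made-up placeholders and would have to be estimated from your own traffic. The second is team-draft interleaving, one common interleaving scheme, named here as an assumption since the guide does not specify a variant.

```python
import random

# Hypothetical chance that a user even looks at each slot (0-based positions).
EXAMINE_PROB = [1.0, 0.6, 0.4, 0.3, 0.2]

def weighted_click(position):
    """Inverse-propensity weight: a click in a rarely examined slot counts for more."""
    return 1.0 / EXAMINE_PROB[position]

def team_draft_interleave(ranking_a, ranking_b, length, rng):
    """Merge two rankings; returns (merged list, {doc: "A" or "B"})."""
    merged, team = [], {}
    picks = {"A": 0, "B": 0}
    sources = {"A": ranking_a, "B": ranking_b}
    while len(merged) < length:
        # The side with fewer picks drafts next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice("AB")
        candidate = next((d for d in sources[side] if d not in team), None)
        if candidate is None:
            # This side has nothing new left, so fall back to the other one.
            side = "B" if side == "A" else "A"
            candidate = next((d for d in sources[side] if d not in team), None)
            if candidate is None:
                break  # both rankings exhausted
        merged.append(candidate)
        team[candidate] = side
        picks[side] += 1
    return merged, team

def credit_clicks(clicked_docs, team):
    """Count clicks per side; the side with more clicks wins this impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins

merged, team = team_draft_interleave(
    ["a", "b", "c"], ["b", "d", "e"], length=4, rng=random.Random(0)
)
print(merged, credit_clicks(["d"], team))
```

Because each side drafts its own highest remaining doc, a click on a merged result can be credited to the system that contributed it, regardless of which slot it ended up in.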

LLM-as-judge, with bias controls

Asking an LLM to score (query, doc) pairs for relevance lets golden-set labeling scale beyond what human labelers can cover. It also imports the LLM's biases: position bias (favors the first-listed candidate), verbosity bias (favors longer answers), familiarity bias (favors recognizable entities).

The discipline that makes LLM-as-judge useful:

Randomize order. If you're comparing two candidates, randomize which one is "A" and which is "B" per query.
Define relevance precisely. "Is this relevant?" gets noisy judgments. "Does this doc directly answer the user's question? Yes/no/partial." gets cleaner ones.
Cross-check with human labels on a sample. The LLM judge should agree with humans at a high enough rate that you can trust it to run unmonitored.
Don't use the same LLM you're building with. Self-judging produces self-flattering scores.
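
The first and third controls can be sketched without committing to any particular model. In the code below, `judge(query, first, second)` is a hypothetical callable that wraps whatever LLM you use and returns "first" or "second". Cohen's kappa is used as the agreement measure; that is one common chance-corrected choice, not something this guide mandates, and the threshold you accept is yours to set.

```python
import random

def compare_with_random_order(judge, query, doc_x, doc_y, rng):
    """Run a pairwise judgment with the presentation order randomized.

    Returns "x" or "y", mapped back from the judge's "first"/"second" answer.
    """
    swap = rng.random() < 0.5
    first, second = (doc_y, doc_x) if swap else (doc_x, doc_y)
    picked_first = judge(query, first, second) == "first"
    if swap:
        return "y" if picked_first else "x"
    return "x" if picked_first else "y"

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label lists (e.g. LLM vs human)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical labels on a human-reviewed sample, using the yes/no/partial scale.
human = ["yes", "no", "partial", "yes"]
llm = ["yes", "no", "yes", "yes"]
print(round(cohen_kappa(human, llm), 3))
```

With randomized order, a judge that always favors the first-listed candidate splits its wins roughly evenly between the two systems instead of systematically rewarding one of them.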

Holdout vs replay

Two ways to evaluate a system change:

Holdout: a fixed eval set; you run it before and after the change. Stable, repeatable, easy to compare across changes.
Replay: replay a slice of recent production traffic through the new system, compare results to what production actually served. Catches issues holdouts miss.

Use both. Holdouts catch metric regressions; replay catches "the system started behaving differently on the queries we care about."
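
As a rough sketch of how the two checks differ in code: the holdout check needs relevance labels and reports a metric delta, while the replay check needs no labels and only flags queries whose results changed a lot. The function names, the use of recall@k, the Jaccard overlap measure, and the 0.5 threshold are all assumptions for illustration; `search_old` and `search_new` are hypothetical callables returning ranked doc ids.

```python
def holdout_delta(golden, search_old, search_new, k=10):
    """Mean recall@k before and after a change on a fixed eval set.

    golden: list of (query, set_of_relevant_doc_ids)
    """
    def mean_recall(search):
        scores = [len(set(search(q)[:k]) & rel) / len(rel) for q, rel in golden]
        return sum(scores) / len(scores)
    before, after = mean_recall(search_old), mean_recall(search_new)
    return before, after, after - before

def replay_divergence(served_log, search_new, k=10, threshold=0.5):
    """Flag replayed queries whose new top k overlaps little with what was served.

    served_log: list of (query, doc_ids_production_actually_returned)
    """
    flagged = []
    for query, served in served_log:
        old, new = set(served[:k]), set(search_new(query)[:k])
        union = old | new
        jaccard = len(old & new) / len(union) if union else 1.0
        if jaccard < threshold:
            flagged.append((query, jaccard))
    return flagged

# Hypothetical systems, stubbed as lookup tables.
old_results = {"q1": ["a", "b"], "q2": ["c", "d"]}
new_results = {"q1": ["a", "b"], "q2": ["x", "y"]}
golden = [("q1", {"a"}), ("q2", {"c"})]
print(holdout_delta(golden, old_results.get, new_results.get, k=2))
print(replay_divergence(list(old_results.items()), new_results.get, k=2))
```

A flagged replay query is not automatically a regression; it is a prompt for someone to look at what changed.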

Want the full chapter?

Semantic Search in Production Chapter 8 covers every metric in depth, the golden-set living-artifact discipline, click-stream bias mitigations, the LLM-as-judge harness with full bias controls, holdout-vs-replay tradeoffs, and the correlation-study work that maps offline metrics to online satisfaction.

Published by Yaw Labs.
