If you cannot measure your retrieval quality, every other discipline is theater. Eval first; tune second.
You cannot tune what you cannot measure. The reason "we tweaked the hybrid weights and shipped" produces silent regressions is that nobody had a measurement that told them what they broke. Retrieval eval is the discipline that makes every other decision -- model choice, chunking, hybrid weights, reranker -- safe to make.
This guide covers the metrics worth knowing, the golden-set discipline that keeps them honest, click-stream bias, LLM-as-judge with bias controls, and the holdout-vs-replay decision. The full chapter is Chapter 8 of Semantic Search in Production.
None of these metrics directly measures user satisfaction; each measures an aspect of it. The mapping from "NDCG went up" to "users are happier" is not automatic; you need correlation studies (or click-stream evidence) to know the mapping is real.
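As a concrete reference point for the NDCG figure mentioned above, here is a minimal sketch of NDCG@k using the common linear-gain form (gain equals the graded relevance, discount equals log2 of rank + 1). The function names, the 0-3 grading scale, and the toy labels are assumptions for illustration, not taken from the chapter.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k):
    """NDCG@k: DCG of the system's ranking divided by the best achievable DCG.

    ranked_relevances: grades of the returned docs, in the order returned.
    judged_relevances: grades of every labeled doc for the query (any order).
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels for one query: the system returned five docs,
# and those five happen to be all the labeled docs for that query.
returned = [3, 0, 2, 1, 0]
score = ndcg_at_k(returned, judged_relevances=returned, k=5)
print(f"NDCG@5 = {score:.4f}")
```

The number only says how close the ranking is to the ideal ordering of the labeled documents; whether a gain in it tracks user satisfaction is the separate question raised above.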
A golden set is a curated list of (query, relevant docs) pairs you score against. It is the single most load-bearing artifact in a retrieval system -- and the most commonly stale.
What makes a good golden set:
Most teams build a 40-query golden set at launch. It serves them well for three months. By month nine the corpus has shifted, user behavior has shifted, the queries that drive revenue are not in the set, and the eval is measuring something nobody cares about. The metric stays flat; quality rots.
The fix is to treat the golden set as a living artifact: sample real query logs monthly, label new queries (or have an LLM label them with human spot-checks), and retire queries that no longer reflect real traffic. This is not glamorous work; it is the work that keeps everything else honest.
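One refresh cycle might be sketched as follows. The data shapes (a dict from query to labeled doc IDs, a list of logged query strings), the normalization rule, and the function names are assumptions for illustration; the labeling step itself, human or LLM-assisted, happens outside this function.

```python
import random

def normalize(query):
    """Assumed normalization: trim whitespace and lowercase."""
    return query.strip().lower()

def refresh_golden_set(golden, recent_log, sample_size=20, seed=0):
    """Return (kept, retired, to_label) for one refresh cycle.

    golden: dict mapping query -> set of relevant doc IDs (already labeled).
    recent_log: list of query strings from the latest logging window.
    """
    seen = {normalize(q) for q in recent_log}
    # Keep golden queries that still show up in real traffic; retire the rest.
    kept = {q: docs for q, docs in golden.items() if normalize(q) in seen}
    retired = sorted(q for q in golden if q not in kept)
    # Shuffle raw log entries so frequent queries are more likely to be drawn first.
    covered = {normalize(q) for q in kept}
    pool = [normalize(q) for q in recent_log if normalize(q) not in covered]
    random.Random(seed).shuffle(pool)
    to_label = []
    for q in pool:
        if len(to_label) >= sample_size:
            break
        if q not in to_label:
            to_label.append(q)
    return kept, retired, to_label

# Hypothetical golden set and log window.
golden = {"reset password": {"doc-12"}, "fax setup": {"doc-90"}}
recent_log = ["Reset password", "export csv", "export csv", "sso login"]
kept, retired, to_label = refresh_golden_set(golden, recent_log)
print(retired, sorted(to_label))
```

Drawing from raw log entries rather than from distinct queries biases the sample toward high-traffic queries, which is usually what you want the eval to reflect; a stratified sample would be needed to also cover the long tail.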
Using user clicks as ground truth feels free but is a trap. Users click on what's shown, not what's best. A system that shows mediocre results in slot 1 will get clicks on slot 1 -- not because the result is good but because it's there. Click-stream data is correlated with relevance, not equivalent to it.
Mitigations:
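One widely used mitigation, offered here as an illustration rather than as the chapter's own list, is inverse propensity weighting: each click is up-weighted by the inverse of the probability that a user examined that slot at all. The sketch below assumes you already have per-position examination propensities (for example from a randomization experiment); the propensity values, tuple layout, and function name are hypothetical.

```python
from collections import defaultdict

def ipw_click_scores(impressions, propensity):
    """Position-corrected click rate per (query, doc).

    impressions: iterable of (query, doc_id, position, clicked), position 1-based.
    propensity: dict position -> assumed probability the slot was examined.
    """
    weighted_clicks = defaultdict(float)
    shown = defaultdict(int)
    for query, doc_id, position, clicked in impressions:
        key = (query, doc_id)
        shown[key] += 1
        if clicked:
            # A click in a rarely examined slot counts for more.
            weighted_clicks[key] += 1.0 / propensity[position]
    return {key: weighted_clicks[key] / shown[key] for key in shown}

# Hypothetical propensities and log: doc A sits in slot 1, doc B in slot 3.
propensity = {1: 1.0, 2: 0.5, 3: 0.25}
log = (
    [("q", "A", 1, True)] * 2 + [("q", "A", 1, False)] * 2
    + [("q", "B", 3, True)] * 1 + [("q", "B", 3, False)] * 3
)
scores = ipw_click_scores(log, propensity)
print(scores)
```

With raw click-through rate, A (2 of 4) beats B (1 of 4); after weighting, B scores higher because its single click came from a slot users rarely reach. The estimate is only as good as the propensities fed into it.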
Asking an LLM to score (query, doc) pairs for relevance is a way to scale golden-set labeling beyond what human labelers can cover. It also imports the LLM's biases: position bias (favors the first-listed candidate), verbosity bias (favors longer answers), and familiarity bias (favors recognizable entities).
The discipline that makes LLM-as-judge useful:
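A common control for position bias, sketched here under assumptions rather than as the chapter's harness, is to ask the judge about each pair in both presentation orders and keep only verdicts that survive the swap. The `judge` callable stands in for a real LLM call; its signature (returning "first" or "second") and the two stub judges are hypothetical.

```python
def debiased_preference(judge, query, doc_a, doc_b):
    """Query the judge in both orders; accept only order-invariant verdicts."""
    forward = judge(query, doc_a, doc_b)   # doc_a is listed first
    backward = judge(query, doc_b, doc_a)  # doc_b is listed first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "inconsistent"  # verdict flipped with order: treat as unlabeled

def always_first(query, first, second):
    """Stub judge with pure position bias."""
    return "first"

def overlap_judge(query, first, second):
    """Stub judge that prefers the doc sharing more terms with the query."""
    terms = set(query.lower().split())
    def score(doc):
        return len(terms & set(doc.lower().split()))
    return "first" if score(first) >= score(second) else "second"

query = "reset password"
doc_a = "how to reset your password"
doc_b = "billing faq"
biased_verdict = debiased_preference(always_first, query, doc_a, doc_b)
fair_verdict = debiased_preference(overlap_judge, query, doc_a, doc_b)
print(biased_verdict, fair_verdict)
```

The rate of "inconsistent" verdicts is itself a useful health signal for the judge. This check addresses position bias only; verbosity and familiarity bias need their own controls.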
Two ways to evaluate a system change:
Use both. Holdouts catch metric regressions; replay catches "the system started behaving differently on the queries we care about."
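The replay side can be sketched as a top-k diff, assuming replay means re-running a fixed set of logged queries through both the old and the new system. The `old_search` and `new_search` callables, the Jaccard overlap measure, and the 0.6 threshold are illustrative assumptions.

```python
def replay_diff(queries, old_search, new_search, k=5, min_overlap=0.6):
    """Replay each query through both systems and flag large top-k changes.

    old_search / new_search: callables mapping a query to a ranked list of doc IDs.
    Returns a list of (query, jaccard_overlap) for queries below min_overlap.
    """
    flagged = []
    for q in queries:
        old_top = set(old_search(q)[:k])
        new_top = set(new_search(q)[:k])
        union = old_top | new_top
        overlap = len(old_top & new_top) / len(union) if union else 1.0
        if overlap < min_overlap:
            flagged.append((q, round(overlap, 2)))
    return flagged

# Hypothetical systems backed by fixed result tables.
old_results = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
new_results = {"q1": ["a", "c", "b"], "q2": ["x", "y", "z"]}
flagged = replay_diff(["q1", "q2"], old_results.get, new_results.get, k=3)
print(flagged)
```

A flagged query is not necessarily a regression; it is a query where behavior changed enough that someone should look at the results side by side. Set overlap ignores reordering within the top k, so a rank-aware measure is worth adding if order matters for your use case.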
Semantic Search in Production Chapter 8 covers every metric in depth, the golden-set living-artifact discipline, click-stream bias mitigations, the LLM-as-judge harness with full bias controls, holdout-vs-replay tradeoffs, and the correlation-study work that maps offline metrics to online satisfaction.
Semantic Search in Production
The book on hybrid search and RAG retrieval. Twelve chapters. PDF + EPUB. Free updates as the field moves. $39 one-time, secure checkout.
Published by Yaw Labs.