The classic semantic-search failure mode is not catastrophic. Nothing crashes. No alarm fires. The system you shipped six months ago is quietly worse this quarter than it was last quarter, and the only signal is a product manager pulling 200 queries into a spreadsheet on a Friday afternoon and scoring them by hand.

This is drift. It comes in three kinds, all of them silent and most of them gradual, and the discipline that catches them is wiring up metrics that fire on changes in the shape of the data rather than on errors. The full treatment is Chapter 10 of Semantic Search in Production.

The three kinds of drift

Corpus drift

The documents in your index change. New product categories get added; old ones get retired; the writing style shifts because the content team turned over. The embedding model is the same; the queries are the same; the index is full of different stuff.

Symptom: recall@k on existing eval queries stays flat, but real user queries (especially navigational ones for new content) start missing.
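
One cheap way to watch for this is to compare the mix of content in the index between two snapshots. The sketch below is only an illustration: it assumes every document carries a coarse label such as a product category, and it uses total variation distance as the comparison. The function names and the sample labels are hypothetical.

```python
from collections import Counter


def category_shares(labels):
    """Turn a non-empty list of per-document labels into {label: share}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(baseline_labels, current_labels):
    """Total variation distance between two label distributions.

    0.0 means the mix is identical; 1.0 means no overlap at all.
    """
    p = category_shares(baseline_labels)
    q = category_shares(current_labels)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


# Hypothetical snapshots: the label attached to each indexed document.
launch = ["laptops"] * 60 + ["phones"] * 40
today = ["laptops"] * 40 + ["phones"] * 40 + ["wearables"] * 20

drift = total_variation(launch, today)
print(f"corpus mix shifted by {drift:.2f}")
```

Tracked daily, a number like this moves when the index fills up with different stuff, even while recall on the old eval queries stays flat.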

Query drift

Users start searching for things they didn't search for before. New product launches, seasonal patterns, the cultural moment shifting: the queries hitting your system are not the queries you tested against at launch.

Symptom: your eval set looks healthy. Click-through rates on real traffic drop. Support tickets mention "search doesn't find X" where X is something nobody was looking for a year ago.
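
A rough way to put a number on this is to measure how many recent queries have no close neighbour among the queries you tested against. The sketch below assumes you keep embeddings for a baseline query sample and for a recent sample. The 0.8 similarity cut-off, the function names, and the toy vectors are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def novel_query_rate(baseline_vecs, recent_vecs, min_similarity=0.8):
    """Share of recent queries with no close neighbour in the baseline set.

    min_similarity is an assumed cut-off; tune it per embedding model.
    """
    if not recent_vecs:
        return 0.0
    novel = 0
    for q in recent_vecs:
        best = max((cosine(q, b) for b in baseline_vecs), default=0.0)
        if best < min_similarity:
            novel += 1
    return novel / len(recent_vecs)


# Toy 2-d embeddings standing in for real query vectors.
baseline = [[1.0, 0.0], [0.9, 0.1]]
recent = [[1.0, 0.05], [0.0, 1.0], [0.1, 0.9], [0.95, 0.0]]

print(novel_query_rate(baseline, recent))
```

The comparison is quadratic in the sample sizes, so this is meant for a daily sample of queries rather than the full log.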

Model drift (the upgrade-without-migration case)

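You upgrade the embedding model on the write path. Old vectors from the old model are still in the index. The two embedding spaces are not comparable, so a query encoded by one model cannot reliably match vectors written by the other. Recall craters on whichever part of the corpus was embedded by a different model than the one encoding the query. See re-embedding strategy for the migration patterns.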

Symptom: a step-function drop in retrieval quality timed to a deploy that "shouldn't have changed anything."
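
If each vector is stored alongside the name of the model that produced it, a mixed index is easy to spot. The sketch below assumes that metadata exists. The `embedding_model` field name and the version labels are hypothetical, not something any particular vector store guarantees.

```python
from collections import Counter


def model_version_mix(records, version_field="embedding_model"):
    """Return {version: share of vectors} for the records in an index."""
    counts = Counter(r.get(version_field, "unknown") for r in records)
    total = sum(counts.values())
    return {version: n / total for version, n in counts.items()} if total else {}


def is_mixed(mix):
    """True when vectors from more than one model version share the index."""
    return len(mix) > 1


# Hypothetical metadata rows; the field name and version labels are made up.
records = (
    [{"id": i, "embedding_model": "v1"} for i in range(3)]
    + [{"id": i, "embedding_model": "v2"} for i in range(3, 4)]
)

mix = model_version_mix(records)
print(mix, "mixed" if is_mixed(mix) else "consistent")
```

Records without the field are counted as "unknown", which is itself worth alerting on: vectors with no recorded provenance cannot be ruled out as the source of a drop.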

The silent-quarter failure mode

The hardest version is when all three are happening at once at small rates. Each contributes 1-2 percentage points of recall loss per month. Six months later you've lost 10+ points. No deploy caused it; no metric crossed a threshold; nobody noticed until the cumulative effect was unignorable.

The mitigation is wiring continuous monitoring that surfaces the gradual case, not just the catastrophic case.

What to monitor

The alarm shape

None of these metrics should page on a single bad day. Drift is gradual; page-worthy drift is a sustained trend. The right alarm shape is "metric X has been Y% below its trailing 30-day average for N consecutive days." That catches sustained regression without firing on noise.
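
That alarm shape can be sketched in a few lines. The version below is a minimal illustration, not a production alerting rule: the defaults for Y (5%) and N (7 days) are assumptions, and `series` is taken to be one metric value per day, oldest first.

```python
def sustained_drop(series, pct_below=5.0, consecutive_days=7, window=30):
    """Return True when the metric has sat more than pct_below percent under
    its trailing `window`-day average for `consecutive_days` days in a row.

    series: daily metric values, oldest first.
    """
    streak = 0
    for day in range(window, len(series)):
        trailing = series[day - window:day]
        average = sum(trailing) / window
        threshold = average * (1 - pct_below / 100)
        if series[day] < threshold:
            streak += 1
            if streak >= consecutive_days:
                return True
        else:
            streak = 0  # a single recovered day resets the count
    return False


# Thirty healthy days, then either one bad day or a sustained regression.
healthy = [0.80] * 30
one_bad_day = healthy + [0.60] + [0.80] * 6
sustained = healthy + [0.72] * 7

print(sustained_drop(one_bad_day))
print(sustained_drop(sustained))
```

Note that the trailing average absorbs the bad days as the streak goes on, so a very slow decline can stay under the threshold. Comparing against a frozen baseline as well is one way to cover that case.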

What to do when an alarm fires

  1. Identify which drift. Look at corpus shape first (cheap to check), then query distribution, then model version distribution. Each has a different fix.
  2. Quantify the impact. Is recall down 2 points on tail queries or 15 points on the dominant cluster? The fix path differs.
  3. Pick the response. Corpus drift may need a re-tuning of hybrid weights; query drift may need a query-rewriting pass; model drift needs a re-embedding migration.
  4. Update the eval set. Drift that's worth fixing is drift the eval set should reflect going forward.
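
For step 2, it helps to break recall down by query segment rather than looking at one global number. The sketch below assumes each eval query is tagged with a segment and has a known set of relevant document ids. The segment names, the ids, and the function names are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def recall_by_segment(results, k=10):
    """Average recall@k per query segment.

    results: list of (segment, retrieved_ids, relevant_ids) tuples.
    """
    totals = {}
    for segment, retrieved, relevant in results:
        score = recall_at_k(retrieved, relevant, k)
        total, count = totals.get(segment, (0.0, 0))
        totals[segment] = (total + score, count + 1)
    return {seg: total / count for seg, (total, count) in totals.items()}


def recall_delta(baseline, current):
    """Per-segment change in recall, in percentage points."""
    return {seg: round((current[seg] - baseline[seg]) * 100, 1)
            for seg in baseline if seg in current}


# Hypothetical eval runs: same queries, before and after the regression.
before = [("head", ["a", "b"], ["a", "b"]), ("tail", ["c", "x"], ["c", "d"])]
after = [("head", ["a", "b"], ["a", "b"]), ("tail", ["x", "y"], ["c", "d"])]

delta = recall_delta(recall_by_segment(before), recall_by_segment(after))
print(delta)
```

A loss concentrated in one segment points at a targeted fix, while an even loss across segments points at something global such as a model change.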

Want the full chapter?

Semantic Search in Production Chapter 10 covers all three drift modes in depth, the monitoring stack with full metric definitions, the alarm-shape patterns, and the runbook for each drift mode with concrete steps.

Semantic Search in Production

The book on hybrid search and RAG retrieval. Twelve chapters. PDF + EPUB. Free updates as the field moves. $39 one-time, secure checkout.

Published by Yaw Labs.
