Semantic Search in Production

A 4:17pm Slack message on a Friday. A product manager at a retail customer had spent her afternoon doing what the eng team kept promising to automate: she'd pulled 200 queries from last week's logs, opened the live site in one tab and a staging build in another, and scored relevance side by side. Recall@10 on brand-and-size queries -- "Patagonia R1 men's medium" -- had dropped from 94% in September to 71% on the live system. The eval rig that should have caught it was a 40-query golden set from launch, none of them brand-and-size queries. The system had been silently hurting paying customers for two weeks before a non-engineer noticed.

That's the silent-drift quarter, and it's the work this book is about. The eval rig that catches a regression on a Tuesday morning instead of waiting for a Friday-afternoon spreadsheet. The hybrid retrieval that you tune, then tune again when "tuned" turns out to be a moving target. The re-embedding plan that exists before the model gets upgraded.

Twelve chapters on the discipline that separates a search system that gets better over time from one that quietly degrades. Discipline-first, opinionated, and full of war stories.

PDF + EPUB. Free updates as the field moves. Secure checkout. Read Chapter 1 free.

What you'll fix

Every one of these has happened on a real production system -- mine, a friend's, or a team I've audited. The book tells you what to design for and what alarm to wire before the next one.

  • The silent regression on the literal-phrase tail. Recall on brand-and-size queries craters under a hybrid-weight tweak; conceptual queries look great; the average looks fine; the revenue-bearing queries don't. Chapters 6, 8. Sketched in code right after this list.
  • The 40-query golden set from launch that no longer reflects anything. Your eval scores look stable. Your relevance has rotted. The corpus shifted, the user intent shifted, the golden set didn't. Chapter 8.
  • The chunker that worked for docs and breaks on transcripts. "What did Sarah decide about the launch date?" lands half in one chunk and half in the next. Both halves score middling. The right answer is invisible to retrieval. Chapter 3.
  • The model upgrade that silently bilingualizes your index. ada-002 to text-embedding-3, no migration plan, half the vectors in the old space and half in the new. Recall craters. Nothing throws an error. Chapter 10.
  • The HNSW index that was right at launch and is wrong at 4x the corpus. ef_search tuned for a snapshot, never revisited. p95 doubles; you can fix it, but you have to know to look. Chapter 5.
  • The reranker that started losing. It improved NDCG by 4 points six months ago. The corpus shifted, the eval set is stale, the reranker is now hurting on a meaningful slice you can't see. Chapter 7.
  • The vector-store outage at 3am with no fallback. You promised the business 99.9% search availability; the vendor's SLA is 99.5%. The architecture you didn't think about at launch is the one you need now. Chapter 11.
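
The cheapest defense against the first failure on this list is to score the golden set per slice instead of in aggregate, so a crater on brand-and-size queries can't hide behind a healthy average. A minimal sketch, assuming a search(query, k) function that returns ranked doc IDs and an eval set labeled with relevant docs and a slice tag -- the names here are hypothetical, not the book's code:

```python
# Slice-level recall@k check -- a sketch under the assumptions above.
from collections import defaultdict

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def sliced_recall(eval_set, search, k=10):
    """Report recall@k per slice as well as overall, so one slice can't hide behind the average."""
    per_slice = defaultdict(list)
    for q in eval_set:  # q = {"query": str, "relevant": [doc_ids], "slice": str}
        retrieved = search(q["query"], k)
        per_slice[q["slice"]].append(recall_at_k(retrieved, q["relevant"], k))
    report = {name: sum(vals) / len(vals) for name, vals in per_slice.items()}
    all_vals = [v for vals in per_slice.values() for v in vals]
    report["overall"] = sum(all_vals) / len(all_vals)
    return report

# Wire the alarm on the slices, not the average: fail the run if any slice
# drops below its floor even when "overall" still looks healthy.
# for slice_name, floor in {"brand-and-size": 0.90}.items():
#     assert report[slice_name] >= floor, f"recall@10 regression on {slice_name}"
```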

What's in the twelve chapters

Part 1 - Foundations

  • Chapter 1. Why semantic search - the keyword ceiling, the "just use a vector DB" trap, the hybrid imperative, and the eval problem. The vocabulary the rest of the book uses.
  • Chapter 2. Picking an embedding model - closed vs open, dimensions, MRL, quantization, cost-vs-quality at scale, domain fit, and the model you can leave in place for eighteen months. The "I'll fine-tune later" trap.
  • Chapter 3. Chunking strategies that survive production - fixed, sentence-aware, structural, semantic. Overlap, parent-document retrieval, the small-chunk-for-retrieval pattern. Metadata you can't recover later.

Part 2 - The substrate

  • Chapter 4. The vector DB landscape - pgvector, Turbopuffer, Pinecone, Weaviate, Qdrant, LanceDB, OpenSearch, Vespa. When each one is the right answer. Hosted vs self-hosted economics at three corpus sizes. Filter performance - the part nobody benchmarks publicly.
  • Chapter 5. Indexing tradeoffs - HNSW, IVF, IVFPQ, ScaNN, DiskANN, flat. Recall vs latency vs RAM vs build time, with concrete numbers. Tuning HNSW without burning a week. Designing for a rebuild before you need one.
  • Chapter 6. Hybrid search - BM25 plus dense, fused well. RRF and where it falls short. Score normalization, per-query weighting, sparse-as-filter vs sparse-as-signal, single-call vs two-pass architectures. A fusion sketch follows this list.
  • Chapter 7. Reranking - cross-encoders, LLM rerankers, ColBERT-style late interaction. When reranking earns 200ms. The reranker-changes-the-answer-to-the-wrong-answer failure mode and how to catch it.
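
As a taste of Chapter 6, here is a minimal reciprocal rank fusion sketch -- hypothetical names, and it assumes the BM25 and dense results already exist as ordered lists of doc IDs. RRF scores each document by summing 1/(k + rank) across the input rankings, which is why it needs no score normalization and also why it throws away score magnitude:

```python
# Minimal reciprocal rank fusion (RRF) -- a sketch, not the book's implementation.
# Assumes each ranking is an already-ordered list of doc IDs.
def rrf_fuse(rankings, k=60):
    """score(d) = sum over input lists of 1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Usage, with hypothetical inputs: fused = rrf_fuse([bm25_ids, dense_ids])
```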

Part 3 - The discipline

  • Chapter 8. Evaluating search - recall@k, MRR, NDCG, and which one actually predicts user satisfaction. Building a golden set that doesn't go stale. Click-stream bias. LLM-as-judge with bias controls. Holdout vs replay.
  • Chapter 9. Query understanding - rewriting, expansion, intent classification. HyDE. Multi-query strategies and the latency tax. Acronym and entity normalization. The query-side preprocessing pipeline nobody documents.
  • Chapter 10. Drift, re-embedding, and the model-migration problem - the silent failure mode. Why every embedding system needs a re-embedding plan before it ships. Full rebuild vs dual-index vs lazy. Detecting drift before users do. A first-line guard is sketched after this list.
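
One cheap first-line guard for the Chapter 10 failure -- the half-old, half-new index from the list further up -- is to record the embedding model next to every vector and raise the alarm when the index holds more than one. A hypothetical sketch, assuming your vector store exposes per-vector metadata; the field and function names are illustrative:

```python
# Guard against a mixed embedding space -- a sketch under the assumptions above.
EXPECTED_MODEL = "text-embedding-3-large"  # illustrative; pin whatever you actually serve

def check_embedding_space(vector_metadata):
    """Count vectors per embedding model and raise if the index mixes spaces."""
    counts = {}
    for meta in vector_metadata:  # meta = {"embedding_model": str, ...}
        model = meta.get("embedding_model", "unknown")
        counts[model] = counts.get(model, 0) + 1
    if set(counts) != {EXPECTED_MODEL}:
        raise RuntimeError(f"index mixes embedding spaces: {counts}")
    return counts

# Run it on every ingest batch and on a schedule: a migration that stalls halfway
# shows up here as two model names long before recall craters in production.
```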

Part 4 - Production

  • Chapter 11. Serving at scale - latency budgets, three layers of caching with three invalidation problems, multi-tenant isolation, hot-reloading models, capacity planning, the 3am playbook.
  • Chapter 12. What's next - multimodal retrieval, domain fine-tuning, learned sparse, long-context-vs-retrieval. The bets I'd make today and the parts of the field about to be obsoleted.

Who it's for

You've built backend services. You know what a 99th-percentile latency violation looks like in your monitoring. You've stood up at least one search system - maybe a v0 over pgvector, maybe a Pinecone-backed feature, maybe a product-search rewrite at an e-commerce shop. You've seen "users are saying search is bad" land in your inbox. You're somewhere between mid and senior on the IC ladder, or a tech lead who needs to make build-vs-buy calls about retrieval infrastructure.

You don't need a deep theory of embeddings. The book assumes you understand vectors-as-points-in-space at the level of "I've put some in a database and queried by cosine similarity." If you're at that level or past it, you're at the right starting line.

What this book is not: an introduction to embeddings (good intros exist on the open web - start there), a vector-DB feature-comparison spreadsheet (they rot too fast to be useful), or a research-frontier survey (the leading edge isn't yet boring enough to bet on in production).

Companion volumes

Semantic Search in Production is Volume III of the Yaw Labs Production Series. Volume I, MCP in Production, is about shipping Model Context Protocol servers - the protocol-and-server perspective. Volume II, Claude Code in Production, is about running the agentic harness that calls those servers - the operator's perspective. Volume III is about the substrate the agent reaches into when it needs to find something: retrieval as a discipline. Volume IV, A2A in Production (early access), is what happens when one agent becomes a fleet - federation, auth across hops, and partial failure.

What's in the box

  • The book in PDF and EPUB.
  • Free updates as the field moves - new embedding models, vector DB shifts, the eval techniques that beat LLM-as-judge.
  • Auto-invite to the private YawLabs/semantic-search-in-production-companion repo - starter code, exercises, and worked solutions at module-N-final tags for each hands-on chapter.

FAQ

Do I need to have read Volumes I and II?

No. Each volume in the Production Series stands on its own. Volume III is about retrieval; if you're not building MCP servers or running Claude Code, you don't need the other two to make sense of this one.

What if the embedding model landscape shifts six months from now?

Updates ship as the field moves. Chapter 2's specific model recommendations are pinned to the writing date and noted as such; the principles for picking a model port across generations. When voyage-4 or text-embedding-4 ships and changes the cost-quality picture meaningfully, the chapter gets a revision and you get the update.

Is there a print edition?

Not yet. If a print edition happens, it will come later; the digital version is the canonical, living one and keeps getting updates either way.

How do the companion-repo invites work?

You enter your GitHub username at checkout. The order webhook fires an invite to that user, adding you as a collaborator to the private companion repo. You should see an email from GitHub with the accept-invitation link within a few minutes. If you don't get the invite within an hour, email contact@yaw.sh with the order ID and the GitHub username you want invited.

Buy Semantic Search in Production

Twelve chapters. PDF + EPUB. Free updates. $39 one-time, secure checkout.

Enter your GitHub username at checkout and you'll be auto-invited to the private companion repo with starter code and worked solutions for every chapter's Try-this section. It's optional: skip it if you don't have a GitHub account or want to add it later, then email contact@yaw.sh with your order ID and GitHub username and we'll send the invite.

Companion volumes: MCP in Production, Claude Code in Production, and A2A in Production. Together with this volume they cover the agentic-tooling stack: protocol, operator, retrieval substrate, and the multi-agent fleet.