A 4:17pm Slack on a Friday. The product manager at an outdoor-gear retailer had spent her afternoon doing the thing eng kept promising to automate and never had: pulled two hundred queries from the previous week's logs, opened the live site in one tab and a staging build in another, and clicked through, scoring relevance by hand. The Slack message was the spreadsheet.

The numbers were not subtle. On queries that named a specific brand and a specific size - Patagonia R1 men's medium, Salomon X Ultra 4 size 11 - recall@10 had dropped from 94% to 71% in fourteen days. Broad conceptual queries were unchanged. The regression was in the literal-phrase tail, which was 30% of volume and a much larger share of revenue, because those were the queries shoppers ran when they already knew what they wanted to buy.

An engineer had read a paper, bumped the dense-retrieval weight in the hybrid fusion, A/B'd it on conceptual queries (where it looked great), and rolled it out. The eval rig that should have caught the regression was a stale golden set from launch - forty queries, all conceptual, none with brand-and-size. The system had been silently hurting paying customers for two weeks before a non-engineer with a spreadsheet noticed.
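To make the mechanism concrete, here is a minimal sketch - invented scores and weights, not the retailer's actual numbers - of how raising the dense weight in a linear hybrid blend can flip a literal-phrase ranking:

```python
# Hypothetical illustration of a weighted hybrid blend; the scores and
# weights below are invented for this example.

def hybrid_score(bm25_score: float, dense_score: float, dense_weight: float) -> float:
    """Blend a lexical (BM25) score and a dense-retrieval score.

    Assumes both inputs are already normalized to [0, 1]; in practice
    that normalization is its own problem.
    """
    return (1 - dense_weight) * bm25_score + dense_weight * dense_score

# A brand-and-size query: the exact product scores high on BM25,
# while the dense model sees it as just another similar jacket.
exact_match = {"bm25": 0.95, "dense": 0.55}
near_miss = {"bm25": 0.40, "dense": 0.70}  # similar product, wrong size

for w in (0.4, 0.8):  # before and after the weight bump
    s_exact = hybrid_score(exact_match["bm25"], exact_match["dense"], w)
    s_miss = hybrid_score(near_miss["bm25"], near_miss["dense"], w)
    print(f"dense_weight={w}: exact={s_exact:.2f} near-miss={s_miss:.2f}")
```

At the lower weight the exact match wins; at the higher weight the near-miss edges past it. Multiply by two weeks of brand-and-size traffic and you get the spreadsheet.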

The three months of rebuilding eval discipline that followed - not the original launch - is the work this book is about.

Buy Semantic Search in Production $39

What's in the twelve chapters

Part 1 - Foundations. Why semantic search exists as a thing distinct from keyword search and what gap it actually fills (the keyword ceiling, the just-use-a-vector-DB trap, the hybrid imperative, the eval problem). Picking an embedding model that survives eighteen months in production - closed vs open, dimensions, MRL, quantization, the I'll-fine-tune-later trap. Chunking strategies that don't lose metadata you can't recover.

Part 2 - The substrate. The vector DB landscape - pgvector, Turbopuffer, Pinecone, Weaviate, Qdrant, LanceDB, OpenSearch, Vespa - and when each one is the right answer. Indexing tradeoffs (HNSW, IVF, IVFPQ, ScaNN, DiskANN) with concrete numbers, not benchmarks-laundered-as-marketing. Hybrid search done seriously: BM25 plus dense, RRF and where it falls short, score normalization, two-pass architectures. Reranking - cross-encoders, LLM rerankers, ColBERT-style late interaction, and the failure mode where the reranker changes the answer to the wrong answer.
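For a taste of what Part 2 means by RRF, here is a minimal sketch of reciprocal rank fusion - the standard formulation with the usual k=60 smoothing constant, document IDs invented:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc accumulates 1 / (k + rank) per list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc_a", "doc_b", "doc_c"]
dense_top = ["doc_c", "doc_a", "doc_d"]
print(rrf([bm25_top, dense_top]))  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF's strength is that it never compares raw scores across systems, only ranks - which is also where it falls short: a document one retriever scores overwhelmingly higher gets no extra credit for the margin.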

Part 3 - The discipline. Evaluating search - recall@k, MRR, NDCG, and which one actually predicts user satisfaction. Building a golden set that doesn't go stale. Click-stream bias. LLM-as-judge with bias controls. Query understanding (rewriting, expansion, HyDE, multi-query strategies and the latency tax). And the silent failure mode: drift, re-embedding, and the model-migration problem nobody warns you about. Full rebuild vs dual-index vs lazy. Why every embedding system needs a re-embedding plan before it ships.
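For orientation, a minimal sketch of two of the metrics Part 3 opens with - recall@k and MRR - over a toy golden set, with invented query results:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(results: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank of the first relevant doc per query."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)

golden = [
    (["d3", "d1", "d9"], {"d1"}),  # relevant doc at rank 2
    (["d7", "d8", "d2"], {"d5"}),  # relevant doc missed entirely
]
print(recall_at_k(golden[0][0], golden[0][1], k=3))  # 1.0
print(mrr(golden))                                   # (1/2 + 0) / 2 = 0.25
```

NDCG adds graded relevance and rank discounting on top of this; which of the three actually predicts user satisfaction is the chapter's real question.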

Part 4 - Production. Serving at scale - latency budgets, three layers of caching with three invalidation problems, multi-tenant isolation, hot-reloading models, capacity planning, the 3am playbook. Then what's next - multimodal retrieval, domain fine-tuning, learned sparse, long-context-vs-retrieval. The bets I'd make today and the parts of the field about to be obsoleted.

Why this book exists

The embedding APIs are good. The vector databases are real. The quickstart tutorials get you to a v0 in thirty minutes. None of them tell you what happens between "my pgvector index returns results" and "I run a system real users trust on day 180, where relevance hasn't drifted under a model upgrade, the eval set still reflects what users actually search for, the hybrid weights are tuned on real traffic, and the re-embedding plan exists and has been rehearsed."

The gap is not bridged by another tutorial. It's bridged by discipline - the kind that catches the regression on a Tuesday morning instead of waiting for a Friday-afternoon spreadsheet.

Who it's for

You've built backend services. You've stood up at least one search system - a v0 over pgvector, a Pinecone-backed feature, a product-search rewrite at an e-commerce shop. You've seen "users are saying search is bad" land in your inbox. You're somewhere between mid and senior on the IC ladder, or a tech lead who needs to make build-vs-buy calls about retrieval infrastructure.

You don't need to know the math behind embeddings. The book assumes you understand vectors-as-points-in-space at the level of "I've put some in a database and queried by cosine similarity." If you're past that, you're at the right starting line.

Not for: introductions to embeddings (good intros exist on the open web), vector-DB feature comparison spreadsheets (they rot too fast), or research-frontier surveys (the leading edge isn't yet boring enough to bet on in production).

Companion volumes

Semantic Search in Production is Volume III of the Yaw Labs Production Series. Volume I, MCP in Production, is the protocol-and-server perspective on the tools agents call. Volume II, Claude Code in Production, is the operator's view of running an agent. Volume III is the substrate the agent reaches into when it needs to find something. Volume IV, A2A in Production (early access), is what happens when one agent becomes a fleet.

What's in the box

Want to read before you buy? Chapter 1 is free. The shape of the book matches the shape of the chapter.

Twelve chapters on the discipline of retrieval.

PDF + EPUB. Free updates as the field moves. $39 one-time, secure checkout.

Buy Semantic Search in Production $39

Published by Yaw Labs.
