The keyword ceiling, the "just use a vector DB" trap, the hybrid imperative, and the eval problem. The vocabulary the rest of the book uses.
The Slack message landed at 4:17pm on a Friday in March 2025. It was from a product manager at an outdoor-gear retailer whose product-search system I had helped ship the previous fall, and she had not gone through engineering. She had spent her Friday afternoon doing the thing the eng team kept promising to automate and never had: pulled two hundred queries from the previous week's logs, opened the live site in one tab and a staging build in another, and clicked through side-by-side, scoring relevance. The Slack message was the spreadsheet.
The numbers were not subtle. On queries that mentioned a specific brand and a specific size -- "Patagonia R1 men's medium," "Salomon X Ultra 4 size 11" -- recall@10 had dropped from 94% in her September baseline to 71% on the live system. Broad conceptual queries ("warm jacket for shoulder season") were unchanged. The regression was in the literal-phrase tail, which was roughly 30% of volume and a much larger share of revenue, because those were the queries shoppers ran when they already knew what they wanted to buy.
The change had shipped fourteen days earlier. An engineer read a paper, bumped the dense-retrieval weight in the hybrid fusion, A/B'd it on conceptual queries (where it looked great), and rolled it out. The eval rig that should have caught the regression was a stale golden set from launch -- forty queries, all conceptual, none with brand-and-size. The live system had been silently hurting paying customers for two weeks before a non-engineer with a spreadsheet noticed.
I rolled the change back that evening; recall on the literal-phrase queries snapped to 92%. The PM, to her credit, was not interested in litigating who broke what. She wanted to know why the system that was supposed to tell us about regressions had not. The three months of rebuilding the eval discipline that followed -- not the original launch -- is the work I learned the most from.
This book is about the work after the v0. The hybrid retrieval that you tune until "tuned" turns out to be a moving target. The eval rig that catches the regression on a Tuesday morning instead of waiting for a Friday-afternoon spreadsheet. The discipline is the part nobody warned me about.
But before any of that, I owe you the answer to a more basic question: why does semantic search exist as a thing distinct from keyword search, and what is the actual gap it fills?
Keyword search -- BM25, TF-IDF, the family of algorithms that an Elasticsearch or OpenSearch cluster runs by default -- is not bad. It is a remarkable piece of information-retrieval engineering, the result of fifty years of iteration, and on the queries it handles well, it is fast, cheap, and explainable in ways that no embedding-based system has yet matched. If your queries are mostly literal -- product SKUs, error codes, exact phrases, named entities -- keyword search will outperform any vector index you can buy or build, on every axis that matters: latency, cost, recall, and the ability to tell a user why a given result ranked where it did.
The problem is not that keyword search is bad. The problem is that keyword search has a ceiling, and the ceiling is the gap between the words a user types and the words the right document contains.
Three failure modes account for almost all of it.
The vocabulary mismatch. The user types "install hangs after disk select." The document says "the installer appears to pause during target disk inspection." Apart from "disk," the two share essentially no tokens, yet the phrases are about the exact same problem, written by people who did not consult each other. BM25 sees next to no match and ranks the document somewhere on page four.
The compositional query. The user types "lightweight running shoes for flat feet under $100." The product catalog has the right shoe in it, but no single document contains all five concepts as keywords. Some documents have "lightweight" and "running"; one has "flat foot support"; one has "$89.99". Keyword search returns a chaotic mix of partial matches, and the right product is buried because no single document scores high on the full query.
The conceptual question. The user types "what was Q3 revenue." The right document is the Q3 earnings report, which says "third-quarter results" and "$48.2M" and never uses the word "revenue" because the financial team prefers "net sales." Keyword search has no notion that "Q3" and "third-quarter" are the same thing, or that "revenue" and "net sales" are roughly the same thing in this context.
The standard responses to each of these are well-known and partial. For vocabulary mismatch you add synonym dictionaries -- which work, until you discover that "freezes" can mean "pauses" or it can mean "permanently locked," and the synonym is right in some contexts and wrong in others. For compositional queries you add query expansion -- which works, until the expansions explode the query into something the index can't reasonably score. For conceptual questions you write rules -- which works, for the rules you wrote, and fails for every conceptual question you didn't anticipate.
These responses are patches. They are reasonable patches; some of them are necessary even in semantic-search systems. But each one is bounded in the same way: you are trying to bridge the user's language and the document's language by enumerating the bridges. The number of bridges is unbounded. The patches always lag.
The pitch for semantic search is that the bridge can be learned, not enumerated. An embedding model trained on a large corpus has, in some functional sense, seen "install hangs" and "installer pauses" appear in similar contexts often enough that their embeddings are near each other in vector space. A query that says one of them is close, in the geometry, to a document that says the other. You do not write the synonym. The model has, in effect, written it for you.
That is the pitch. It is real. The first time you watch a vector index return the right document for a query that shares zero words with it, the experience is genuinely a little uncanny. It is also, as I will spend the next two hundred pages explaining, much less than half of the system you actually need.
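The uncanny moment is cheap to reproduce. Here is a minimal sketch, assuming the open-source sentence-transformers package and its all-MiniLM-L6-v2 checkpoint; both are illustrative choices for a ten-line demo, not recommendations (Chapter 2 is about choosing a model for real).

```python
# Minimal sketch: the learned bridge across a vocabulary mismatch.
# Assumes `pip install sentence-transformers`; the model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query     = "install hangs after disk select"
relevant  = "the installer appears to pause during target disk inspection"
unrelated = "return policy for worn footwear"

q, r, u = model.encode([query, relevant, unrelated], normalize_embeddings=True)

print("query vs relevant doc: ", util.cos_sim(q, r).item())
print("query vs unrelated doc:", util.cos_sim(q, u).item())
# The relevant document should score well above the unrelated one despite
# sharing almost no vocabulary with the query -- the bridge BM25 cannot see.
```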
There is a class of blog post and conference talk that has been roughly the same since 2023. The structure is: a problem (search is hard), a hero (embeddings), a quickstart (here is how to put your documents into Pinecone or pgvector), a demo (look how good the results are on this one query), and a closing line about how vector search has changed everything. These talks are not wrong. They are incomplete in a specific and consequential way.
What they leave out is what happens when you replace a keyword index with a vector index naively.
Vector search, on its own, is bad at things keyword search is good at. This is not a controversial claim in the literature; it is well-documented. ANN-only retrieval underperforms BM25 on a meaningful fraction of common-shape queries: exact phrases, product SKUs and model numbers, error codes, named entities, and the brand-and-size queries from the opening anecdote.
A team that switches from keyword to vector search and tells me their search "is now better" is sometimes telling me that it is better on the queries the demo highlighted, and worse on a long tail of queries the demo didn't show. Their relevance numbers, if they have them, are usually averaged in a way that hides the regressions. The retailer in the opening anecdote did get worse, on a measurable subset of queries, when the engineer over-weighted the dense retriever in the fusion. The brand-and-size queries fell off a cliff. The SKU queries fell off a cliff. The product pages that shoppers used to find by typing exact phrases were now buried under thematically-similar listings that didn't actually contain the item they wanted to buy.
The fix was not to go back to keyword search. The fix was to combine them. Which is the next section.
If you take one thing from this chapter, take this: production semantic search is not a vector-DB system. It is a hybrid retrieval system in which a vector DB is one component. The other component is, almost always, a sparse retriever -- BM25 or a learned sparse model. The two retrievers run in parallel; their results are fused; the fused list is reranked; the top of the reranked list is what the user sees.
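To make that shape concrete, here is a minimal sketch of the fusion step using reciprocal rank fusion. The two ranked lists are hard-coded stand-ins for what a sparse arm and a dense arm would return for one query; Chapter 6 treats fusion properly, and this is only the shape of it.

```python
# Minimal sketch of hybrid fusion via reciprocal rank fusion (RRF).
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; k=60 is the conventional smoothing constant."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical retriever outputs, best match first.
sparse_hits = ["sku-4812", "sku-0031", "sku-2290"]   # BM25 arm: nails the literal match
dense_hits  = ["sku-0031", "sku-7754", "sku-4812"]   # dense arm: thematic neighbours

candidates = rrf_fuse([sparse_hits, dense_hits])
print(candidates)   # fused candidate list; this is what goes to the reranker
```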
The reason for this is not theoretical. It is empirical, and it is unanimous across the production teams I have talked to in the last two years. Every team I have spoken with that has shipped semantic search at scale, and that has measured their relevance, has ended up with a hybrid system. Every team I have spoken with that has shipped a vector-only system has, within a quarter or two, started layering sparse signals back in -- usually by way of metadata filters, then by way of a sparse retriever proper.
The teams that haven't gone hybrid are, in my experience, the teams that haven't measured their relevance yet, or that haven't been in production long enough for the long tail of queries to bite.
The hybrid imperative is the throughline of this book. Chapter 6 is the chapter on hybrid retrieval specifically -- how you fuse the two retrievers, where reciprocal rank fusion (RRF) is the right choice and where it is the lazy choice, how to handle per-query weighting, the difference between sparse-as-filter and sparse-as-signal. But the imperative shows up everywhere. Chapter 4's vector-DB selection criteria are different if you've accepted that you'll need a sparse retriever alongside it. Chapter 8's evaluation methodology has to account for both retrievers. Chapter 11's latency budget is different because you are running two retrievers in parallel, not one in serial.
There is a question buried in this section that I want to surface explicitly: if hybrid retrieval is the eventual answer, why not just build a hybrid system from day one and skip the vector-only phase? The honest answer is that you can, and in 2026 you probably should. The reason the vector-only phase keeps happening is historical: vector DBs in 2022-2023 didn't have great hybrid support, the integration was clunky, and many teams shipped a vector-only v0 because it was the cheapest path to a working demo. Today, a Postgres-plus-pgvector stack gives you full-text and vector search in the same database; Turbopuffer was hybrid from day one; the major managed services all have native hybrid support. There is no longer a good reason to ship vector-only on purpose. There is, however, a continuing reason to ship vector-only by accident, which is that the introductory tutorials still teach it that way.
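For a concrete picture of "hybrid in one database," here is a sketch of the two retrieval arms sitting side by side in Postgres. The table, columns, and connection handling are hypothetical; it assumes the pgvector extension, a tsvector column maintained on the same table, and the psycopg 3 driver.

```python
# Sketch: keyword arm and vector arm against one hypothetical Postgres table
# (`chunks`, with columns `id`, `body_tsv tsvector`, `embedding vector(384)`).
import psycopg

KEYWORD_ARM = """
    SELECT id
    FROM chunks
    WHERE body_tsv @@ plainto_tsquery('english', %(q)s)
    ORDER BY ts_rank_cd(body_tsv, plainto_tsquery('english', %(q)s)) DESC
    LIMIT 50
"""

VECTOR_ARM = """
    SELECT id
    FROM chunks
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT 50
"""

def hybrid_candidates(conn: psycopg.Connection, query_text: str, query_vec: str):
    """query_vec is the embedded query rendered as a pgvector literal, e.g. '[0.12, -0.03, ...]'."""
    with conn.cursor() as cur:
        cur.execute(KEYWORD_ARM, {"q": query_text})
        sparse = [row[0] for row in cur.fetchall()]
        cur.execute(VECTOR_ARM, {"qvec": query_vec})
        dense = [row[0] for row in cur.fetchall()]
    # Two ranked ID lists, ready for the fusion step sketched earlier.
    return sparse, dense
```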
This book teaches it the other way. By the time you finish Chapter 6 you should have a strong default of "hybrid first, vector-only never," and the rest of the chapters will assume that as the architecture.
There is a question I keep finding myself asking teams that have shipped a v0 search system and want me to look at it: how do you know it's good?
The answers I get fall into a small set, and none of them are measurements: nobody has complained, the demo went well, it feels better than the old system did.
None of these is an instrument. None of these tells you whether your system is getting better or worse over time. None of these tells you which of the three changes you shipped last week actually moved the needle. Without an instrument, you are flying on vibes, and the vibes are wrong more often than they are right.
The eval problem in semantic search is a specific problem with specific contours, and Chapter 8 is dedicated to it. But the contours are worth naming here so the rest of the book has a vocabulary.
Your dev-time relevance numbers will lie to you. The set of queries you test against in dev is a sample; the sample is biased toward queries you can think of, which are the queries that are easiest to answer. Production query distribution is heavier-tailed and weirder. Dev recall@10 of 92% is very compatible with production recall@10 of 71% on the long tail.
The golden set you build will go stale. A golden set is a curated list of (query, relevant-document) pairs that you use to score retrieval. The day you build it, it reflects the corpus and the user intent. Six months later, the corpus has new documents, the old documents have been edited, and the user intent has shifted (maybe a marketing campaign drove a new query shape). Your golden set is now scoring against a fiction. Your scores look stable. Your relevance has rotted.
Click-stream data is biased. Users click what they see. They don't click what wasn't shown. Inferring relevance from clicks systematically rewards the top of the list and punishes the parts of the list users never reached. You cannot use click-stream data alone to decide whether the right document was retrieved -- only whether, given what was retrieved, users found something useful.
LLM-as-judge is the new shiny thing and it has its own biases. Using a large model to grade retrieval results is genuinely useful. It also has known failure modes -- self-preference bias, position bias, length bias, the tendency to confidently agree with itself when wrong. A naive LLM-judge eval will tell you your retrieval is great, in a confident voice, when it isn't.
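None of the instruments you need here is expensive to build. The golden-set scoring loop, for instance, is a few lines; here is a minimal sketch, where `search` is a stand-in for whatever retrieval pipeline you are evaluating and each golden entry is a (query, set-of-relevant-doc-IDs) pair.

```python
# Minimal recall@k over a golden set. `search(query, k)` is a hypothetical
# stand-in for the pipeline under test and returns doc IDs, best first.

def recall_at_k(golden_set, search, k=10):
    """golden_set: list of (query, set_of_relevant_doc_ids) pairs."""
    per_query = []
    for query, relevant_ids in golden_set:
        retrieved = set(search(query, k=k))
        per_query.append(len(retrieved & relevant_ids) / len(relevant_ids))
    return sum(per_query) / len(per_query)

# Report this per query-shape slice (brand-and-size, SKU, conceptual), not just
# as one average: the opening anecdote's regression was invisible in the
# average and obvious in the brand-and-size slice.
```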
The honest version of the eval story, which I will spend Chapter 8 making concrete, is: you need multiple instruments, you need to refresh them on a schedule, and you need to design your eval pipeline so that you find out about regressions in days, not months. None of this is glamorous work. None of this is what gets celebrated in talks. All of this is what separates a search system that gets better over time from one that quietly degrades.
The teams I see ship the best semantic search systems are the teams who treat eval as a permanent part of the system, not a launch checklist. The teams I see ship the worst are the teams who tested on twenty queries, shipped, and assumed the curve from there was up.
I want to spend a paragraph on each of the things that break at month six, because they are the implicit chapter headings of the rest of this book.
The model gets upgraded and your index is now bilingual. OpenAI ships text-embedding-3 to replace ada-002. Cohere ships embed-v4. Voyage ships voyage-3. Each of these moves the geometry. The vectors in your index, if some of them were embedded with the old model and some with the new, no longer live in the same space. Recall craters silently. Nothing throws an error. This is Chapter 10.
Your chunking strategy was wrong for a corpus you didn't anticipate. You started with a docs corpus, and fixed-size chunks worked fine. You added a transcripts corpus, and the fixed-size chunks landed the answer to "what did Sarah decide about the launch date" half in one chunk and half in the next. Both chunks score middling on the query; neither contains the full answer. The system retrieves an irrelevant chunk that scored higher. You blame the model. The model is fine; the chunker is wrong. This is Chapter 3.
Your hybrid weights drift out of tune. You set the BM25/dense weights at launch based on a small eval. The query distribution shifts -- maybe you launched in a new region, maybe a product line was added, maybe SEO traffic patterns changed. The original weights are no longer optimal. There is no alarm for this; the system just gets worse. This is Chapter 6 and Chapter 8 together.
The reranker starts losing. You added a reranker because it improved NDCG by four points on the eval set. Six months in, the corpus has shifted, the eval set is stale, and the reranker is now hurting on a meaningful slice of queries -- it's promoting the wrong candidate from a top-50 it doesn't fully understand. Removing the reranker would help on those queries but hurt on others. You don't know which until you measure. This is Chapter 7.
The query distribution shifts. Users discover a new way to phrase a query -- maybe a competitor's marketing trained a new vocabulary, maybe an LLM-mediated UI started rephrasing user inputs -- and the new shape is one your retrieval pipeline handles badly. The old query patterns still work. The new ones don't. Your average metrics look stable; the new patterns are buried in the average. This is Chapter 9.
You hit a latency wall and don't know why. Search latency was 80ms p95 at launch. It's 240ms p95 now. The corpus has grown 4x. The HNSW index parameters that were right at launch are wrong now -- ef_search is too high for the new corpus size, or the index has too many small segments. You can fix it, but you have to know to look. This is Chapter 5 and Chapter 11.
You wake up to a vector store outage. Your managed vector DB has an incident. Search is down. You have no fallback. You promised the business 99.9% search availability and the vendor's SLA only guarantees 99.5%. The architecture you didn't think about at launch is the architecture you need at 3am. This is Chapter 11.
These are not exotic failure modes. Every one of them has happened on a real system I have shipped, audited, or had described to me in a 1:1 with a friend. Every one of them is preventable, mitigable, or detectable -- but only if you know to design for it.
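Detection is often embarrassingly cheap once you know to look. The bilingual-index failure, for instance, yields to a guard as simple as tagging every stored vector with the identifier of the model that produced it and refusing to serve queries against a mixed index. A sketch, with a hypothetical storage shape:

```python
# Sketch of a model-version guard. The in-memory record shape is hypothetical;
# a real store would carry the same tag as per-record metadata next to the vector.

QUERY_ENCODER_ID = "text-embedding-3-small"   # whatever embeds queries at serve time

index_records = [
    {"id": "doc-1", "model": "text-embedding-3-small"},
    {"id": "doc-2", "model": "text-embedding-ada-002"},   # stowaway from before the upgrade
]

def assert_single_space(records, query_encoder_id):
    """Fail loudly if stored vectors and query vectors come from different models."""
    stored = {r["model"] for r in records}
    if stored != {query_encoder_id}:
        raise RuntimeError(
            f"index holds vectors from {sorted(stored)} but queries are embedded "
            f"with {query_encoder_id!r}; they do not share a space (Chapter 10)"
        )

assert_single_space(index_records, QUERY_ENCODER_ID)   # raises here: mixed spaces detected
```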
There are, by my count, four categories of semantic-search content in the world right now.
The research papers. arXiv is full of them. Many are excellent. They are written for an audience that is trying to advance the state of the art, not for an audience that is trying to ship and maintain a production system. Reading the papers cold and trying to extract production guidance is like reading database research papers and trying to extract advice on running a Postgres instance. The papers are upstream of the production craft, not the production craft itself.
The vendor blog posts. Pinecone, Weaviate, Qdrant, Cohere, OpenAI, and a dozen others publish a steady stream of high-quality content that is, also, marketing for their products. These posts are useful and biased. The bias is not always conscious; engineers who work at a vector DB company have, by the nature of their work, a model of search that centers on the vector DB. The bias compounds when teams build their entire production architecture from the blog posts of the vendor whose database they bought.
The quickstart tutorials. "Build semantic search in 30 minutes." There are hundreds. They are decent. They get you to a v0. They almost never address what happens after, which is what this book is for.
The missing book. The thing in between. A practitioner's view of what production semantic search is actually like. War stories. What I learned shipping search systems for paying customers, getting them wrong, and fixing them. What I would tell you over a beer if you said "I'm about to ship a semantic-search feature, what should I know."
This book is that fourth thing.
It is not an introduction to embeddings. There are good intros and they're all on the open web. The book assumes you understand vectors-as-points-in-space at the level of "I've put some in a database and queried by cosine similarity."
It is not a vector-DB tutorial. Chapter 4 is opinionated about which database to pick when, but you will not find a step-by-step "how to set up Pinecone" walkthrough. The vendor docs do that better than I would, and they update when the vendor changes its API.
It is not a feature-comparison spreadsheet. The vendor landscape moves too fast for a spreadsheet to be accurate by the time you read it. The principles for choosing don't move that fast, and that's what Chapter 4 leans on.
It is not a survey of the research frontier. Chapter 12 names the directions I think are about to get production-relevant. The rest of the book is about what is already production-relevant.
It is opinionated. The opinions are earned. They will sometimes contradict the official vendor guidance, sometimes contradict popular tutorials, sometimes contradict things I've said publicly before that I've since changed my mind on. When I am opinionated I will tell you why. When I am uncertain I will tell you that too.
Here is what the rest of the book covers. Each chapter is meant to be useful on its own; you do not have to read in order, although Chapters 2 and 3 build the substrate that the rest of the book assumes.
Chapter 2: Picking an Embedding Model. Closed vs open. Dimensions, MRL, quantization. Cost-vs-quality at production scale. Domain fit. The model you can leave in place for eighteen months. The "I'll fine-tune later" trap.
Chapter 3: Chunking Strategies That Survive Production. Fixed, sentence-aware, structural, semantic. Overlap and parent-document patterns. Metadata you can't recover later. Why a docs corpus, a code corpus, and a transcripts corpus need different chunkers.
Chapter 4: The Vector DB Landscape. pgvector, Turbopuffer, Pinecone, Weaviate, Qdrant, LanceDB, OpenSearch, Vespa. When each one is the right answer. Hosted vs self-hosted economics at three corpus sizes.
Chapter 5: Indexing Tradeoffs. HNSW, IVF, IVFPQ, ScaNN, DiskANN, flat. Recall vs latency vs RAM vs build time. Tuning HNSW without burning a week. Designing for a rebuild before you need one.
Chapter 6: Hybrid Search. BM25 + dense, fused well. RRF and where it falls short. Per-query weighting. Sparse-as-filter vs sparse-as-signal. Single-call vs two-pass architectures.
Chapter 7: Reranking. Cross-encoders, LLM rerankers, ColBERT-style late interaction. When reranking earns its keep. The reranker-makes-it-worse failure mode and how to catch it.
Chapter 8: Evaluating Search. Recall@k, MRR, NDCG -- which one predicts user satisfaction. Building golden sets that don't go stale. LLM-as-judge with bias controls. Holdout vs replay.
Chapter 9: Query Understanding. Rewriting, expansion, intent classification. HyDE. Multi-query strategies and the latency tax. Acronym and entity normalization.
Chapter 10: Drift, Re-embedding, and the Model-Migration Problem. Why every embedding system needs a re-embedding plan before it ships. Full rebuild vs dual-index vs lazy strategies. Detecting drift before users do.
Chapter 11: Serving at Scale. Latency budgets. Three layers of caching, three invalidation problems. Multi-tenant isolation. Hot-reloading models. The 3am playbook.
Chapter 12: What's Next. Multimodal. Domain fine-tuning. Learned sparse. Long-context-vs-retrieval. The bets I'd make today.
We start in Chapter 2 with the first decision you make and the one with the longest half-life: which embedding model you commit to. If you have ever wondered whether to take the off-the-shelf API or self-host an open model, whether dimensions matter, what MRL gets you, or what it actually costs to embed a hundred-million-document corpus, that's where we go next.
But before you turn the page, do me a favor. Open whatever search system you currently maintain. Pull up the analytics, if you have them. Look at the queries from the last seven days that returned zero results, or where the user clicked something on page two, or where they refined the query within ten seconds of submitting it. Those are the queries your system is failing on. They are the most honest brief you will get for the rest of this book.
That is the whole pitch.