Dense vectors handle paraphrase; BM25 handles the literal-phrase tail. The art is fusing them so neither side loses.
Pure dense retrieval breaks on the literal-phrase tail. Ask for "Patagonia R1 men's medium" and a vector search confidently returns "Arc'teryx Atom LT large" -- conceptually similar, literally wrong. Pure BM25 breaks on paraphrase: the user asks "warm midlayer for ski touring," BM25 has never seen those exact tokens together, and the relevant items don't surface.
Hybrid search is the answer most production systems converge on. This guide covers the patterns that work, the fusion methods, and the failure modes that show up only when you have real query traffic. The full chapter is Chapter 6 of Semantic Search in Production.
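Dense and sparse retrieval have different strengths: dense vectors match on meaning and so handle paraphrase, while BM25 matches on exact tokens and so handles brand names, model numbers, and other literal phrases.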
The intersection of the two failure modes is small. The union of their strengths covers most real query traffic. That's the whole pitch.
Reciprocal Rank Fusion (RRF) is the default people reach for: combine two ranked lists by summing 1/(k + rank) for each document. It works because it normalizes naturally -- raw scores from BM25 and from a vector index are not on the same scale and can't be added directly.
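As a rough illustration of the formula above, here is a minimal RRF sketch. It assumes each ranked list is an array of document IDs ordered best-first and that ranks are 1-based; the function name `rrfFuse` and the sample IDs are made up for this example.

```javascript
// Reciprocal Rank Fusion over any number of ranked lists of document IDs.
// Assumption: each list is ordered best-first and ranks start at 1.
function rrfFuse(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((docId, index) => {
      const rank = index + 1;
      // Each list contributes 1 / (k + rank) for every document it contains.
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score, highest first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId, score]) => ({ docId, score }));
}

// Hypothetical result lists, for illustration only.
const bm25Ranking = ["r1-mens-m", "r1-womens-m", "atom-lt-l"];
const denseRanking = ["atom-lt-l", "r1-mens-m", "nano-air-m"];
const fused = rrfFuse([bm25Ranking, denseRanking]);
console.log(fused.map((hit) => hit.docId));
```

Only ranks enter the sum, so the raw BM25 and vector scores never need to share a scale. A document that ranks well in both lists beats one that tops a single list.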
The default k=60 is fine for "I want hybrid working today." But RRF has known limits:
k moves all queries together; you can't fix the literal-phrase tail without breaking paraphrase performance.
The next step up: normalize each side's scores to a [0, 1] range, then take a weighted sum:
final_score = alpha * normalize(dense_score) + (1 - alpha) * normalize(bm25_score)
Now alpha is a knob you can tune per query class. The challenge is what "normalize" means -- min-max over the current result set is the easy choice and works most of the time; z-score normalization handles outliers better but requires a stats baseline.
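The weighted sum can be sketched roughly as follows. This assumes each side returns an array of `{ docId, score }` hits and that a document missing from one side contributes 0 from that side; both are choices made for this example, and the function names are illustrative. The z-score helper takes a precomputed baseline and, unlike min-max, does not produce values bounded to [0, 1].

```javascript
// Min-max normalization over the current result set: maps scores into [0, 1].
function minMaxNormalize(scores) {
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  if (max === min) return scores.map(() => 0); // no spread, so no signal
  return scores.map((s) => (s - min) / (max - min));
}

// Z-score normalization against a precomputed baseline (hypothetical shape).
// Output is centered on 0 and is not bounded to [0, 1].
function zScoreNormalize(scores, { mean, stdDev }) {
  if (stdDev === 0) return scores.map(() => 0);
  return scores.map((s) => (s - mean) / stdDev);
}

function toNormalizedMap(hits) {
  const normalized = minMaxNormalize(hits.map((hit) => hit.score));
  return new Map(hits.map((hit, i) => [hit.docId, normalized[i]]));
}

// final_score = alpha * normalize(dense_score) + (1 - alpha) * normalize(bm25_score)
function weightedFuse(denseHits, bm25Hits, alpha) {
  const dense = toNormalizedMap(denseHits);
  const sparse = toNormalizedMap(bm25Hits);
  const docIds = new Set([...dense.keys(), ...sparse.keys()]);
  return [...docIds]
    .map((docId) => ({
      docId,
      // Assumption: a document absent from one side scores 0 on that side.
      score: alpha * (dense.get(docId) ?? 0) + (1 - alpha) * (sparse.get(docId) ?? 0),
    }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical hits, for illustration only.
const denseHits = [
  { docId: "a", score: 0.82 },
  { docId: "b", score: 0.78 },
  { docId: "c", score: 0.6 },
];
const bm25Hits = [
  { docId: "b", score: 12.4 },
  { docId: "d", score: 9.1 },
  { docId: "a", score: 3.2 },
];
console.log(weightedFuse(denseHits, bm25Hits, 0.5).map((hit) => hit.docId));
```

Min-max depends entirely on the result set at hand, so a single outlier compresses every other score toward 0; that is the case where a z-score against a stable baseline may behave better.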
The real unlock: classify the query (literal-phrase vs conceptual vs mixed) and pick alpha per query. A query containing a brand or model number leans heavier on BM25; a conversational paraphrase leans heavier on dense. This is where evaluation becomes load-bearing -- you cannot tune per-query weighting without a query-class-stratified eval set.
The naive implementation:
function pickAlpha(query) {
  if (looksLikeBrandOrModel(query)) return 0.3; // lean BM25
  if (looksConversational(query)) return 0.8; // lean dense
  return 0.5;
}
It's not deep ML; it's a rule-based classifier that captures the patterns you would spot by eyeballing the query log.
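The two helper predicates are not spelled out above. One possible sketch, purely as an assumption about what such heuristics might look like: the brand list, the function-word list, and the token-count threshold are all hypothetical and would need tuning against real traffic.

```javascript
// Hypothetical brand list; in practice this would likely come from the catalog.
const KNOWN_BRANDS = new Set(["patagonia", "arc'teryx"]);

// Hypothetical function words that hint at a natural-language query.
const FUNCTION_WORDS = new Set(["for", "to", "with", "that", "what", "how", "a", "the", "i"]);

function tokenize(query) {
  return query.toLowerCase().split(/\s+/).filter(Boolean);
}

// Literal-phrase signal: a known brand token, or a token mixing letters
// and digits (for example "r1"), which often marks a model number.
function looksLikeBrandOrModel(query) {
  return tokenize(query).some(
    (token) => KNOWN_BRANDS.has(token) || (/[a-z]/.test(token) && /\d/.test(token))
  );
}

// Conversational signal: at least four tokens, one of them a function word.
function looksConversational(query) {
  const tokens = tokenize(query);
  return tokens.length >= 4 && tokens.some((token) => FUNCTION_WORDS.has(token));
}

console.log(looksLikeBrandOrModel("Patagonia R1 men's medium"));
console.log(looksConversational("warm midlayer for ski touring"));
```

Because the brand check runs first in pickAlpha, a query that trips both signals leans toward BM25.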
Some vector DBs (Weaviate, Qdrant, OpenSearch) do hybrid in a single query. Others require you to fetch from sparse and dense separately and fuse on the client. The single-call path is faster and simpler -- but it only works as well as the DB's fusion logic, which is often RRF. If you need score-normalized or per-query weighting, you usually need the two-pass route.
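The two-pass route might look roughly like this. `searchSparse` and `searchDense` are hypothetical async functions supplied by the caller, not any particular database client, and each is assumed to resolve to `{ docId, score }` hits ordered best-first. RRF stands in for the fusion step here only to keep the sketch short; a score-normalized or per-query weighted fusion would slot into the same place.

```javascript
// Client-side fusion step; RRF is used here only as a stand-in.
function rrf(lists, k = 60) {
  const scores = new Map();
  for (const hits of lists) {
    hits.forEach((hit, i) => {
      scores.set(hit.docId, (scores.get(hit.docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores]
    .sort((a, b) => b[1] - a[1])
    .map(([docId, score]) => ({ docId, score }));
}

// Two-pass hybrid: query both backends, then fuse on the client.
async function twoPassHybridSearch(query, { searchSparse, searchDense, topK = 10 }) {
  // Both passes run concurrently, so latency is roughly that of the slower backend.
  const [sparseHits, denseHits] = await Promise.all([
    searchSparse(query, topK),
    searchDense(query, topK),
  ]);
  return rrf([sparseHits, denseHits]).slice(0, topK);
}

// In-memory stand-ins for real backends, for illustration only.
const fakeSparse = async () => [
  { docId: "a", score: 11.2 },
  { docId: "b", score: 7.5 },
];
const fakeDense = async () => [
  { docId: "b", score: 0.91 },
  { docId: "c", score: 0.88 },
];

twoPassHybridSearch("warm midlayer", { searchSparse: fakeSparse, searchDense: fakeDense })
  .then((hits) => console.log(hits.map((hit) => hit.docId)));
```

The cost of this route is a second network round trip and fusion code you own; the benefit is that the fusion logic is yours to change.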
Semantic Search in Production Chapter 6 covers RRF in depth, the score-normalization patterns, per-query weighting with code, single-call-vs-two-pass tradeoffs, and the eval discipline (Chapter 8) that makes any of this tuneable.
Published by Yaw Labs.