HyDE: Why Your RAG Embeddings Miss and How Hypothetical Document Embeddings Fix It

Dense retrieval works by embedding a query and finding documents with similar embeddings in a vector space. The assumption is that similar text lands near similar text. But queries and documents are not similar text. A query is a short phrase written by a user who does not know the answer. A document is a long passage written by someone who does. They use different vocabulary, different sentence length, and different register — and this difference is encoded in their embeddings.

This mismatch is the query-document gap, and it is a common reason a RAG pipeline retrieves the wrong chunks even when the right answer exists in the corpus.

HyDE (Hypothetical Document Embeddings) is a technique from the paper "Precise Zero-Shot Dense Retrieval without Relevance Labels" by Luyu Gao et al. (Carnegie Mellon University and the University of Waterloo), posted to arXiv in late 2022 and published at ACL 2023. The idea is simple: instead of embedding the query, use an LLM to generate a hypothetical document that would answer the query, then embed that. The hypothetical document lives in answer space, the same space as your corpus, so it tends to retrieve better.

The query-document gap in concrete terms

Take the query: Does chunk overlap improve RAG retrieval?

Embedded directly, this lands in a region of the vector space populated by other question-form text. It shares little vocabulary with a corpus chunk that reads:

Chunk overlap significantly improves recall for questions that span chunk boundaries, with 15–20% overlap providing the best precision-recall tradeoff.

The chunk uses terms like "overlap", "recall", "boundaries", "precision-recall tradeoff". The query uses "chunk overlap" and "improve" — two matches out of the chunk's entire vocabulary.

A standard embedding model like all-MiniLM-L6-v2 captures the semantic relationship between the two texts, but their embeddings still sit farther apart than those of two passages written in the same register. In practice this means the chunk ranks lower than it should, especially in competitive corpora where other chunks also have partial matches.
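To make "ranks lower" concrete: dense retrieval sorts chunks by cosine similarity to the query vector, and a chunk's rank is its position in that ordering. The sketch below shows that computation on plain Python lists. The function names and the two-dimensional toy vectors are illustrative assumptions, not output from a real embedding model.

```python
import math
from typing import Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_of(
    chunk_id: str,
    query_vec: Sequence[float],
    chunk_vecs: Mapping[str, Sequence[float]],
) -> int:
    """1-based rank of chunk_id when chunks are sorted by similarity to query_vec."""
    ordered = sorted(
        chunk_vecs,
        key=lambda cid: cosine_similarity(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ordered.index(chunk_id) + 1


# Toy vectors, made up purely for illustration.
toy_chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
```

Calling rank_of for the same chunk with two different query vectors (one from the raw query, one from another text) gives the kind of rank comparison this article relies on.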

What HyDE does instead

HyDE inserts one step between the query and the vector search:

query → LLM (Claude Haiku 4.5 / GPT-4o mini) → hypothetical answer → embed → search

The LLM generates a short passage that would answer the query, written in the same style as a document — not a chatbot response. A default system prompt like this produces the right style:

Write a concise 3–5 sentence passage that directly answers the question. Write as an authoritative reference document, not a chatbot. Use domain-specific vocabulary. Output only the passage text.
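The flow above can be sketched as a single function with the LLM, the embedding model, and the vector search passed in as callables. The names hyde_retrieve, generate, embed, and search are hypothetical, chosen for this sketch. In a real pipeline they would wrap your LLM client, embedding model, and vector store.

```python
from typing import Callable, List, Sequence

HYDE_SYSTEM_PROMPT = (
    "Write a concise 3–5 sentence passage that directly answers the question. "
    "Write as an authoritative reference document, not a chatbot. "
    "Use domain-specific vocabulary. Output only the passage text."
)


def hyde_retrieve(
    query: str,
    generate: Callable[[str, str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    k: int = 5,
) -> List[str]:
    """Retrieve using the embedding of a hypothetical answer, not of the query.

    generate(system_prompt, query) -> passage text   (hypothetical signature)
    embed(text) -> vector                            (hypothetical signature)
    search(vector, k) -> top-k chunk ids or texts    (hypothetical signature)
    """
    hypothetical_doc = generate(HYDE_SYSTEM_PROMPT, query)  # the one extra LLM call
    doc_vector = embed(hypothetical_doc)  # embed the passage, not the query
    return search(doc_vector, k)  # ordinary nearest-neighbour search
```

Keeping the three steps behind plain callables makes it easy to swap the LLM or embedding model without touching the retrieval logic.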

For the query above, Claude Haiku 4.5 produces something like:

Chunk overlap significantly improves retrieval recall for queries that span chunk boundaries. Research comparing chunking strategies found that 15–20% overlap provides the optimal precision-recall tradeoff, as overlapping text ensures that contextually connected information is not severed at hard boundaries. Without overlap, a fixed-size chunking strategy frequently splits semantic units mid-sentence, causing relevant information to be missed even when it exists in the corpus.

This hypothetical answer now shares substantial vocabulary with the correct chunk: "overlap", "recall", "chunk boundaries", "precision-recall tradeoff", "fixed-size", "semantic units". The embedding of this passage lands much closer to the correct chunk in the vector space.
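The vocabulary argument can be checked with a few lines of Python. The sketch below counts shared content words between the chunk and each candidate text. Lexical overlap is only a rough proxy for embedding similarity, and the tokenizer and small stopword list are assumptions made for this illustration.

```python
import re

# Minimal hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {
    "a", "as", "at", "be", "does", "for", "in", "is", "it",
    "not", "that", "the", "to", "when", "with",
}


def content_terms(text: str) -> set:
    """Lowercased alphabetic tokens with stopwords removed (no stemming)."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def shared_terms(a: str, b: str) -> set:
    """Content words that appear in both texts."""
    return content_terms(a) & content_terms(b)


query = "Does chunk overlap improve RAG retrieval?"
chunk = (
    "Chunk overlap significantly improves recall for questions that span chunk "
    "boundaries, with 15–20% overlap providing the best precision-recall tradeoff."
)
hypothetical = (
    "Chunk overlap significantly improves retrieval recall for queries that span "
    "chunk boundaries. Research comparing chunking strategies found that 15–20% "
    "overlap provides the optimal precision-recall tradeoff, as overlapping text "
    "ensures that contextually connected information is not severed at hard "
    "boundaries. Without overlap, a fixed-size chunking strategy frequently splits "
    "semantic units mid-sentence, causing relevant information to be missed even "
    "when it exists in the corpus."
)

query_overlap = shared_terms(query, chunk)
hyde_overlap = shared_terms(hypothetical, chunk)
```

Without stemming, "improve" and "improves" count as different words, so the raw query shares only "chunk" and "overlap" with the chunk, while the hypothetical passage shares several more terms.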

The result: the correct chunk goes from rank #4 under direct query embedding to rank #1 under HyDE.

When HyDE helps most

HyDE provides the largest improvements in these scenarios:

Short or sparse queries. One-phrase queries like "retry failed API requests" or "find similar protein sequences" have almost no vocabulary overlap with their answer documents. HyDE generates a rich passage that bridges the gap. The longer and more specific the query, the smaller HyDE's advantage, because a detailed query already has substantial vocabulary overlap with the answer.

Technical or domain-specific corpora. Legal text, genomics papers, clinical notes, and code documentation use highly specialized vocabulary that is unlikely to appear in a user's query. The LLM-generated hypothetical answer uses that vocabulary naturally. In the HyDE Visualizer, the Clinical Medicine and Bioinformatics examples show the largest rank improvements across all chunks.

Multi-hop or complex queries. If a query asks about a relationship between two concepts, the direct query embedding tries to represent that relationship in one vector. The hypothetical answer can develop the relationship across several sentences, giving the embedding more signal to work with.
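Since the benefit shrinks as queries get longer and more specific, one option is to pay for the extra LLM call only on short queries. The gate below is a rough heuristic sketch: the function name and the eight-word threshold are assumptions for illustration, not values from the HyDE paper, and would need tuning on your own query logs.

```python
def should_use_hyde(query: str, max_words: int = 8) -> bool:
    """Heuristic gate: route only short, sparse queries through HyDE.

    max_words is an arbitrary illustrative threshold, not a recommended value.
    """
    return len(query.split()) <= max_words
```

A word count is a crude proxy for how sparse a query is, so treat this as a starting point for experimentation only.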

When HyDE underperforms

HyDE has a known failure mode: if the LLM generates a confidently wrong hypothetical answer, it retrieves wrong chunks with high scores.

For the query Can I get my money back? in a customer support corpus, a well-calibrated LLM generates an answer about 14-day refund windows and cancellation policy. That answer retrieves the correct FAQ entries.

But if the LLM hallucinates — generating an answer about "30-day return windows" and "restocking fees" that do not match the actual policy — the retrieval will score chunks that contain those incorrect details highly. HyDE is only as reliable as the LLM generation step.

This tradeoff is visible in the HyDE Visualizer: for the same query, switching from the pre-built hypothetical answer to a custom-prompted one changes which chunks rank first. The embedding model (all-MiniLM-L6-v2 or BGE-Small-en-v1.5) does not fact-check the hypothetical document — it only measures semantic similarity.

Implementing HyDE in LangChain and LlamaIndex

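LangChain does not have a dedicated HyDE retriever (its HypotheticalDocumentEmbedder class wraps an embedding model instead), but the manual pattern is only a few lines:

from langchain_anthropic import ChatAnthropic
from langchain_huggingface import HuggingFaceEmbeddings

llm = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=300)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

query = "Does chunk overlap improve RAG retrieval?"

# 1. Generate the hypothetical answer document.
hyde_answer = llm.invoke(f"Write a passage that answers: {query}").content
# 2. Embed the hypothetical document instead of the query, then search.
#    `vectorstore` is assumed to be an existing store indexed with the same embedding model.
hyde_vector = embeddings.embed_query(hyde_answer)
results = vectorstore.similarity_search_by_vector(hyde_vector, k=5)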

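LlamaIndex has a native HyDEQueryTransform, which TransformQueryEngine applies in front of any query engine. Here index is assumed to be an existing VectorStoreIndex:

from llama_index.core.indices.query.query_transform.base import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Generate a hypothetical document per query and keep the original query text as well.
hyde = HyDEQueryTransform(include_original=True)
# Wrap a standard query engine so every query passes through the HyDE transform first.
base_engine = index.as_query_engine(similarity_top_k=5)
query_engine = TransformQueryEngine(base_engine, query_transform=hyde)
response = query_engine.query("Does chunk overlap improve RAG retrieval?")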

The include_original=True flag keeps the original query among the strings that are embedded alongside the hypothetical document, so the final query embedding reflects both. This is a safer default that softens the failure mode of pure HyDE.
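A simple way to hedge against a misleading hypothetical document is to blend its embedding with the embedding of the original query. The HyDE paper's own formulation averages the query vector together with several sampled hypothetical documents. A minimal sketch, assuming embeddings are plain lists of floats of equal length. The function name is hypothetical.

```python
from typing import List, Sequence


def blended_query_vector(
    query_vec: Sequence[float],
    hypothetical_vecs: Sequence[Sequence[float]],
) -> List[float]:
    """Element-wise mean of the query vector and all hypothetical-document vectors."""
    vectors = [query_vec, *hypothetical_vecs]
    dim = len(query_vec)
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

The blended vector can be passed to any search-by-vector call in place of the pure HyDE vector. Sampling more than one hypothetical document reduces the influence of any single bad generation, at the price of extra LLM calls.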

The cost of one extra LLM call

HyDE adds one LLM call per query. At Claude Haiku 4.5 list pricing ($1.00/M input + $5.00/M output tokens at the time of writing), a typical hypothetical document of ~100 words (roughly 135 output tokens, plus about 60 input tokens for the prompt and query) costs roughly $0.0007 per query. That is about $7 per 10,000 queries, or around $2,200/month at 100,000 queries/day.

GPT-4o mini is cheaper: at $0.15/M input + $0.60/M output, the same token counts come to approximately $0.90 per 10,000 queries.
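The arithmetic behind these estimates is easy to reproduce. The sketch below takes per-million-token rates and token counts as inputs. The function name and the example token counts (60 input, 135 output) are assumptions, so substitute your measured counts and your provider's current rates.

```python
def hyde_cost_per_query(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    """Dollar cost of one HyDE generation call, given rates per million tokens."""
    total = input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    return total / 1_000_000


# GPT-4o mini rates quoted above; the token counts are illustrative assumptions.
per_query = hyde_cost_per_query(60, 135, input_rate_per_m=0.15, output_rate_per_m=0.60)
per_10k_queries = per_query * 10_000
per_month = per_query * 100_000 * 30  # 100,000 queries/day over a 30-day month
```

Because output tokens are priced several times higher than input tokens, the length of the hypothetical document dominates the cost, which is one reason to cap it at a few sentences.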

The HyDE Visualizer computes this breakdown automatically after generating a live hypothetical answer, based on your actual token counts.

Comparing embedding models

The magnitude of HyDE's improvement depends partly on the embedding model. all-MiniLM-L6-v2 (22 MB, fast) and BGE-Small-en-v1.5 (29 MB, stronger on technical text) both show HyDE improvements, but the rank order of individual chunks can differ between models. The HyDE Visualizer lets you switch between both models after running an analysis — the LLM generation is cached, so only the similarity computation reruns.

For production use, OpenAI's text-embedding-ada-002 and text-embedding-3-small have higher capacity than small open models, but HyDE can still help because the underlying query-document gap is a property of the input texts, not the model.

What to try next

Load the HyDE Visualizer, pick the RAG Research or Python Code pre-built example, and run the comparison. Look at the rank delta badges — which chunks moved up, which moved down, and what bridge terms explain the shift. Then try the same query with your own corpus to see whether HyDE is worth adding to your pipeline.