Why RAG Retrieval Fails: Chunk Size, Overlap, and the text-embedding-ada-002 Token Limit

Most RAG failures get blamed on the LLM. The model hallucinated. The model didn't follow the context. The model gave a generic answer. But in many cases the LLM is fine: the problem is that the retrieved chunks didn't contain the answer in the first place.

The two most common causes: chunk size is wrong (too small or too large), and overlap is zero.

What happens when chunks are too small

Consider a technical document about an API endpoint. You've chunked it at 128 tokens with no overlap. A chunk might look like this:

The rate limit applies to all endpoints under /v2/. Authenticated
requests use a sliding 60-second window. Each token counts against the
quota of the authenticating user, not the application account.

That's a coherent, self-contained fact. 128 tokens works here.

But most documents aren't uniformly structured. The next chunk might be:

See the previous section for context on how token quotas are calculated.
Exceeding the limit returns HTTP 429. The Retry-After header indicates

This chunk starts mid-thought, references "the previous section" (which is now a different chunk), and ends mid-sentence. It contains useful information, but a cosine similarity search against a user question like "what status code does rate limiting return?" will likely score this chunk lower than it deserves, because the "HTTP 429" fact shares the chunk with a dangling cross-reference and a truncated sentence, and both pull the embedding away from the question.

The result: your retrieval returns k=3 results that look relevant by keyword but are missing critical context.

Rule of thumb: below 100 tokens, chunks are rarely self-contained enough for meaningful retrieval. The range 256–512 tokens is where most RAG practitioners land for general-purpose documents. Go higher (512–1024) for legal or medical text, where paragraphs are naturally dense.

What happens when chunks are too large

Going too large has the opposite problem: diluted relevance.

text-embedding-ada-002 produces a single 1536-dimensional vector for an entire chunk. You can think of that vector as a blend of all the semantic content in the chunk. A 1,000-token chunk about database backup policies, monitoring dashboards, and alert configuration will produce a vector that's partly "about" all three topics, and therefore not optimally close to any specific query.

When you ask "how do I configure alerts?", a 1,000-token chunk that contains alert configuration alongside 700 tokens of unrelated content will score lower than a 300-token chunk that's focused entirely on alert configuration.
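The dilution effect can be illustrated with a toy model. The sketch below is not how text-embedding-ada-002 works: it uses plain word-count vectors as a stand-in for embeddings (an assumption made so the example runs without an API key), and both chunk texts are invented for the illustration. It only shows the direction of the effect: unrelated content in the same chunk lowers the cosine score against a focused query.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical chunk texts, made up for this example.
focused = "To configure alerts, open the alerts page and set a threshold for each alert rule."
diluted = (
    "Database backups run nightly and are retained for thirty days. "
    "Monitoring dashboards show latency, throughput, and error rates. "
    + focused
)

query = bag_of_words("how do I configure alerts")
focused_score = cosine(query, bag_of_words(focused))
diluted_score = cosine(query, bag_of_words(diluted))
print(f"focused: {focused_score:.3f}  diluted: {diluted_score:.3f}")
```

The diluted chunk contains the focused text verbatim, yet it scores lower because the extra words enlarge the vector without adding any overlap with the query.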

Large chunks also hit token limits. text-embedding-ada-002 accepts at most 8,191 input tokens. Depending on your client, an oversized input is either rejected with an error or truncated or split before it reaches the model, sometimes silently. If a 10,000-token chunk is truncated, the last ~1,800 tokens are dropped entirely when the embedding is computed. The chunk appears in your index with an embedding that doesn't represent its full content.

You can check your token counts before indexing using tiktoken:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # for ada-002

def check_chunks(chunks: list[str], limit: int = 8191, min_tokens: int = 50) -> None:
    """Print a warning for each chunk that is over the model limit or very small."""
    for i, chunk in enumerate(chunks):
        tokens = len(enc.encode(chunk))  # token count under cl100k_base
        if tokens > limit:
            print(f"Chunk {i}: {tokens} tokens: EXCEEDS LIMIT, will be rejected or truncated")
        elif tokens < min_tokens:
            print(f"Chunk {i}: {tokens} tokens: too small for meaningful retrieval")

Or use the RAG Chunk Inspector to see the token distribution visually before writing any indexing code.

The overlap problem

Zero overlap is the default in most tutorials and the wrong default in most real pipelines.

Consider a paragraph that spans a chunk boundary:

--- CHUNK 1 END ---
...The system uses RSA-256 for signing all tokens. The public key is
rotated every 24 hours at midnight UTC.

--- CHUNK 2 START ---
After rotation, old keys remain valid for a 6-hour grace period to
allow in-flight tokens to expire naturally...

A user asks: "how long are old signing keys valid after rotation?"

The answer — "6-hour grace period" — is in Chunk 2. But Chunk 2 starts with "After rotation," with no antecedent for what's being rotated. The cosine similarity between the question and Chunk 2 may be lower than it would be if the chunk had context. Meanwhile, Chunk 1 mentions key rotation but doesn't answer the question.

With 15% overlap on 512-token chunks, Chunk 2 starts inside Chunk 1's content, repeating roughly the last 75 tokens of Chunk 1. The phrase "RSA-256 for signing all tokens. The public key is rotated every 24 hours at midnight UTC" now appears at the start of Chunk 2. The embedding of Chunk 2 now includes that context, and retrieval scores improve.
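A sliding window with overlap can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `chunk_tokens` and its parameters are hypothetical, and it operates on a list of token IDs (for example the output of a tokenizer's `encode` call) rather than on raw text.

```python
def chunk_tokens(
    token_ids: list[int], chunk_size: int = 512, overlap_ratio: float = 0.15
) -> list[list[int]]:
    """Split token IDs into windows where each window repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # 512 * 0.15 -> 76 tokens
    stride = chunk_size - overlap              # how far each window advances

    chunks: list[list[int]] = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reached the end of the document
        start += stride
    return chunks


# Fake token IDs standing in for a tokenized document.
chunks = chunk_tokens(list(range(1000)), chunk_size=512, overlap_ratio=0.15)
print([len(c) for c in chunks])
```

Each chunk after the first begins with the last 76 tokens of its predecessor, so a sentence that would otherwise open a chunk without context carries its lead-in along.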

A common practitioner guideline is that 10–20% overlap gives a good precision-recall tradeoff; treat it as a starting point and confirm it on your own data. Beyond 25%, gains tend to diminish while storage grows.

Diagnosing your current setup

Before tuning, know where you stand. These are the three failure modes and how to detect them:

Failure mode 1: Chunks too small

Symptoms: High recall (the right chunk is in top-k) but poor coherence in LLM answers. The model keeps saying "based on the context, I don't have enough information" even when you're sure the document covers the topic.

Detection: Count how many chunks are under 100 tokens. If more than 20% of your chunks are below 100 tokens, you have a fragmentation problem.
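The check above is easy to automate once you have a token count per chunk. A minimal sketch, assuming the counts were already computed (for example with tiktoken); the function names are hypothetical and the 100-token and 20% thresholds are the ones quoted in the text:

```python
def small_chunk_fraction(token_counts: list[int], min_tokens: int = 100) -> float:
    """Fraction of chunks with fewer than `min_tokens` tokens."""
    if not token_counts:
        return 0.0
    small = sum(1 for n in token_counts if n < min_tokens)
    return small / len(token_counts)


def has_fragmentation_problem(
    token_counts: list[int], min_tokens: int = 100, max_fraction: float = 0.20
) -> bool:
    """True when more than `max_fraction` of chunks fall below `min_tokens`."""
    return small_chunk_fraction(token_counts, min_tokens) > max_fraction


# Example token counts for ten chunks (made-up numbers).
counts = [40, 80, 300, 512, 512, 90, 256, 128, 512, 60]
print(small_chunk_fraction(counts), has_fragmentation_problem(counts))
```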

Fix: Increase chunk size, or switch from character-based to sentence-based chunking.

Failure mode 2: Chunks too large

Symptoms: Low recall. The correct answer exists in the document but doesn't appear in the top-k results even at k=5. Retrieval feels random.

Detection: Check your token distribution histogram. If the distribution is heavily right-skewed with many chunks in the 800–2000+ token range, they're likely diluting relevance.
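If you don't have a plotting tool at hand, a text histogram is enough to spot a right-skewed distribution. The sketch below is one way to do it; the bucket edges are an assumption chosen to line up with the ranges mentioned in this article, and the function names are hypothetical.

```python
from bisect import bisect_right

# Bucket boundaries (assumed for this example).
EDGES = [100, 256, 512, 800, 2000]
LABELS = ["<100", "100-255", "256-511", "512-799", "800-1999", "2000+"]


def token_histogram(token_counts: list[int]) -> dict[str, int]:
    """Count how many chunks fall into each token-size bucket."""
    histogram = {label: 0 for label in LABELS}
    for n in token_counts:
        histogram[LABELS[bisect_right(EDGES, n)]] += 1
    return histogram


def large_chunk_share(token_counts: list[int], threshold: int = 800) -> float:
    """Fraction of chunks at or above `threshold` tokens."""
    if not token_counts:
        return 0.0
    return sum(1 for n in token_counts if n >= threshold) / len(token_counts)


# Example token counts (made-up numbers).
counts = [50, 120, 300, 900, 2500, 850]
for label, count in token_histogram(counts).items():
    print(f"{label:>9} | {'#' * count} {count}")
print(f"share >= 800 tokens: {large_chunk_share(counts):.0%}")
```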

Fix: Reduce chunk size, or switch from paragraph-based to token-based chunking.

Failure mode 3: Zero overlap, boundary-spanning facts

Symptoms: Answers to specific factual questions ("what is the limit?", "what does X return?") are wrong even when the answer is in the document. Questions about multi-step processes produce incomplete answers.

Detection: Find 5–10 questions you know the document can answer. Retrieve the top-3 chunks for each. Check how many answers span chunk boundaries. If more than half do, overlap is your problem.
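Part of this check can be scripted when your test answers are verbatim spans from the document. The sketch below is a crude proxy under that assumption: it uses exact substring matching after whitespace and case normalization, it expects zero-overlap chunks in document order, and the function names are hypothetical.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't block a match."""
    return " ".join(text.lower().split())


def answer_location(answer: str, chunks: list[str]) -> str:
    """Classify an answer span as 'single_chunk', 'spans_boundary', or 'not_found'.

    `chunks` must be in document order and produced with zero overlap.
    """
    needle = _normalize(answer)
    normalized = [_normalize(c) for c in chunks]
    if any(needle in chunk for chunk in normalized):
        return "single_chunk"
    for left, right in zip(normalized, normalized[1:]):
        if needle in f"{left} {right}":
            return "spans_boundary"
    return "not_found"


def boundary_span_rate(answers: list[str], chunks: list[str]) -> float:
    """Fraction of answers that only appear when two adjacent chunks are joined."""
    if not answers:
        return 0.0
    spanning = sum(1 for a in answers if answer_location(a, chunks) == "spans_boundary")
    return spanning / len(answers)


chunks = [
    "The public key is rotated every 24 hours. After rotation, old keys remain",
    "valid for a 6-hour grace period to allow in-flight tokens to expire.",
]
print(answer_location("old keys remain valid for a 6-hour grace period", chunks))
```

Paraphrased answers will show up as 'not_found' with this approach, so those still need a manual look.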

Fix: Set overlap to 10–15%. For sentence-based chunking, overlap 1–2 sentences.
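Sentence-based chunking with a sentence-level overlap might look like the sketch below. It rests on several simplifying assumptions: sentences are split with a naive regex on terminal punctuation, chunk size is budgeted in whitespace-separated words rather than model tokens, and the function names are hypothetical. A single sentence longer than the budget, or an overlap that nearly fills it, can still produce a chunk above `max_words`.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text: str, max_words: int = 120, overlap_sentences: int = 1) -> list[str]:
    """Group whole sentences into chunks, repeating the last sentences of each chunk."""
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # sentences in `current` that have not appeared in an emitted chunk
    for sentence in split_sentences(text):
        candidate = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and fresh > 0 and candidate > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            fresh = 0
        current.append(sentence)
        fresh += 1
    if fresh > 0:
        chunks.append(" ".join(current))
    return chunks


text = (
    "The system uses RSA-256 for signing all tokens. "
    "The public key is rotated every 24 hours at midnight UTC. "
    "After rotation, old keys remain valid for a 6-hour grace period."
)
for chunk in sentence_chunks(text, max_words=20, overlap_sentences=1):
    print(repr(chunk))
```

With a one-sentence overlap, the chunk containing the grace period also carries the sentence about key rotation, so "After rotation" has its antecedent.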

Embedding cost impact of chunk settings

One often-overlooked factor: chunk settings directly affect your embedding cost, as well as the number of vectors you store in Pinecone or Weaviate.

text-embedding-ada-002 costs $0.0001 per 1,000 tokens. Suppose you're indexing a 100,000-token document (chunk counts and token totals below are approximate):

Strategy	Overlap	Chunks	Total embedded tokens	Cost
Token 512	0%	195	99,840	$0.0100
Token 512	15%	230	117,760	$0.0118
Token 256	0%	390	99,840	$0.0100
Token 256	15%	460	117,760	$0.0118

At document scale, the difference is small. At corpus scale — 10,000 documents, 1M tokens each — it adds up:

0% overlap: $1,000 to index
15% overlap: $1,180 to index

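The cost increase is about 18%, slightly more than the 15% overlap itself, because each chunk advances through the document by only 85% of its length (1 / 0.85 ≈ 1.18). It is worth it for the retrieval improvement in most cases. But if you're dealing with a large corpus and tight budget, knowing the tradeoff helps you make an informed decision.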
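The arithmetic behind these figures can be reproduced with a short estimator. This is a sketch under stated assumptions: fixed-size token windows, an overlap rounded down to whole tokens, and the $0.0001 per 1,000 tokens price quoted above. The function name is hypothetical. It counts the short final chunk exactly, so its totals differ slightly from the rounded table values.

```python
import math

USD_PER_1K_TOKENS = 0.0001  # ada-002 price quoted in the text


def estimate_embedding(
    doc_tokens: int,
    chunk_size: int,
    overlap_ratio: float = 0.0,
    usd_per_1k: float = USD_PER_1K_TOKENS,
) -> tuple[int, int, float]:
    """Return (chunk count, total embedded tokens, cost in USD) for one document."""
    overlap = int(chunk_size * overlap_ratio)
    stride = chunk_size - overlap
    if doc_tokens <= chunk_size:
        n_chunks = 1 if doc_tokens > 0 else 0
        embedded = doc_tokens
    else:
        n_chunks = math.ceil((doc_tokens - chunk_size) / stride) + 1
        last_start = (n_chunks - 1) * stride
        # All chunks are full-size except the last, which holds the remainder.
        embedded = (n_chunks - 1) * chunk_size + (doc_tokens - last_start)
    return n_chunks, embedded, embedded / 1000 * usd_per_1k


for size in (512, 256):
    for ratio in (0.0, 0.15):
        n, tokens, cost = estimate_embedding(100_000, size, ratio)
        print(f"size={size} overlap={ratio:.0%} chunks={n} tokens={tokens} cost=${cost:.4f}")
```

Multiplying the per-document cost by the number of documents gives the corpus-scale figure.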

A practical tuning workflow

1. Pick a baseline: token-based, chunk size 512, overlap 10%. This works reasonably well for most document types.
2. Visualize it: use the RAG Chunk Inspector to see how your actual documents split. Look for oversized chunks (approaching 8,191 tokens), undersized chunks (below 100 tokens), and where key facts land relative to chunk boundaries.
3. Run a small retrieval eval: 20 question-answer pairs, index 100 chunks, measure top-3 recall. This takes 20 minutes and tells you more than any benchmark.
4. Adjust one variable at a time: change chunk size OR overlap, not both. Measure after each change. If top-3 recall doesn't improve after two adjustments, your problem is probably not chunking; look at the embedding model or the query preprocessing.
5. Lock the configuration in code: hard-code the winning configuration (strategy, chunk size, overlap) in your indexing pipeline. If you change it later, you need to re-index everything. Make reruns cheap by designing your pipeline to be idempotent from the start.
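The top-3 recall measurement from the retrieval eval step needs only a few lines. A minimal sketch, assuming you have labeled each question with the ID of the chunk that contains its answer; the function name, the dictionary layout, and the chunk IDs are all hypothetical:

```python
def top_k_recall(
    results: dict[str, list[str]], gold: dict[str, str], k: int = 3
) -> float:
    """Fraction of questions whose gold chunk ID appears in the top-k retrieved IDs.

    results: question -> chunk IDs ranked by similarity, best first
    gold:    question -> ID of the chunk that contains the answer
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for question, gold_id in gold.items()
        if gold_id in results.get(question, [])[:k]
    )
    return hits / len(gold)


# Made-up eval data: four questions with their retrieved chunk IDs.
gold = {"q1": "c7", "q2": "c2", "q3": "c9", "q4": "c1"}
results = {
    "q1": ["c7", "c3", "c1"],
    "q2": ["c5", "c4", "c2"],
    "q3": ["c1", "c2", "c3", "c9"],  # gold chunk only at rank 4
    "q4": ["c4", "c1"],
}
print(f"top-3 recall: {top_k_recall(results, gold, k=3):.2f}")
```

Rerunning the same function after each chunk-size or overlap change gives a like-for-like comparison, provided the gold labels are remapped to the new chunk IDs.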