Token vs Sentence vs Paragraph Chunking in RAG: Which Strategy Fits Your Documents
Compare token-based, sentence-boundary, and paragraph-boundary chunking strategies for RAG pipelines using tiktoken, LangChain, LlamaIndex, Pinecone, and Weaviate.
Chunking is the first decision you make when building a RAG pipeline, and it's the one most engineers get wrong. Not because the concept is hard — split a document into pieces, embed each piece, store them in Pinecone or Weaviate — but because the wrong strategy causes retrieval failures that look like model failures, and those are hard to debug.
This post compares the four main chunking strategies, explains where each one breaks down, and gives you a framework for choosing based on your document type.
The four strategies
Token-based chunking splits text into fixed windows measured in tokens, using a tokenizer like tiktoken's o200k_base (GPT-4o) or cl100k_base (GPT-3.5, GPT-4). You set a target token count, say 512, and every chunk except possibly the last holds exactly that many tokens, with consecutive chunks sharing the overlap.
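As a rough illustration of the windowing logic, here is a minimal sketch that works on any list of token ids. The function name `token_windows` and the default sizes are assumptions made for this example, and the stand-in ids replace a real tokenizer so the snippet runs without extra dependencies.

```python
def token_windows(tokens, chunk_size=512, overlap=51):
    """Split a list of token ids into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each new window advances
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return windows


# Stand-in token ids. With tiktoken installed, the list could come from:
#   enc = tiktoken.get_encoding("cl100k_base")
#   tokens = enc.encode(text)
#   chunks = [enc.decode(w) for w in token_windows(tokens)]
tokens = list(range(1200))
windows = token_windows(tokens, chunk_size=512, overlap=51)
print([len(w) for w in windows])  # [512, 512, 278]
```

Only the final window is shorter than the target, which is why token-based chunking gives such uniform sizes.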
Character-based chunking is the same concept but measured in characters instead of tokens. Simpler to reason about, but token counts vary per chunk because some words tokenize to more tokens than others. LangChain's CharacterTextSplitter uses this approach.
Sentence-boundary chunking groups whole sentences until the token target is reached. It never cuts mid-sentence. LlamaIndex's SentenceSplitter and LangChain's NLTKTextSplitter both implement variants of this. The resulting chunks follow natural language boundaries, which tends to produce more coherent retrieved passages.
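The grouping step can be sketched in a few lines. This is not how LlamaIndex or LangChain implement it: the regex sentence splitter and the whitespace-based `count_tokens` are simplified stand-ins chosen for this example, and a real pipeline would use a proper sentence tokenizer and the embedding model's tokenizer.

```python
import re


def split_sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def sentence_chunks(text, max_tokens=512):
    """Greedily pack whole sentences into chunks of at most max_tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine."
print(sentence_chunks(text, max_tokens=6))
# ['One two three. Four five six.', 'Seven eight nine.']
```

A single sentence longer than `max_tokens` still becomes its own oversize chunk in this sketch, so chunk sizes should be checked afterwards.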
Paragraph-boundary chunking groups paragraphs (separated by blank lines) until the token target is reached. It preserves thematic document units. This is the right default for structured documents like technical manuals, legal contracts, and medical records.
Where each strategy produces bad chunks
Token-based cutting mid-sentence
CHUNK #1: "The authorization endpoint requires a PKCE code_challenge parameter.
The code_challenge is derived from a code_verifier using SHA-256 hashing. The
verifier must be between 43 and 128 characters long. The endpoint will reject"
CHUNK #2: "requests that omit the challenge entirely. Clients should store the
verifier securely for the duration of the authorization flow."
The split lands mid-thought. If a user asks "what does the endpoint reject?", the relevant sentence is split across two chunks. With retrieval at k=1, they get half the answer.
Sentence chunking on dense technical text
Sentence-boundary chunking assumes sentences are meaningful units of retrieval. That works for prose. For code documentation or gene sequences, individual sentences often lack standalone context:
CHUNK #7: "See section 4.2 for the full parameter list."
That's a complete sentence — and a completely useless chunk to retrieve.
Paragraph chunking on unstructured data
Some documents don't have paragraph breaks. A PDF extracted with a parser like PyMuPDF or pdfplumber may come out as a single block of text with no blank lines. Paragraph chunking falls back to treating the entire document as one chunk, often exceeding your embedding model's token limit.
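For reference: text-embedding-ada-002 supports up to 8,191 tokens, and so do text-embedding-3-small and text-embedding-3-large. Longer inputs are either rejected with an error or truncated by the client before sending, depending on your library and settings, and that truncation can happen silently.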
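One possible guard against the missing-blank-line case is to hard-split any paragraph that exceeds the target on its own. The sketch below assumes a whitespace word count as a stand-in for real token counts, and `paragraph_chunks` is a hypothetical helper, not a library function.

```python
import re


def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def paragraph_chunks(text, max_tokens=512):
    """Group blank-line-separated paragraphs; hard-split any oversize paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0

    def flush():
        nonlocal current, current_len
        if current:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0

    for para in paragraphs:
        n = count_tokens(para)
        if n > max_tokens:
            # Fallback: the paragraph alone is too large, so cut it by word count.
            flush()
            words = para.split()
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        if current and current_len + n > max_tokens:
            flush()
        current.append(para)
        current_len += n
    flush()
    return chunks


# A document with no blank lines is one 1,200-word "paragraph".
wall_of_text = " ".join(["word"] * 1200)
print([count_tokens(c) for c in paragraph_chunks(wall_of_text)])  # [512, 512, 176]
```

The hard split cuts mid-sentence, so in practice a sentence-boundary fallback may be the better second step. The point here is only that no chunk leaves the function above the target.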
Comparing strategies on the same document
Run the same 2,000-word technical document through all four strategies at chunk size 512, overlap 10%:
| Strategy | Chunks | Avg tokens | Token range | Uniformity |
|---|---|---|---|---|
| Token | 5 | 512 | 512–512 | Perfect |
| Character | 6 | 428 | 380–512 | Good |
| Sentence | 7 | 368 | 124–512 | Moderate |
| Paragraph | 4 | 612 | 287–940 | Poor |
Token-based produces the most uniform chunks. Paragraph-based produces the widest range: one paragraph happened to be 940 tokens, nearly double the 512 target. That is still well under ada-002's 8,191-token limit, but on documents with longer paragraphs the same behavior can push a chunk past it.
You can reproduce this comparison with the RAG Chunk Inspector — paste your document, switch to Compare mode, and see all four strategies side by side.
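If you would rather compute the table's columns yourself, a small helper along these lines is enough. The name `summarize` and the sample token counts are made up for this example; they are not the measurements from the table.

```python
from statistics import mean


def summarize(token_counts):
    """Return chunk count, average size, and size range for one strategy."""
    return {
        "chunks": len(token_counts),
        "avg_tokens": round(mean(token_counts)),
        "min_tokens": min(token_counts),
        "max_tokens": max(token_counts),
    }


# Hypothetical per-chunk token counts for one strategy.
print(summarize([512, 512, 278]))
# {'chunks': 3, 'avg_tokens': 434, 'min_tokens': 278, 'max_tokens': 512}
```

Running it once per strategy on the same document gives one row of the comparison each time.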
Overlap: always set it to 10–20%
Every strategy benefits from overlap. The reason is simple: if a key fact straddles the boundary between two chunks, then without overlap neither chunk contains it in full. With 10–20% overlap, the boundary content appears in both adjacent chunks, so retrieval can find it regardless of which chunk scores highest.
The downside is redundant storage and higher embedding cost. With an overlap fraction p, each chunk advances by only (1 − p) of its length, so the chunk count grows by a factor of about 1/(1 − p): at 15% overlap you store and embed roughly 18% more tokens. For most corpora, that tradeoff is worth it.
At 25%+ overlap, you start hitting diminishing returns. Retrieval precision improves marginally, but storage grows noticeably.
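To see how overlap affects volume, the arithmetic can be checked directly. This sketch assumes fixed-size token windows and a hypothetical 100,000-token corpus; `chunk_count` is a name invented for the example.

```python
import math


def chunk_count(total_tokens, chunk_size, overlap_fraction):
    """Number of fixed-size windows needed to cover total_tokens."""
    overlap = int(chunk_size * overlap_fraction)
    stride = chunk_size - overlap  # tokens of new content per additional chunk
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


baseline = chunk_count(100_000, 512, 0.0)
for fraction in (0.10, 0.15, 0.25):
    n = chunk_count(100_000, 512, fraction)
    print(f"{fraction:.0%} overlap: {n} chunks, {n / baseline - 1:.0%} more than no overlap")
```

The growth follows 1 / (1 − overlap), which is why the cost climbs faster once overlap passes 25%.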
Choosing a strategy
Use token-based when:
- Your retrieval system is built around a specific embedding model limit (ada-002, text-embedding-3-small)
- You need predictable, consistent chunk sizes
- Your documents are mixed-format (code, tables, prose combined)
Use sentence-based when:
- Your documents are prose-heavy: articles, reports, customer support transcripts
- Questions are likely to be answered by single sentences or short passages
- You're using a question-answering model and want clean passage retrieval
Use paragraph-based when:
- Your documents have consistent paragraph structure
- You're working with legal contracts, technical manuals, or academic papers where paragraphs are meaningful standalone units
- Questions tend to require full-paragraph context
Use character-based rarely — it's mainly useful as a quick baseline when you don't want to depend on a tokenizer.
What the distribution tells you
After chunking, look at the token distribution histogram. The shape tells you whether your settings are working:
- Tight spike (all bars clustered): Token or character chunking — consistent sizes, good for production.
- Bell curve: Sentence chunking — most sentences fall in a midrange, expected behavior.
- Piled up on the right (most chunks are large): Paragraph chunking on dense docs. Check that none exceed 8,191 tokens.
- Piled up on the left (mostly tiny chunks): Your chunk size is too small, or the document has very short paragraphs/sentences. Chunks under ~50 tokens typically lack enough context to be useful.
The warning signals to act on immediately: bars near the right edge (approaching the embedding model limit) and bars concentrated at the left below 50 tokens.
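Both warning signals can be checked without plotting anything. In the sketch below, the 50-token floor and the 8,191-token limit come from the text above, while the 90% "near the limit" threshold and the 128-token bucket width are arbitrary assumptions for illustration.

```python
from collections import Counter


def histogram(token_counts, bucket=128):
    """Count chunks per bucket, keyed by the bucket's lower bound."""
    return dict(sorted(Counter((n // bucket) * bucket for n in token_counts).items()))


def flag_chunks(token_counts, min_tokens=50, limit=8191, near_fraction=0.9):
    """Return indices of chunks that are too small or close to the model limit."""
    return {
        "too_small": [i for i, n in enumerate(token_counts) if n < min_tokens],
        "near_limit": [i for i, n in enumerate(token_counts) if n >= near_fraction * limit],
    }


counts = [12, 300, 7800, 8191, 49, 50]  # hypothetical per-chunk token counts
print(histogram(counts))
print(flag_chunks(counts))  # {'too_small': [0, 4], 'near_limit': [2, 3]}
```

Flagged indices point back to the chunks worth inspecting by hand before indexing.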
Practical advice before you index
- Test your actual documents, not example text. Real documents have edge cases: footnotes, headers, bullet lists, code blocks. All of them interact differently with the four strategies.
- Measure token counts, not character counts. Your embedding model bills in tokens and truncates in tokens. Character count is a proxy that breaks down for languages and scripts the tokenizer encodes inefficiently.
- Run a small retrieval evaluation before indexing your full corpus. Take 20 question-answer pairs from your domain, index 50 chunks, and measure whether the correct chunk is in the top-3 results. That tells you more than any benchmark.
- Plan for reruns. Changing your chunking strategy means re-embedding and re-indexing. Make your pipeline idempotent from the start so reruns are cheap.
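The small retrieval evaluation described above could be scored with a helper like this one. It assumes you already have, for each question, the ranked chunk ids your retriever returned and the id of the chunk that contains the answer. The function name and the ids are hypothetical.

```python
def hit_rate_at_k(ranked_ids_per_question, expected_ids, k=3):
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if len(ranked_ids_per_question) != len(expected_ids):
        raise ValueError("need one ranked list per expected id")
    if not expected_ids:
        return 0.0
    hits = sum(
        1
        for ranked, expected in zip(ranked_ids_per_question, expected_ids)
        if expected in ranked[:k]
    )
    return hits / len(expected_ids)


# Three hypothetical questions: only the first finds its chunk in the top 3.
ranked = [["c1", "c2", "c3"], ["c9", "c4", "c5", "c7"], ["c2"]]
expected = ["c2", "c7", "c8"]
print(hit_rate_at_k(ranked, expected, k=3))
```

Comparing this number across chunking strategies on the same question set is a direct way to see which one suits your documents.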