The Moment We Realised Chunking Was the Actual Problem
We were three weeks into a RAG deployment for a Vietnamese bank's internal policy assistant. Retrieval looked fine in unit tests — cosine similarity scores above 0.85, top-3 recall solid. But the generated answers kept hallucinating clause numbers and mixing up conditions from different sections of the same regulatory document.
After two days of debugging prompts and reranker configurations, one of our engineers pulled up the raw chunks landing in the context window. Half of them started mid-sentence. Several split a numbered clause across two separate chunks, so the condition ("the borrower must...") landed in one chunk and its exception ("...unless the loan tenure exceeds...") landed in another, never to be retrieved together.
We had shipped with chunk size 512, overlap 50 — the default in almost every tutorial. That was the bug.
Why Chunking Matters More Than Most Teams Think
Before comparing strategies, it helps to frame what chunking actually does to your retrieval pipeline. When you embed a chunk, the vector you get represents the meaning of that chunk, not the meaning of the source document. At query time, you retrieve the chunks whose vectors are closest to the query vector. If a chunk contains half a thought, its vector is noisy — it partially represents the thought that got cut, and partially represents whatever the next unrelated sentence said.
This compounds downstream. A noisy chunk degrades the reranker's signal. A reranker working with incomplete clauses scores them lower, so they fall out of the final context. The LLM then generates from an incomplete picture and fills gaps with plausible-sounding fabrications. The failure looks like a hallucination problem. The root cause is a chunking problem.
Fixed-Size Chunking
Fixed-size chunking splits text into chunks of at most N tokens (or characters), with an optional overlap of M tokens between consecutive chunks. Every chunk is exactly N long except the last, which holds whatever remains.
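from langchain_text_splitters import TokenTextSplitter  # older releases: langchain.text_splitter

# Windows of at most 512 tokens; consecutive windows share 50 tokens.
splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_text(document_text)  # document_text: the raw document string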
Why teams reach for it: it is simple, fast, and deterministic. You know exactly how many chunks you will generate, which makes index sizing predictable. For homogeneous short-form content — product descriptions, FAQ pairs, news summaries — it works well enough because individual items are unlikely to span a boundary.
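To make the "you know exactly how many chunks" point concrete, here is a minimal, library-free sketch of the windowing. It is an illustration of the technique, not LangChain's implementation; `fixed_size_chunks` and `expected_chunk_count` are hypothetical helper names, and the token list is a stand-in for real tokenizer output.

```python
import math


def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def expected_chunk_count(n_tokens, chunk_size=512, overlap=50):
    """Closed-form chunk count for the splitter above."""
    if n_tokens <= chunk_size:
        return 1 if n_tokens > 0 else 0
    return math.ceil((n_tokens - overlap) / (chunk_size - overlap))


# Assumption: placeholder tokens instead of a real tokenizer.
tokens = [f"tok{i}" for i in range(2000)]
chunks = fixed_size_chunks(tokens)
print(len(chunks), expected_chunk_count(len(tokens)))
```

Because the count depends only on the token total, chunk size, and overlap, index size can be estimated before anything is embedded. Nothing in the loop looks at sentence or clause boundaries, which is the weakness discussed next.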
Where it breaks: dense documents with hierarchical structure. Regulatory text, technical manuals, legal contracts. These have natural semantic units (articles, sections, clauses) that rarely align with a 512-token boundary. When a boundary cuts through a numbered condition, the embedding model receives a fragment, not a complete thought.
We saw this directly with the bank's policy documents. A circular on loan provisioning had clauses averaging 800-1200 tokens, so fixed 512-token chunks split every such clause at least once. On clause-specific queries, our RAGAs context precision was 31% lower than with our later semantic approach.
When we still use it: simple QA over short, homogeneous corpora where speed of indexing matters more than marginal retrieval quality. Internal tooling, not customer-facing systems.
Sliding Window Chunking
Sliding window is fixed-size chunking with higher overlap, typically 20-50% of chunk size. By repeating content across consecutive chunks, you reduce the chance that a key sentence is cut at a boundary and never appears whole in any chunk.
splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=128,  # 25% overlap
)
What it buys you: boundary sensitivity decreases. Any sentence shorter than the overlap now appears whole in at least one chunk, even when a boundary falls inside it. Recall on single-sentence queries improves noticeably.
What it costs you: index size. A 25% overlap on a 1M-token corpus grows your chunk count by roughly 33%. A 50% overlap nearly doubles it. More chunks means more vectors to store, slower ANN search, and higher embedding API costs at indexing time. On the fintech project, we briefly ran a 40% overlap experiment and watched our Pinecone pod costs climb 60% with less than 5% retrieval improvement.
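The growth figures follow from the stride: with overlap fraction f, each window advances by (1 - f) of a chunk, so a long corpus needs about 1 / (1 - f) times as many chunks. A small sketch of that arithmetic, where `chunk_growth` is a hypothetical helper name:

```python
def chunk_growth(overlap_fraction):
    """Relative increase in chunk count versus zero overlap, for a long corpus."""
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")
    return 1 / (1 - overlap_fraction) - 1


for frac in (0.10, 0.25, 0.40, 0.50):
    print(f"{frac:.0%} overlap -> {chunk_growth(frac):.0%} more chunks")
```

This ignores the partial final chunk of each document, so treat it as an estimate for sizing rather than an exact count.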
The other problem is redundancy in retrieved context. With high overlap, the top-5 retrieved chunks often contain the same sentences repeated across multiple chunks. The LLM sees near-duplicate information eating up its context window, leaving less room for diverse supporting evidence.
When we use it: as a fallback when we cannot afford semantic chunking preprocessing time, but the corpus has enough variability that pure fixed-size hurts. We keep overlap at 15-20% max — enough to blunt hard cuts, not enough to bloat the index.
Semantic Chunking
Semantic chunking abandons fixed boundaries entirely. Instead, you embed each sentence (or small sentence group), then compute similarity between consecutive sentences. Where similarity drops sharply, you place a chunk boundary. The intuition: a sharp drop in similarity between adjacent sentences usually marks a topic transition — exactly where you want to split.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,  # split at top 10% similarity drops
)
chunks = chunker.create_documents([document_text])  # returns Document objects
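To show what the percentile breakpoint does mechanically, here is a dependency-free sketch of the same idea. It is not the library's implementation: `semantic_split`, `percentile`, and `cosine_distance` are hypothetical helpers, and the two-dimensional vectors are hand-made stand-ins for real sentence embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def semantic_split(sentences, vectors, pct=90):
    """Group sentences, cutting where the distance to the next one is unusually large."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(distances):
        if dist > threshold:  # sharp similarity drop: start a new chunk
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks


# Assumption: made-up sentences and toy vectors, for illustration only.
sentences = [
    "The borrower must post collateral.",
    "Collateral is valued quarterly.",
    "Exceptions apply to long tenures.",
    "Branches open at eight.",
    "Branches close on public holidays.",
]
vectors = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.3), (0.1, 1.0), (0.2, 0.9)]
for group in semantic_split(sentences, vectors):
    print(group)
```

In this toy data the only large distance sits between the third and fourth sentences, so the collateral sentences stay together and the branch sentences form their own chunk. Lowering `pct` produces more, smaller chunks.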
What it buys you: chunks that respect the document's own semantic structure. In our bank deployment, switching to semantic chunking cut the number of mid-clause splits from ~60% of clauses to under 8%. RAGAs context precision jumped from 0.61 to 0.79 on the regulatory QA test set. Faithfulness scores (whether generated answers stuck to retrieved content) improved from 0.72 to 0.84.
What it costs you: preprocessing time. Semantic chunking requires embedding every sentence before indexing, which is slower and more expensive than a simple token split. On a 500-document corpus, it added about 12 minutes to our indexing pipeline versus under 2 minutes for fixed-size. For real-time ingestion pipelines where documents arrive continuously, this matters.
Chunk size also becomes variable, which complicates index size estimation and can produce occasional very large or very small chunks if the similarity threshold is miscalibrated. We run a post-processing step that merges chunks below 100 tokens and splits chunks above 800 tokens to keep the distribution reasonable.
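A minimal sketch of that kind of merge-and-split pass is below. It is an illustration under stated assumptions, not our production code: `normalize_chunks` is a hypothetical name, token counts are approximated by whitespace-separated words, and oversized chunks are cut at even token positions rather than at sentence boundaries.

```python
import math


def normalize_chunks(chunks, min_tokens=100, max_tokens=800):
    """Merge undersized chunks into a neighbour, then split oversized ones evenly."""
    merged = []
    for chunk in chunks:
        tokens = chunk.split()  # assumption: whitespace words stand in for tokens
        if not tokens:
            continue
        # Merge when this chunk is too small, or when the previous one still is.
        if merged and (len(tokens) < min_tokens or len(merged[-1]) < min_tokens):
            merged[-1].extend(tokens)
        else:
            merged.append(tokens)

    result = []
    for tokens in merged:
        parts = math.ceil(len(tokens) / max_tokens)
        size = math.ceil(len(tokens) / parts)  # even pieces avoid a tiny remainder
        for start in range(0, len(tokens), size):
            result.append(" ".join(tokens[start:start + size]))
    return result


# Assumption: synthetic chunks of 50, 300, 1700 and 40 "tokens".
raw = ["a " * 50, "b " * 300, "c " * 1700, "d " * 40]
sizes = [len(c.split()) for c in normalize_chunks(raw)]
print(sizes)
```

Splitting evenly rather than at exactly 800 tokens avoids creating a new undersized tail that would need merging again. A lone chunk below the minimum with no neighbour is left as it is.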
Our Production Default: Semantic + Parent-Child Index
After running fixed-size and semantic chunking in production, our current default combines semantic chunking with a parent-child retrieval architecture.
We index at two levels. The child level contains our semantic chunks — typically 150-400 tokens each, split at topic boundaries. The parent level contains larger passages (the original document sections, usually 800-1500 tokens) that the child chunks were derived from.
At query time, we retrieve child chunks, whose small, semantically tight vectors match queries with high precision, then expand to their parent sections for the generation context. The LLM sees complete sections rather than isolated small chunks, which eliminates the problem of cut-off clauses while preserving the retrieval precision that small chunks provide.
# Retrieval step: fetch top-k child chunks
child_results = vector_store.similarity_search(query, k=5)

# Expansion step: collect each child's parent once, in retrieval order,
# so a section matched by several children is not repeated in the context
parent_ids = list(dict.fromkeys(chunk.metadata["parent_id"] for chunk in child_results))
parent_docs = document_store.get_by_ids(parent_ids)

# Generate from parents, not children
context = "\n\n".join(doc.page_content for doc in parent_docs)
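The snippet above assumes each child already carries a `parent_id`. The following self-contained sketch shows one way the two levels might be wired together in memory. Everything in it is hypothetical: the `Child` class, `build_index`, `retrieve_parents`, the word-overlap `score` (a stand-in for vector similarity), and the made-up policy text.

```python
import re
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: str


def build_index(sections, split_fn):
    """sections maps parent_id -> section text; split_fn yields child chunks."""
    parents = dict(sections)
    children = [
        Child(text, pid) for pid, section in sections.items() for text in split_fn(section)
    ]
    return parents, children


def score(query, text):
    """Toy relevance: share of query words found in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    if not q:
        return 0.0
    return len(q & set(re.findall(r"\w+", text.lower()))) / len(q)


def retrieve_parents(query, parents, children, k=3):
    """Rank children, then return their parent sections without duplicates."""
    top = sorted(children, key=lambda c: score(query, c.text), reverse=True)[:k]
    parent_ids = list(dict.fromkeys(c.parent_id for c in top))  # keep rank order
    return [parents[pid] for pid in parent_ids]


# Assumption: invented sections; a naive sentence split plays the child chunker.
sections = {
    "art-4": "The borrower must post collateral. "
             "Collateral is waived if the loan tenure exceeds ten years.",
    "art-9": "Branches open at eight. Branches close on public holidays.",
}
split_sentences = lambda s: [p.strip() + "." for p in s.split(".") if p.strip()]
parents, children = build_index(sections, split_sentences)
context = retrieve_parents("when is collateral waived", parents, children, k=2)
print("\n\n".join(context))
```

Both top-ranked children here belong to the same section, so the generator receives that section once, with the condition and its exception side by side.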
This approach has outperformed both pure fixed-size and pure semantic chunking on every production system we have shipped in the past year.
How We Approach This on Every Project
Before choosing a chunking strategy, we answer three questions:
- What is the average semantic unit in this corpus? For regulatory documents, it is a clause. For product catalogues, it is an item description. Chunk size should approximate that unit.
- How structured is the document? Structured documents (with clear headings and numbered sections) respond well to semantic or structure-aware chunking. Unstructured prose (chat logs, emails) tolerates fixed-size better.
- What does our RAGAs evaluation say? We always run context precision, context recall, and faithfulness on a test set of 50-100 queries before committing to a chunking configuration. Intuition gets us to a starting point; RAGAs scores decide which strategy ships.
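RAGAs metrics rely on an LLM judge, so they are not reproduced here. As a cheap sanity check before that run, a plain hit-rate comparison across chunking configurations might look like the sketch below. It is a rough proxy for context recall and not a RAGAs metric; `hit_rate`, the query IDs, and the sample passages are all hypothetical.

```python
def hit_rate(retrieved_by_query, gold_by_query, k=3):
    """Share of queries whose gold passage appears in the top-k retrieved chunks."""
    if not gold_by_query:
        raise ValueError("need at least one labelled query")
    hits = 0
    for query, gold in gold_by_query.items():
        top_k = retrieved_by_query.get(query, [])[:k]
        if any(gold in chunk for chunk in top_k):
            hits += 1
    return hits / len(gold_by_query)


# Assumption: tiny invented test set; a real one would hold 50-100 queries.
gold = {
    "q1": "unless the loan tenure exceeds",
    "q2": "provisioning rate",
}
runs = {
    "fixed_512": {
        "q1": ["the borrower must post collateral", "branches open at eight"],
        "q2": ["the provisioning rate is set annually"],
    },
    "semantic": {
        "q1": ["the borrower must post collateral unless the loan tenure exceeds ten years"],
        "q2": ["the provisioning rate is set annually"],
    },
}
scores = {name: hit_rate(retrieved, gold) for name, retrieved in runs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Substring matching is deliberately crude: it only tells you whether the needed text reached the context at all, which is the failure mode described at the start of this post.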
Chunking is not glamorous. But the retrieval failures we have debugged in the past two years trace back to it more often than to embedding model choice, vector database configuration, or prompt engineering. Get chunking right first, then optimise everything else.








