SapotaCorp

RAG Chunking Strategies: Fixed-Size, Semantic, and Sliding Window Compared

Chunking is the first decision that shapes everything downstream in a RAG pipeline, yet most teams treat it as an afterthought. After shipping production RAG systems for a Vietnamese bank and a regional fintech platform, we learned that the wrong chunking strategy alone can tank retrieval quality even when embeddings, vector search, and generation are all dialed in. Here is what we found comparing fixed-size, sliding window, and semantic chunking across real workloads.

Key takeaways

  • Fixed-size chunking is fast and predictable but breaks semantic boundaries — acceptable for homogeneous, short documents, harmful for dense regulatory or legal text.
  • Sliding window chunking mitigates boundary cuts by overlapping tokens, but inflates your index: a 25% overlap adds roughly a third more chunks, and a 50% overlap doubles the count.
  • Semantic chunking (split on embedding similarity drops) produces higher-quality chunks at the cost of preprocessing time — the right default for most production RAG systems.
  • Chunk size interacts with your embedding model's context window; mismatching them is a silent quality killer that only shows up in RAGAs faithfulness scores.
  • Our production default: semantic chunking with a parent-child index — store semantic chunks as children, retrieve parents for generation to balance precision and context.

The Moment We Realised Chunking Was the Actual Problem

We were three weeks into a RAG deployment for a Vietnamese bank's internal policy assistant. Retrieval looked fine in unit tests — cosine similarity scores above 0.85, top-3 recall solid. But the generated answers kept hallucinating clause numbers and mixing up conditions from different sections of the same regulatory document.

After two days of debugging prompts and reranker configurations, one of our engineers pulled up the raw chunks landing in the context window. Half of them started mid-sentence. Several split a numbered clause across two separate chunks, so the condition ("the borrower must...") landed in one chunk and its exception ("...unless the loan tenure exceeds...") landed in another, never to be retrieved together.

We had shipped with chunk size 512, overlap 50 — the default in almost every tutorial. That was the bug.

Why Chunking Matters More Than Most Teams Think

Before comparing strategies, it helps to frame what chunking actually does to your retrieval pipeline. When you embed a chunk, the vector you get represents the meaning of that chunk, not the meaning of the source document. At query time, you retrieve the chunks whose vectors are closest to the query vector. If a chunk contains half a thought, its vector is noisy — it partially represents the thought that got cut, and partially represents whatever the next unrelated sentence said.

This compounds downstream. A noisy chunk degrades the reranker's signal. A reranker working with incomplete clauses scores them lower, so they fall out of the final context. The LLM then generates from an incomplete picture and fills gaps with plausible-sounding fabrications. The failure looks like a hallucination problem. The root cause is a chunking problem.

Fixed-Size Chunking

Fixed-size chunking splits text into chunks of N tokens (or characters) each, with an optional overlap of M tokens between consecutive chunks. Only the final chunk may come out shorter than N.

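# Newer LangChain releases ship splitters in the langchain_text_splitters package;
# older releases expose the same class from langchain.text_splitter.
from langchain_text_splitters import TokenTextSplitter

# 512-token chunks, with the last 50 tokens of each chunk repeated in the next
splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_text(document_text)  # document_text: the raw document string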

Why teams reach for it: it is simple, fast, and deterministic. You know exactly how many chunks you will generate, which makes index sizing predictable. For homogeneous short-form content — product descriptions, FAQ pairs, news summaries — it works well enough because individual items are unlikely to span a boundary.
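
To make the mechanics concrete, here is a minimal sketch of fixed-size splitting over an already-tokenized document, together with the chunk-count arithmetic that makes index sizing predictable. The names `fixed_size_chunks` and `expected_chunk_count` are ours, chosen for illustration. This is not LangChain's implementation, and a real splitter also handles tokenization and decoding.

```python
import math


def fixed_size_chunks(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size, repeating `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def expected_chunk_count(n_tokens, chunk_size, overlap=0):
    """Predict the number of chunks without splitting anything."""
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((n_tokens - chunk_size) / stride)


# A 1,200-token document with the tutorial defaults (512 / 50)
tokens = list(range(1200))  # stand-in for real token IDs
chunks = fixed_size_chunks(tokens, chunk_size=512, overlap=50)
print(len(chunks), expected_chunk_count(1200, 512, 50))  # 3 3
```

Note that the windows fall at token offsets 0, 462, and 924 regardless of where sentences or clauses begin, which is exactly the weakness discussed next.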

Where it breaks: dense documents with hierarchical structure. Regulatory text, technical manuals, legal contracts. These have natural semantic units (articles, sections, clauses) that rarely align with a 512-token boundary. When a boundary cuts through a numbered condition, the embedding model receives a fragment, not a complete thought.

We saw this directly with the bank's policy documents. A circular on loan provisioning had clauses averaging 800-1200 tokens. Fixed 512-token chunks meant every clause was split at least once. Retrieval quality for clause-specific queries dropped 31% on our RAGAs context precision metric compared to our later semantic approach.

When we still use it: simple QA over short, homogeneous corpora where speed of indexing matters more than marginal retrieval quality. Internal tooling, not customer-facing systems.

Sliding Window Chunking

Sliding window is fixed-size with higher overlap — typically 20-50% of chunk size. The idea is that by repeating content across consecutive chunks, you reduce the chance that a key sentence appears only at the cut boundary of one chunk.

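from langchain_text_splitters import TokenTextSplitter

# Same splitter as before, with the overlap raised to 25% of the chunk size
splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=128,  # 128 / 512 = 25% overlap
)
chunks = splitter.split_text(document_text)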

What it buys you: boundary sensitivity decreases. A sentence that would have been split across chunks now appears whole in at least one chunk. Recall on single-sentence queries improves noticeably.

What it costs you: index size. Each new chunk advances by only the chunk size minus the overlap, so a 25% overlap on a 1M-token corpus grows your chunk count by roughly 33%, and a 50% overlap doubles it. More chunks means more vectors to store, slower ANN search, and higher embedding API costs at indexing time. On the fintech project, we briefly ran a 40% overlap experiment and watched our Pinecone pod costs climb 60% with less than 5% retrieval improvement.
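
The growth figures follow from simple arithmetic: with an overlap ratio r, each chunk contributes only (1 - r) of its length in new tokens, so the chunk count scales by about 1 / (1 - r). A small sketch of that calculation (the helper name `index_growth_factor` is ours, and the result ignores edge effects at the start and end of each document):

```python
def index_growth_factor(overlap_ratio):
    """Approximate multiplier on chunk count relative to zero overlap."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    return 1.0 / (1.0 - overlap_ratio)


for ratio in (0.10, 0.25, 0.40, 0.50):
    print(f"{ratio:.0%} overlap -> {index_growth_factor(ratio):.2f}x chunks")
# 10% overlap -> 1.11x chunks
# 25% overlap -> 1.33x chunks
# 40% overlap -> 1.67x chunks
# 50% overlap -> 2.00x chunks
```

The 40% row lines up with the cost increase we saw in the fintech experiment: roughly two-thirds more vectors to embed and store.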

The other problem is redundancy in retrieved context. With high overlap, the top-5 retrieved chunks often contain the same sentences repeated across multiple chunks. The LLM sees near-duplicate information eating up its context window, leaving less room for diverse supporting evidence.
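
One way to quantify this redundancy is to measure how much of the retrieved context repeats text already seen in a higher-ranked chunk. The sketch below is an illustrative diagnostic rather than something from our pipeline: it compares overlapping runs of tokens (shingles), and both the function names and the shingle length of 8 are assumptions.

```python
def shingles(tokens, n):
    """All contiguous runs of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def redundancy_ratio(chunks, n=8):
    """Fraction of shingles that already appeared in an earlier retrieved chunk.

    chunks: list of token lists, ordered by retrieval rank.
    """
    seen = set()
    total = repeated = 0
    for tokens in chunks:
        chunk_shingles = shingles(tokens, n)
        for shingle in chunk_shingles:
            total += 1
            if shingle in seen:
                repeated += 1
        # Update after the loop so repeats inside one chunk are not counted.
        seen.update(chunk_shingles)
    return repeated / total if total else 0.0


document = list(range(100))  # stand-in for token IDs
overlapping = [document[0:60], document[30:90]]  # 50% overlap windows
disjoint = [document[0:40], document[50:90]]
print(round(redundancy_ratio(overlapping), 3), redundancy_ratio(disjoint))
```

A ratio that climbs as you raise the overlap setting is a sign that the top-k results are spending context window on repeated text.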

When we use it: as a fallback when we cannot afford semantic chunking preprocessing time, but the corpus has enough variability that pure fixed-size hurts. We keep overlap at 15-20% max — enough to blunt hard cuts, not enough to bloat the index.

Semantic Chunking

Semantic chunking abandons fixed boundaries entirely. Instead, you embed each sentence (or small sentence group), then compute similarity between consecutive sentences. Where similarity drops sharply, you place a chunk boundary. The intuition: a sharp drop in similarity between adjacent sentences usually marks a topic transition — exactly where you want to split.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,  # split at top 10% similarity drops
)
chunks = chunker.create_documents([document_text])
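
To show what the percentile threshold is doing, here is a dependency-free sketch of the breakpoint logic described above. It assumes you already have one non-zero embedding vector per sentence; the toy 2-D vectors and the function names are illustrative, and this is a simplification rather than the library's actual implementation.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, vectors, pct=90):
    """Group sentences, cutting where adjacent-sentence distance is unusually large."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # sharp similarity drop: start a new chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = ["A1", "A2", "A3", "B1", "B2"]
vectors = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(sentences, vectors))  # [['A1', 'A2', 'A3'], ['B1', 'B2']]
```

Because the threshold is relative to the document's own distance distribution, lowering the percentile produces more, smaller chunks, and raising it produces fewer, larger ones.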

What it buys you: chunks that respect the document's own semantic structure. In our bank deployment, switching to semantic chunking cut the share of clauses split mid-clause from ~60% to under 8%. RAGAs context precision jumped from 0.61 to 0.79 on the regulatory QA test set. Faithfulness scores (whether generated answers stuck to retrieved content) improved from 0.72 to 0.84.
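
A mid-clause split rate like this can be computed once you know the token offsets of each clause and each chunk. The sketch below is one plausible definition, not necessarily the exact one behind the figures above: a clause counts as split when no single chunk contains it in full. The function name and the example offsets are hypothetical.

```python
def split_clause_rate(clause_spans, chunk_spans):
    """Fraction of clauses not fully contained in any one chunk.

    Spans are (start, end) token offsets with an exclusive end.
    """
    if not clause_spans:
        return 0.0

    def is_intact(clause):
        clause_start, clause_end = clause
        return any(
            start <= clause_start and clause_end <= end for start, end in chunk_spans
        )

    broken = sum(1 for clause in clause_spans if not is_intact(clause))
    return broken / len(clause_spans)


clauses = [(0, 900), (900, 1000), (1000, 2000)]
fixed_chunks = [(0, 512), (512, 1024), (1024, 1536), (1536, 2048)]
semantic = [(0, 900), (900, 2000)]
print(split_clause_rate(clauses, fixed_chunks), split_clause_rate(clauses, semantic))
```

Defining intactness by containment also works for overlapping windows, where a clause cut by one chunk may still appear whole in its neighbour.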

What it costs you: preprocessing time. Semantic chunking requires embedding every sentence before indexing, which is slower and more expensive than a simple token split. On a 500-document corpus, it added about 12 minutes to our indexing pipeline versus under 2 minutes for fixed-size. For real-time ingestion pipelines where documents arrive continuously, this matters.

Chunk size also becomes variable, which complicates index size estimation and can produce occasional very large or very small chunks if the similarity threshold is miscalibrated. We run a post-processing step that merges chunks below 100 tokens and splits chunks above 800 tokens to keep the distribution reasonable.
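
A minimal sketch of such a post-processing pass, using the 100 and 800 token limits mentioned above. It operates on token lists in document order; the function name and the merge policy (fold a small chunk into its neighbour, then cut oversized chunks into near-equal pieces) are assumptions for illustration, not a description of our exact code.

```python
import math


def normalize_chunk_sizes(chunks, min_tokens=100, max_tokens=800):
    """Merge undersized chunks into a neighbour, then split oversized ones."""
    # Pass 1: fold a small chunk into the previous one; if the previous
    # chunk is itself still too small, fold the current chunk into it.
    merged = []
    for tokens in chunks:
        if merged and (len(tokens) < min_tokens or len(merged[-1]) < min_tokens):
            merged[-1] = merged[-1] + list(tokens)
        else:
            merged.append(list(tokens))

    # Pass 2: cut anything above max_tokens into near-equal pieces.
    result = []
    for tokens in merged:
        if len(tokens) <= max_tokens:
            result.append(tokens)
            continue
        n_parts = math.ceil(len(tokens) / max_tokens)
        part_size = math.ceil(len(tokens) / n_parts)
        result.extend(
            tokens[i:i + part_size] for i in range(0, len(tokens), part_size)
        )
    return result


sizes = [40, 300, 50, 900, 1700]
chunks = [list(range(size)) for size in sizes]  # stand-in token lists
print([len(chunk) for chunk in normalize_chunk_sizes(chunks)])
# [390, 450, 450, 567, 567, 566]
```

Merging trades away some boundary quality, since a tiny chunk is folded into a neighbour it was judged dissimilar to, and the forced cuts in pass 2 ignore sentence boundaries, so it is worth checking how often either pass fires before tuning the threshold instead.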

Our Production Default: Semantic + Parent-Child Index

After running these approaches in production, our current default combines semantic chunking with a parent-child retrieval architecture.

We index at two levels. The child level contains our semantic chunks — typically 150-400 tokens each, split at topic boundaries. The parent level contains larger passages (the original document sections, usually 800-1500 tokens) that the child chunks were derived from.
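
On the indexing side, the key detail is that every child chunk carries a pointer back to the section it came from. The following in-memory sketch shows that bookkeeping. The `Chunk` class, `build_parent_child_index`, and the ID scheme are hypothetical, and `split_children` stands in for whatever semantic chunker you use. In a real system the children would be embedded into the vector store and the parents kept in a document store keyed by ID.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def build_parent_child_index(sections, split_children):
    """Build both index levels from a mapping of parent_id -> section text.

    split_children: callable that turns a section's text into child texts.
    """
    parents = {}
    children = []
    for parent_id, text in sections.items():
        parents[parent_id] = Chunk(parent_id, text)
        for position, child_text in enumerate(split_children(text)):
            children.append(
                Chunk(
                    chunk_id=f"{parent_id}:{position}",
                    text=child_text,
                    metadata={"parent_id": parent_id},  # used at query time
                )
            )
    return parents, children


sections = {
    "sec-1": "The borrower must post collateral. Loans over ten years are exempt.",
    "sec-2": "Provisioning is reviewed quarterly.",
}
# Naive sentence split as a placeholder for a semantic chunker
parents, children = build_parent_child_index(
    sections, lambda text: [part for part in text.split(". ") if part]
)
print([(child.chunk_id, child.metadata["parent_id"]) for child in children])
```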

At query time, we retrieve child chunks (high precision, because semantically tight vectors match queries well), then expand to their parent sections for the generation context. The LLM sees complete sections rather than isolated small chunks, which avoids the problem of cut-off clauses while preserving the retrieval precision that small chunks provide.

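# Retrieval step: fetch top-k child chunks.
# vector_store is assumed to be a LangChain-style vector store over the child chunks.
child_results = vector_store.similarity_search(query, k=5)

# Expansion step: collect each child's parent ID, dropping duplicates while
# preserving rank order (several children often share one parent section).
parent_ids = list(dict.fromkeys(chunk.metadata["parent_id"] for chunk in child_results))

# document_store.get_by_ids stands in for whatever ID lookup your parent store provides.
parent_docs = document_store.get_by_ids(parent_ids)

# Generate from parents, not children
context = "\n\n".join(doc.page_content for doc in parent_docs)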

This approach consistently outperforms either pure fixed-size or pure semantic chunking alone across every production system we have shipped in the past year.

How We Approach This on Every Project

Before choosing a chunking strategy, we answer three questions:

  1. What is the average semantic unit in this corpus? For regulatory documents, it is a clause. For product catalogues, it is an item description. Chunk size should approximate that unit.
  2. How structured is the document? Structured documents (with clear headings and numbered sections) respond well to semantic or structure-aware chunking. Unstructured prose (chat logs, emails) tolerates fixed-size better.
  3. What does our RAGAs evaluation say? We always run context precision, context recall, and faithfulness on a test set of 50-100 queries before committing to a chunking configuration. Intuition gets us to a starting point; RAGAs scores decide which strategy ships.
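
For the first question, a quick length profile of the corpus's semantic units gives a starting point for chunk size. The sketch below assumes you have already extracted the units (clauses, item descriptions) as strings. It uses whitespace word counts as a rough proxy for tokens unless you pass in a real tokenizer's counter, and the function name is ours.

```python
import statistics


def unit_length_profile(units, count_tokens=lambda text: len(text.split())):
    """Summarize how long the corpus's semantic units are."""
    lengths = sorted(count_tokens(unit) for unit in units)
    if not lengths:
        raise ValueError("need at least one unit")
    # Nearest-rank 90th percentile, computed with integer arithmetic
    p90_index = (9 * len(lengths) + 9) // 10 - 1
    return {
        "median": statistics.median(lengths),
        "p90": lengths[p90_index],
        "max": lengths[-1],
    }


# Ten synthetic units of 1 to 10 words each
units = [" ".join(["word"] * n) for n in range(1, 11)]
print(unit_length_profile(units))  # {'median': 5.5, 'p90': 9, 'max': 10}
```

If the 90th percentile sits well above your planned chunk size, most long units will be cut, which is the situation we hit with 800-1200 token clauses and 512-token chunks.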

Chunking is not glamorous. But the retrieval failures we have debugged in the past two years trace back to it more often than to embedding model choice, vector database configuration, or prompt engineering. Get chunking right first, then optimise everything else.
