A common pattern we see: a Series A team builds a RAG assistant, runs a 50-question internal demo, ships to production, and within two weeks the support inbox is full of "the AI gave me a wrong answer" tickets. Nothing changed between Tuesday's demo and Friday's outage. The same model, the same retrieval, the same prompt template.
What changed is the question distribution. Internal demo questions are written by people who already know the corpus. Production users do not know it. The four failure modes below show up almost every time.
Pitfall 1: nearest neighbor always returns something
Vector search does not have a "nothing matched" mode. It returns the top-k closest chunks regardless of whether any of them actually answer the question. And since the similarity score never makes it into the prompt, a chunk retrieved at cosine 0.62 looks exactly like one retrieved at 0.78 to a model that just consumes the chunks as context.
The result is a confident hallucination on every long-tail query. A user asks about a feature the team has not documented yet, and the assistant returns a fluent paragraph stitched together from the closest tangentially related content. The user has no way to know the answer is wrong.
The Sapota fix is two-pronged. First, set a similarity floor (typically cosine 0.7 to 0.75 depending on the embedding model) and return "I do not have this in my knowledge base" when the top result falls below it. Second, add a faithfulness check using either Ragas or a smaller LLM-as-judge that verifies the generated answer is grounded in the retrieved context before it ships to the user. The faithfulness gate alone catches around 40% of the hallucinations we see in audits.
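In code, both gates fit in one function. A minimal sketch, with the vector store and the model injected as plain callables; the 0.72 floor and the judge prompt are illustrative, not tied to any particular library:

```python
from typing import Callable

SIMILARITY_FLOOR = 0.72  # tune per embedding model; 0.70 to 0.75 is the usual band
NO_ANSWER = "I do not have this in my knowledge base."

def answer(
    question: str,
    vector_search: Callable[[str, int], list[tuple[str, float]]],
    llm: Callable[[str], str],
) -> str:
    # vector_search returns [(chunk_text, cosine_score), ...], best first.
    hits = vector_search(question, 5)

    # Gate 1: similarity floor. Refuse rather than stitch together
    # tangentially related chunks on a long-tail query.
    if not hits or hits[0][1] < SIMILARITY_FLOOR:
        return NO_ANSWER

    context = "\n\n".join(chunk for chunk, _ in hits)
    draft = llm(
        f"Answer using only this context.\n\nCONTEXT:\n{context}\n\nQUESTION:\n{question}"
    )

    # Gate 2: faithfulness. A cheap judge model checks that the draft is
    # grounded in the retrieved context before it ships to the user.
    verdict = llm(
        "Does the ANSWER make claims not supported by the CONTEXT? "
        f"Reply GROUNDED or UNGROUNDED.\n\nCONTEXT:\n{context}\n\nANSWER:\n{draft}"
    )
    return NO_ANSWER if "UNGROUNDED" in verdict else draft
```

Swapping the inline judge prompt for Ragas faithfulness scoring changes the second gate, not the shape of the function.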
Pitfall 2: chunk size was picked once and never revisited
Most teams pick chunk_size=512 because that is the default in the framework they started with, and never look at it again. This is fine for blog-post-style content. It is not fine for a corpus that mixes blog posts, research papers, code documentation, and contracts.
Two failure modes follow:
- Chunks too small. Formula references like "applying equation (3)" lose the equation. Code references lose the function definition. Legal clauses lose their parent section.
- Chunks too large. The top-1 result returns a 2000-token chunk where the relevant sentence is buried in the middle. The LLM gets distracted by the surrounding noise and answers the wrong sub-question.
The Sapota playbook is to pick chunk size per content type, not per corpus. Markdown documentation gets recursive splitting on headings. Research papers get hierarchical chunking with parent-section metadata. Contracts get section-level chunks with cross-reference resolution. The infrastructure cost of mixing strategies is small. The recall difference is not.
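In code this is a dispatch table, not a framework setting. A sketch, where the structure-aware splitters are hypothetical stand-ins for whatever heading-aware or section-aware parsers the corpus actually needs:

```python
from typing import Callable

def split_fixed(doc: str, size: int = 512) -> list[str]:
    # Crude word-count fallback for content with no exploitable structure.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Hypothetical structure-aware splitters; each lives with its content type.
def split_markdown_on_headings(doc: str) -> list[str]: ...
def split_paper_hierarchically(doc: str) -> list[str]: ...   # keeps parent-section metadata
def split_contract_by_section(doc: str) -> list[str]: ...    # resolves cross-references

CHUNKERS: dict[str, Callable[[str], list[str]]] = {
    "markdown_docs": split_markdown_on_headings,
    "research_paper": split_paper_hierarchically,
    "contract": split_contract_by_section,
}

def chunk(doc: str, content_type: str) -> list[str]:
    # Per content type, not per corpus.
    return CHUNKERS.get(content_type, split_fixed)(doc)
```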
If the corpus is genuinely homogeneous, run a sweep at 256, 512, and 1024 tokens against a 100-question eval set and pick the winner. Do not eyeball it.
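The sweep itself is a dozen lines, assuming a hypothetical build_index helper that re-chunks and re-embeds the corpus, and an eval set of human-labeled (question, relevant document) pairs:

```python
def sweep_chunk_sizes(corpus, eval_set, sizes=(256, 512, 1024), k=5):
    """eval_set: [(question, relevant_doc_id), ...] with human-verified labels.

    build_index and index.search are assumed interfaces over whatever
    vector store the pipeline already uses.
    """
    recall_at_k = {}
    for size in sizes:
        index = build_index(corpus, chunk_size=size)  # re-chunk and re-embed
        hits = sum(
            1
            for question, doc_id in eval_set
            if doc_id in {c.doc_id for c in index.search(question, k)}
        )
        recall_at_k[size] = hits / len(eval_set)
    return recall_at_k  # pick the winner on numbers, not by eye
```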
Pitfall 3: there is no observability, so degradation is invisible
This is the one that quietly kills RAG products six months in. The team ships, things look fine, the founder moves on to the next feature. Three months later the corpus has grown by 40%, the embedding distribution has drifted, and recall has dropped from 85% to 62%. Nobody notices because nobody is looking.
The minimum observability stack we ship with every RAG project:
- Per-query logging of the user question, the retrieved chunk IDs, the retrieval scores, the generated answer, and the latency at each stage (embedding, vector search, LLM generation).
- A weekly eval cron that runs a fixed 100-question ground-truth set through the production pipeline and tracks recall@5, faithfulness, and answer correctness over time. We use Ragas for the metric layer and Opik or Langfuse for the trace storage.
- An alert when the weekly score drops more than 5% week-over-week, which is the signal that something in the corpus or the model has shifted.
The expensive version of this is a full LLMOps platform. The cheap version is a Postgres table, a cron job, and a Slack webhook. Both work. What does not work is having neither.
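A sketch of the cheap version; the table name, the webhook URL, and the 5-point threshold are illustrative, and cur is any DB-API cursor over the Postgres table:

```python
import json
import statistics
import time
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # illustrative placeholder

def log_query(cur, question, chunk_ids, scores, answer, timings):
    # One row per production query: the raw material for every later diagnosis.
    cur.execute(
        "INSERT INTO rag_queries (ts, question, chunk_ids, scores, answer, timings)"
        " VALUES (%s, %s, %s, %s, %s, %s)",
        (time.time(), question, json.dumps(chunk_ids),
         json.dumps(scores), answer, json.dumps(timings)),
    )

def weekly_alert(per_question_recall: list[float], last_week: float) -> float:
    # Cron entry point: score the fixed eval set, page on a >5-point drop.
    this_week = statistics.mean(per_question_recall)
    if this_week < last_week - 0.05:
        payload = json.dumps(
            {"text": f"RAG recall@5 dropped {last_week:.2f} -> {this_week:.2f}"}
        ).encode()
        urllib.request.urlopen(urllib.request.Request(
            SLACK_WEBHOOK, data=payload,
            headers={"Content-Type": "application/json"},
        ))
    return this_week
```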
Pitfall 4: single-step retrieval on multi-hop questions
User asks: "Which of our enterprise customers complained about the new pricing tier in Q4?" This is a three-hop question: find enterprise customers, filter by Q4 timeframe, find the ones with complaints about pricing. A single vector search over a chunked CRM corpus will not find the answer. It will find chunks that mention "enterprise" and "pricing" and stitch together something plausible.
The fix is one of three patterns, depending on how often these queries show up:
- Query decomposition. A planning LLM breaks the question into sub-queries, each gets a separate retrieval, and a final synthesis LLM combines the results. Adds latency and cost, but works for any multi-hop pattern (sketched after this list).
- Graph RAG. If the corpus has clear entities and relationships (customer, order, complaint, pricing tier), a graph database with multi-hop traversal handles these natively and is faster than agentic decomposition.
- Structured query routing. Some "RAG" questions are not RAG questions at all. They are SQL questions disguised as natural language. Route these to a SQL agent against the actual database and skip the vector store entirely.
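A minimal decomposition sketch, with the model wrapper and the single-hop retriever injected as callables:

```python
def multi_hop_answer(question: str, llm, retrieve) -> str:
    # Plan: break the question into standalone sub-questions.
    plan = llm(
        "Break this question into the minimal sequence of standalone "
        f"sub-questions, one per line:\n{question}"
    )
    sub_questions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Retrieve: one search per hop instead of one for the whole question.
    evidence = "\n\n".join(
        f"Sub-question: {sq}\nEvidence: {retrieve(sq)}" for sq in sub_questions
    )

    # Synthesize: a final call combines the per-hop evidence.
    return llm(f"Using only this evidence, answer: {question}\n\n{evidence}")
```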
The default we recommend is to add a router as the first step. A small fast model (Llama 3.1 8B is enough) classifies whether the query is single-hop, multi-hop, or structured, and dispatches accordingly. The cost is one extra LLM call per query. The accuracy gain on the 15 to 25% of queries that are actually multi-hop is worth it.
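The router itself is one classification prompt. A sketch, with labels mirroring the three patterns above and the dispatch targets being whatever the pipeline already has:

```python
ROUTER_PROMPT = """Classify the query as exactly one of:
single_hop - answerable from one retrieved passage
multi_hop  - requires combining facts from several passages
structured - a filter or aggregation best answered with SQL

Query: {query}
Label:"""

def route(query: str, small_llm) -> str:
    # One extra call to a small, fast model per query.
    label = small_llm(ROUTER_PROMPT.format(query=query)).strip().lower()
    return label if label in {"single_hop", "multi_hop", "structured"} else "single_hop"

# Dispatch: single_hop -> plain vector search, multi_hop -> decomposition
# (sketched above), structured -> the SQL agent against the real database.
```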
What an audit actually looks like
When a team brings us in after a launch goes wrong, the first 48 hours are diagnostic, not engineering. We ask for:
- The full production query log for the last two weeks (anonymized).
- The eval set the team used before launch.
- Three to five user-reported wrong answers with the original questions and what the user expected.
The query log usually shows that 60% of production queries fall outside the distribution the eval set covered. The eval set was written by the engineers, who think like engineers. Production users ask questions of a shape the eval set never anticipated.
The diagnostic deliverable is a one-page document mapping each user complaint to which of the four pitfalls caused it, with a recommended fix and the order to ship them in. Most teams can ship the highest-impact two fixes in the first week and recover 70% of the lost user trust.
If your launch is going sideways
If the AI assistant your team shipped is getting worse instead of better, or if the eval scores look fine but the user feedback says otherwise, that is the gap an audit closes. Sapota runs a fixed-scope two-week diagnostic engagement that produces the document above plus the implementation plan for the first three fixes.
Reach out via the AI engineering page with a description of what you are seeing in production. The first conversation is free and almost always surfaces at least one of the four pitfalls within thirty minutes.