Six months into production, a compliance assistant we built for a Vietnamese bank started failing in a specific and alarming way. The retrieval logs showed the system was finding the right regulatory documents. The generation logs showed coherent, well-structured answers. But the answers included penalty thresholds and approval conditions that did not appear in any retrieved chunk — and in some cases directly contradicted the source text.
The team had been running a green-emoji / red-emoji Notion sheet as their eval process. Manual spot-checks every two weeks. The failure mode had probably been present for a month before anyone caught it, and it took a domain expert on the client side — not engineering — to raise the flag.
This is what "testing the wrong thing" looks like in production RAG. The system appeared to work because retrieval was working. The generation step was hallucinating confidently, and nothing in our monitoring caught it.
A properly configured RAGAs evaluation is what we should have shipped from week one.
What RAGAs actually is
RAGAs (Retrieval Augmented Generation Assessment) is an evaluation framework that decomposes RAG quality into separate, independently measurable metrics. Instead of a single "accuracy" score — which collapses structurally different failure modes into one number — RAGAs gives you a score per layer of the pipeline.
The three metrics that carry the most diagnostic weight in our production work are faithfulness, context recall, and answer relevance. Each one points at a different place in the stack when something is wrong.
Faithfulness: is the answer grounded in the retrieved context?
Faithfulness answers the question: of everything the system claimed in its response, how much of it is actually supported by the retrieved chunks?
The computation works as follows. An LLM-as-judge breaks the generated answer into atomic claims — individual factual statements — and then checks each one against the retrieved context. The score is the proportion of claims that are grounded.
```python
# Needs a judge LLM; ragas defaults to OpenAI, so OPENAI_API_KEY must be set.
# Column names follow the ragas 0.1.x schema; newer releases may name them differently.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

data = {
    "question": ["What is the penalty for late filing?"],
    "answer": ["The penalty is 2% per month up to a maximum of 10%."],
    "contexts": [["Late filing incurs a 2% monthly penalty capped at 10% of the assessed amount."]],
    "ground_truth": ["The penalty is 2% per month with a 10% ceiling."],
}
dataset = Dataset.from_dict(data)

result = evaluate(dataset, metrics=[faithfulness])
print(result["faithfulness"])  # 1.0 expected here: both claims are grounded
```
In the bank project, faithfulness on a 120-question eval set came out at 0.71. That meant roughly three in ten claims the system made were not traceable to any retrieved document. For a compliance assistant, that number is catastrophic.
The practical threshold we work from: above 0.90 for general-purpose assistants, above 0.95 for anything compliance-sensitive. Below 0.85 in production, the system is manufacturing facts at a rate users cannot detect.
The fix that moves the metric fastest is a real-time faithfulness gate in the response path. After generating the answer, score it with a lightweight judge model. If the score falls below the threshold, return a fallback response — something like "I cannot find a confident answer for this in the knowledge base" — instead of shipping the hallucination. For the bank project, adding this gate cut user-reported wrong answers by about 55% before we had fixed any of the underlying chunking or retrieval issues.
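The gate itself is a few lines of control flow. The sketch below is a minimal version under stated assumptions: `gated_answer` and `score_fn` are illustrative names, the 0.95 default is the compliance threshold quoted above, and the lambdas stand in for a real judge-model call so the example runs offline.

```python
from typing import Callable, Sequence

FALLBACK = "I cannot find a confident answer for this in the knowledge base."

def gated_answer(
    answer: str,
    contexts: Sequence[str],
    score_fn: Callable[[str, Sequence[str]], float],
    threshold: float = 0.95,
) -> str:
    """Return the generated answer only if its faithfulness score clears the threshold."""
    score = score_fn(answer, contexts)
    return answer if score >= threshold else FALLBACK

contexts = ["Late filing incurs a 2% monthly penalty capped at 10% of the assessed amount."]

# Stand-in scorers (hypothetical): a real gate would call a lightweight judge model here.
grounded = gated_answer("The penalty is 2% per month, capped at 10%.", contexts, lambda a, c: 0.97)
ungrounded = gated_answer("The penalty is 5% per month.", contexts, lambda a, c: 0.40)

print(grounded)    # the original answer passes through
print(ungrounded)  # the fallback message is returned instead
```

Keeping the scorer as an injected function means the judge model can be swapped or mocked in tests without touching the response path.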
Context recall: did retrieval surface what was needed?
Context recall measures whether the retrieval layer actually brought back the information needed to answer the question correctly.
The computation requires a ground-truth answer. A judge model identifies the atomic facts present in the ground-truth answer and checks how many of those facts exist in the retrieved chunks. A score of 0.6 means 40% of the facts required to answer correctly were never retrieved — and no amount of prompt engineering will recover them at generation time.
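```python
from ragas.metrics import context_recall

# Reuses `evaluate` and `dataset` from the faithfulness example;
# context_recall relies on the ground_truth column.
result = evaluate(dataset, metrics=[context_recall])
print(result["context_recall"])
```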
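To show what the score means for a single question, here is a small sketch of the final arithmetic only. It assumes the judge model has already decided, for each ground-truth fact, whether it appears in the retrieved chunks; `recall_report` is an illustrative helper rather than a ragas function, and the `False` verdict is a made-up case.

```python
def recall_report(fact_in_context: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (recall score, ground-truth facts missing from the retrieved chunks)."""
    if not fact_in_context:
        return 0.0, []  # assumption: no ground-truth facts means nothing to recall
    missing = [fact for fact, found in fact_in_context.items() if not found]
    score = (len(fact_in_context) - len(missing)) / len(fact_in_context)
    return score, missing

# Hypothetical judge verdicts for the two facts in the example ground truth.
verdicts = {
    "penalty is 2% per month": True,
    "penalty is capped at 10%": False,
}
score, missing = recall_report(verdicts)
print(score)    # 0.5
print(missing)  # ['penalty is capped at 10%']
```

Logging the missing facts alongside the score makes it easier to see whether the gap comes from chunking, ranking, or the corpus itself.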
This metric is the one that reveals retrieval as the actual bottleneck, which teams frequently misidentify as a generation problem. We have seen projects where the team spent three weeks rewriting system prompts because the assistant was giving incomplete answers, when context recall was sitting at 0.55 and the information simply was not making it into the prompt.
The floor we target is 0.80. Below 0.70 is a hard signal to pause prompt work and fix retrieval first. Practical levers: retrieve more chunks (raise top-k), switch to hybrid search (BM25 plus dense vector), add query expansion or HyDE, or revisit chunk size and overlap. Context recall tells you which direction to pull.
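For the hybrid-search lever, the two ranked lists have to be merged somehow. One widely used option is reciprocal rank fusion, sketched below. This is an assumption for illustration: the text above does not say which fusion method we used, and the chunk ids and the constant `k = 60` are placeholder values.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids; higher combined score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

# Hypothetical results from a BM25 index and a dense vector index.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists rise to the top, which is the behaviour you want when keyword and semantic matches disagree.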
For the bank project, context recall was 0.68 on the initial eval. The primary issue was that regulatory documents had been chunked at a fixed 512-token boundary that split many rule definitions across two chunks. The retrieval system would find one chunk but not the adjacent one. Switching to sentence-window retrieval — indexing individual sentences but expanding the retrieved context to include surrounding sentences — brought context recall from 0.68 to 0.84 in one iteration.
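The idea behind sentence-window retrieval can be sketched in plain Python. This is a simplified illustration, not the production code: the regex splitter is naive, the document text is invented, and the retriever hit is hard-coded as an index where a real system would get it from a vector search over the individual sentences.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; a real pipeline would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def expand_window(sentences: list[str], hit_index: int, window: int = 1) -> str:
    """Return the matched sentence plus `window` neighbours on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

# Toy regulatory text, invented for illustration.
doc = (
    "Late filing incurs a monthly penalty. "
    "The penalty rate is 2% of the assessed amount. "
    "The total penalty is capped at 10%. "
    "Appeals must be lodged within 30 days."
)
sentences = split_sentences(doc)

# Suppose the retriever matched only sentence 1 (the rate).
context = expand_window(sentences, hit_index=1, window=1)
print(context)
```

With a window of one, the rate sentence is returned together with the cap that follows it, so a rule definition split across a boundary still reaches the prompt intact.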
Answer relevance: does the answer actually address the question?
Answer relevance captures something different from the first two metrics. A response can be fully faithful to the context and successfully recall all required information, and still score poorly on relevance if it is padded, evasive, or addresses a slightly different question than the one asked.
The computation uses a reverse-generation approach. The judge model generates several hypothetical questions that the given answer could plausibly be answering, embeds them, and then measures the average cosine similarity between those embeddings and the embedding of the actual user question. If the generated answer is truly on-topic, the hypothetical questions should cluster tightly around the original.
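```python
from ragas.metrics import answer_relevancy

# Reuses `evaluate` and `dataset` from the faithfulness example;
# this metric also calls an embedding model to compare questions.
result = evaluate(dataset, metrics=[answer_relevancy])
print(result["answer_relevancy"])
```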
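The final averaging step of that computation can be sketched as follows, assuming the embeddings already exist. The two-dimensional vectors are toy values, and `mean_question_similarity` is an illustrative name, not a ragas function; in the real metric both the hypothetical questions and their embeddings come from models.

```python
import math
from typing import Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_question_similarity(
    question_vec: Sequence[float],
    generated_vecs: Sequence[Sequence[float]],
) -> float:
    """Average similarity between the user question and the reverse-generated questions."""
    sims = [cosine(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)

user_question = [1.0, 0.0]
# One reverse-generated question matches the original, one is unrelated.
generated = [[1.0, 0.0], [0.0, 1.0]]
print(mean_question_similarity(user_question, generated))  # 0.5
```

An answer that drifts toward background material produces hypothetical questions like the second vector, and the average falls accordingly.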
We see this metric drop most often in two situations. First, when the system prompt instructs the model to "be comprehensive and include relevant context" — the model adds background information the user did not ask for, and the answer drifts from the question. Second, when the retrieved context contains partially relevant chunks and the model tries to hedge by addressing multiple possible interpretations of the question.
The floor we target is 0.80. The fix is usually prompt surgery: tighten the instruction to answer the specific question asked rather than provide comprehensive coverage, and reduce the number of retrieved chunks if the system is consistently pulling in tangential material.
Running all three together
In practice we run all three metrics together on every eval iteration. The combination tells you which layer to fix first.
| Pattern | Diagnosis |
|---|---|
| Low faithfulness, high recall, high relevance | Generation hallucinating despite good context |
| High faithfulness, low recall, high relevance | Retrieval not surfacing enough information |
| High faithfulness, high recall, low relevance | Prompt causing drift from the actual question |
| Low faithfulness, low recall, low relevance | Everything needs work — start with retrieval |
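```python
from ragas import evaluate
from ragas.metrics import faithfulness, context_recall, answer_relevancy

# `dataset` is the same Dataset built in the faithfulness example.
result = evaluate(
    dataset,
    metrics=[faithfulness, context_recall, answer_relevancy],
)
# One row per question, one column per metric.
print(result.to_pandas()[["faithfulness", "context_recall", "answer_relevancy"]])
```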
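The diagnosis table can be turned into a small triage helper. The sketch below is one possible encoding, not a ragas feature: the default thresholds are the floors quoted earlier (0.90 for faithfulness, 0.80 for the other two), the scores in the calls are hypothetical, and ordering retrieval first for mixed patterns the table does not list is an assumption that extends its last row.

```python
def diagnose(
    faith: float,
    recall: float,
    relevance: float,
    faith_min: float = 0.90,
    recall_min: float = 0.80,
    relevance_min: float = 0.80,
) -> list[str]:
    """Return the pipeline layers to fix, in the order to fix them."""
    todo = []
    if recall < recall_min:
        todo.append("retrieval")   # missing context cannot be fixed downstream
    if faith < faith_min:
        todo.append("generation")  # hallucinating despite the context it has
    if relevance < relevance_min:
        todo.append("prompt")      # answer drifts from the question asked
    return todo

print(diagnose(0.71, 0.90, 0.85))  # ['generation']
print(diagnose(0.95, 0.55, 0.85))  # ['retrieval']
print(diagnose(0.70, 0.60, 0.70))  # ['retrieval', 'generation', 'prompt']
```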
The LLM-as-judge approach means you do not need a massive human-labelled dataset to get started. We typically use gpt-4o-mini as the judge model for cost reasons. A 100-question eval set runs under $5 per evaluation pass. We schedule this as a weekly GitHub Actions cron against the production pipeline and alert on any metric that drops by more than 0.05 (five percentage points) week over week.
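The week-over-week alert reduces to comparing two dictionaries of scores. Here is a minimal sketch of that check, assuming the scheduled job stores last week's aggregate scores somewhere it can reload them. The function name is illustrative and the `this_week` numbers are made up.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.05,
) -> dict[str, float]:
    """Return each metric whose score fell by more than max_drop, with the size of the drop."""
    return {
        name: round(previous[name] - current[name], 4)
        for name in previous
        if name in current and previous[name] - current[name] > max_drop
    }

last_week = {"faithfulness": 0.93, "context_recall": 0.87, "answer_relevancy": 0.84}
this_week = {"faithfulness": 0.92, "context_recall": 0.79, "answer_relevancy": 0.84}  # hypothetical

drops = find_regressions(last_week, this_week)
print(drops)  # {'context_recall': 0.08}
```

In a CI job, a non-empty result would typically be turned into a non-zero exit code so the scheduled run fails visibly and triggers a notification.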
How we approach this on every project
The minimum viable evaluation setup we ship before any RAG system goes to production:
Before launch. Build a ground-truth eval set of at least 80 questions. Include long-tail queries from the alpha, edge cases the domain expert flagged, and any question type where the answer changes frequently (regulations, pricing, policies). For each question, write a ground-truth answer. It does not need to be perfect — trends matter more than absolute scores.
At launch. Wire up the real-time faithfulness gate. This is the single highest-leverage addition to the response path. Generate the answer, score it, return the fallback if it is below threshold. Every project gets this before the first user touches it.
After launch. Run the full three-metric eval weekly. Track the trend lines, not just the current snapshot. A two-week downward trend in context recall usually signals that new documents added to the corpus have broken the chunking assumptions, or that the query distribution has shifted.
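Trend tracking needs only the history of weekly scores for each metric. The helper below is a sketch of one simple rule, flagging a metric that fell in each of the last two runs. The name `is_declining` and the sample histories are illustrative.

```python
def is_declining(history: list[float], periods: int = 2) -> bool:
    """True if the score fell in each of the last `periods` consecutive runs."""
    if len(history) < periods + 1:
        return False  # not enough runs yet to call it a trend
    recent = history[-(periods + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))

# Hypothetical weekly context recall scores, oldest first.
print(is_declining([0.87, 0.85, 0.82]))  # True
print(is_declining([0.87, 0.88, 0.82]))  # False
```

A rule like this catches slow drift that never trips a single-week drop alert.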
For the bank project, the eval setup took three days. The faithfulness gate took one afternoon. The context recall fix (sentence-window retrieval) took a sprint. At the three-month mark, faithfulness was at 0.93, context recall at 0.87, and answer relevance at 0.84. The compliance expert who had originally raised the alarm became the system's most vocal internal advocate.
The difference between shipping a demo and shipping a defensible system is these three numbers. Measure them before your first real user finds the failures for you.








