SapotaCorp

RAG Evaluation with RAGAs: Faithfulness, Context Recall, and Answer Relevance

When a Vietnamese bank's internal AI assistant started confidently quoting compliance rules that did not exist in any document, the team discovered they had been testing the wrong thing entirely. This post walks through how we set up RAGAs evaluation on that project, what faithfulness, context recall, and answer relevance each actually measure, and how the three metrics together gave us a diagnostic that the "looks good in demo" approach never could.

Key takeaways

  • Faithfulness catches the generation layer asserting facts that the retrieved context does not support, including ones that contradict it. A score below 0.85 in production means confident wrong answers are reaching users.
  • Context recall is a retrieval-layer metric: a faithfulness score of 1.0 combined with low context recall means the system is accurately summarizing incomplete information, which is just as dangerous.
  • Answer relevance penalises over-hedged, padded responses that technically address the question but bury the useful content — high faithfulness and recall can still coexist with low relevance.
  • The LLM-as-judge pattern makes RAGAs practical without a large human-labelled dataset: a small model like gpt-4o-mini as the evaluator costs under $5 per 100-question weekly eval run.
  • Wire up a real-time faithfulness gate in the response path before launch: blocking answers below the threshold and returning a 'cannot find a confident answer' fallback cut user-reported wrong answers by about 55% on the bank project.

Six months into production, a compliance assistant we built for a Vietnamese bank started failing in a specific and alarming way. The retrieval logs showed the system was finding the right regulatory documents. The generation logs showed coherent, well-structured answers. But the answers included penalty thresholds and approval conditions that did not appear in any retrieved chunk — and in some cases directly contradicted the source text.

The team had been running a green-emoji / red-emoji Notion sheet as their eval process. Manual spot-checks every two weeks. The failure mode had probably been present for a month before anyone caught it, and it took a domain expert on the client side — not engineering — to raise the flag.

This is what "testing the wrong thing" looks like in production RAG. The system appeared to work because retrieval was working. The generation step was hallucinating confidently, and nothing in our monitoring caught it.

Setting up RAGAs properly is what we should have shipped from week one.

What RAGAs actually is

RAGAs (Retrieval Augmented Generation Assessment) is an evaluation framework that decomposes RAG quality into separate, independently measurable metrics. Instead of a single "accuracy" score — which collapses structurally different failure modes into one number — RAGAs gives you a score per layer of the pipeline.

The three metrics that carry the most diagnostic weight in our production work are faithfulness, context recall, and answer relevance. Each one points at a different place in the stack when something is wrong.

Faithfulness: is the answer grounded in the retrieved context?

Faithfulness answers the question: of everything the system claimed in its response, how much of it is actually supported by the retrieved chunks?

The computation works as follows. An LLM-as-judge breaks the generated answer into atomic claims — individual factual statements — and then checks each one against the retrieved context. The score is the proportion of claims that are grounded.
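
The arithmetic behind that proportion can be sketched without any LLM call. This is a minimal illustration, assuming the judge's per-claim verdicts are already available; the `ClaimVerdict` type, the `faithfulness_score` helper, and the third (unsupported) claim are hypothetical and not part of RAGAs.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str       # one atomic statement extracted from the answer
    supported: bool  # judge verdict: is the claim backed by the retrieved context?


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Proportion of claims the judge marked as supported by the context."""
    if not verdicts:
        # No claims to check. Returning 0.0 is a choice made for this sketch.
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


verdicts = [
    ClaimVerdict("The penalty is 2% per month.", True),
    ClaimVerdict("The penalty is capped at 10%.", True),
    # Hypothetical unsupported claim, invented for illustration.
    ClaimVerdict("The penalty is waived for first-time filers.", False),
]
print(round(faithfulness_score(verdicts), 2))  # 0.67
```

Two of three claims are grounded, so the score is about 0.67. One invented claim in a short answer is enough to pull the score well below the thresholds discussed later in this section.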

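# Column names vary between ragas releases; this snippet assumes one that accepts
# question / answer / contexts / ground_truth. The default judge model needs an
# OPENAI_API_KEY in the environment.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# One evaluation row: the user question, the generated answer, the retrieved
# chunks, and a reference answer (faithfulness itself does not read ground_truth).
data = {
    "question": ["What is the penalty for late filing?"],
    "answer": ["The penalty is 2% per month up to a maximum of 10%."],
    "contexts": [["Late filing incurs a 2% monthly penalty capped at 10% of the assessed amount."]],
    "ground_truth": ["The penalty is 2% per month with a 10% ceiling."]
}

dataset = Dataset.from_dict(data)
result = evaluate(dataset, metrics=[faithfulness])
# Expected to be at or near 1.0, since both claims are grounded in the context.
print(result["faithfulness"])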

In the bank project, faithfulness on a 120-question eval set came out at 0.71. That meant roughly three in ten claims the system made were not traceable to any retrieved document. For a compliance assistant, that number is catastrophic.

The practical threshold we work from: above 0.90 for general-purpose assistants, above 0.95 for anything compliance-sensitive. Below 0.85 in production, the system is manufacturing facts at a rate users cannot detect.

The fix that moves the metric fastest is a real-time faithfulness gate in the response path. After generating the answer, score it with a lightweight judge model. If the score falls below the threshold, return a fallback response — something like "I cannot find a confident answer for this in the knowledge base" — instead of shipping the hallucination. For the bank project, adding this gate cut user-reported wrong answers by about 55% before we had fixed any of the underlying chunking or retrieval issues.
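
The gate described above could look roughly like the sketch below. It is an illustration under assumptions, not the code from the bank project: `gate_answer`, `stub_judge`, and the 0.95 default threshold are hypothetical names and values, and the judge is injected as a plain function so the gate logic can be exercised without calling a model.

```python
from typing import Callable

FALLBACK = "I cannot find a confident answer for this in the knowledge base."


def gate_answer(
    answer: str,
    contexts: list[str],
    score_fn: Callable[[str, list[str]], float],
    threshold: float = 0.95,  # assumed default; tune per project
) -> str:
    """Return the answer only if its faithfulness score clears the threshold."""
    try:
        score = score_fn(answer, contexts)
    except Exception:
        # Fail closed: if the judge is unavailable, do not ship an unchecked answer.
        return FALLBACK
    return answer if score >= threshold else FALLBACK


def stub_judge(answer: str, contexts: list[str]) -> float:
    """Stand-in for a judge model: 1.0 if the answer appears verbatim in a chunk."""
    return 1.0 if any(answer in chunk for chunk in contexts) else 0.0


ctx = ["Late filing incurs a 2% monthly penalty capped at 10% of the assessed amount."]
print(gate_answer("a 2% monthly penalty", ctx, stub_judge))  # passes the gate
print(gate_answer("a 5% flat fee", ctx, stub_judge))         # replaced by the fallback
```

Failing closed when the judge errors out is a design choice in this sketch. It trades some availability for the guarantee that no unscored answer reaches a user.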

Context recall: did retrieval surface what was needed?

Context recall measures whether the retrieval layer actually brought back the information needed to answer the question correctly.

The computation requires a ground-truth answer. A judge model identifies the atomic facts present in the ground-truth answer and checks how many of those facts exist in the retrieved chunks. A score of 0.6 means 40% of the facts required to answer correctly were never retrieved — and no amount of prompt engineering will recover them at generation time.

from ragas.metrics import context_recall

result = evaluate(dataset, metrics=[context_recall])
print(result["context_recall"])

This metric is the one that reveals retrieval as the actual bottleneck, which teams frequently misidentify as a generation problem. We have seen projects where the team spent three weeks rewriting system prompts because the assistant was giving incomplete answers, when context recall was sitting at 0.55 and the information simply was not making it into the prompt.

The floor we target is 0.80. Below 0.70 is a hard signal to pause on prompt work and fix retrieval first. Practical levers: increase top-k chunks retrieved, switch to hybrid search (BM25 plus dense vector), add query expansion or HyDE, or revisit chunk size and overlap. Context recall tells you which direction to pull.
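
For the hybrid-search lever, one common way to merge a BM25 ranking with a dense-vector ranking is reciprocal rank fusion (RRF). The text above does not say which fusion method was used, so RRF is an assumption here, as are the function name, the chunk ids, and the conventional constant `k = 60`.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it contains.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_ranking = ["c3", "c1", "c7"]   # hypothetical keyword-search results
dense_ranking = ["c1", "c4", "c3"]  # hypothetical vector-search results
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['c1', 'c3', 'c4', 'c7']
```

Chunks that appear in both lists (`c1`, `c3`) rise to the top, which is the behaviour that tends to help context recall when keyword and semantic matches each miss part of the answer.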

For the bank project, context recall was 0.68 on the initial eval. The primary issue was that regulatory documents had been chunked at a fixed 512-token boundary that split many rule definitions across two chunks. The retrieval system would find one chunk but not the adjacent one. Switching to sentence-window retrieval — indexing individual sentences but expanding the retrieved context to include surrounding sentences — brought context recall from 0.68 to 0.84 in one iteration.
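
The sentence-window idea can be sketched in a few lines. This is a simplified illustration, not the production implementation: the sample document text is invented, the sentence splitter is naive, and a keyword match stands in for the vector search that would normally pick the hit sentence.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def expand_window(sentences: list[str], hit_index: int, window: int = 1) -> str:
    """Return the hit sentence plus `window` neighbours on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])


# Invented document text, for illustration only.
doc = (
    "Rule 12 applies to late filings. "
    "The penalty is 2% per month. "
    "The penalty is capped at 10% of the assessed amount. "
    "Rule 13 covers amended returns."
)
sentences = split_sentences(doc)
# Stand-in for a vector-search hit on a single indexed sentence.
hit = next(i for i, s in enumerate(sentences) if "capped" in s)
print(expand_window(sentences, hit))
```

The index stays fine-grained (one sentence per entry), while the text handed to the generator includes the neighbouring sentences, so a rule definition split across a boundary is more likely to arrive whole.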

Answer relevance: does the answer actually address the question?

Answer relevance captures something different from the first two metrics. A response can be fully faithful to the context and successfully recall all required information, and still score poorly on relevance if it is padded, evasive, or addresses a slightly different question than the one asked.

The computation uses a reverse-generation approach. The judge model generates several hypothetical questions that the given answer could plausibly be answering. Those questions and the actual user question are then embedded, and the score is the average cosine similarity between each hypothetical question's embedding and the original question's embedding. If the generated answer is truly on-topic, the hypothetical questions should cluster tightly around the original.
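
The scoring step can be illustrated with plain vectors. This sketch assumes the questions have already been turned into embedding vectors by some embedding model; the two-dimensional toy vectors and the helper names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the real question and the reverse-generated ones."""
    if not generated_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


question = [1.0, 0.0]                    # toy embedding of the user question
on_topic = [[1.0, 0.0], [0.8, 0.6]]      # reverse-generated questions close to it
off_topic = [[0.0, 1.0], [0.6, 0.8]]     # reverse-generated questions that drifted
print(round(answer_relevance(question, on_topic), 2))   # 0.9
print(round(answer_relevance(question, off_topic), 2))  # 0.3
```

An answer that drifts produces reverse-generated questions pointing in a different direction from the original, and the mean similarity drops accordingly.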

from ragas.metrics import answer_relevancy

result = evaluate(dataset, metrics=[answer_relevancy])
print(result["answer_relevancy"])

We see this metric drop most often in two situations. First, when the system prompt instructs the model to "be comprehensive and include relevant context" — the model adds background information the user did not ask for, and the answer drifts from the question. Second, when the retrieved context contains partially relevant chunks and the model tries to hedge by addressing multiple possible interpretations of the question.

The floor we target is 0.80. The fix is usually prompt surgery: tighten the instruction to answer the specific question asked rather than provide comprehensive coverage, and reduce the number of retrieved chunks if the system is consistently pulling in tangential material.

Running all three together

In practice we run all three metrics together on every eval iteration. The combination tells you which layer to fix first.

Pattern | Diagnosis
Low faithfulness, high recall, high relevance | Generation hallucinating despite good context
High faithfulness, low recall, high relevance | Retrieval not surfacing enough information
High faithfulness, high recall, low relevance | Prompt causing drift from the actual question
Low faithfulness, low recall, low relevance | Everything needs work: start with retrieval

from ragas.metrics import faithfulness, context_recall, answer_relevancy

result = evaluate(
    dataset,
    metrics=[faithfulness, context_recall, answer_relevancy]
)
print(result.to_pandas()[["faithfulness", "context_recall", "answer_relevancy"]])
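
The pattern-to-diagnosis mapping listed above can be encoded as a small helper that runs after each eval. This is a sketch under assumptions: the `diagnose` function and its default floors (0.90, 0.80, 0.80) are illustrative, and the handling of mixed cases the mapping does not list (checking retrieval before generation before prompt) is an extrapolation of the "start with retrieval" advice.

```python
def diagnose(
    faithfulness: float,
    context_recall: float,
    answer_relevancy: float,
    floors: tuple[float, float, float] = (0.90, 0.80, 0.80),  # assumed floors
) -> str:
    """Map three metric scores to the layer that most likely needs work first."""
    low_faith = faithfulness < floors[0]
    low_recall = context_recall < floors[1]
    low_relevance = answer_relevancy < floors[2]

    if low_faith and low_recall and low_relevance:
        return "Everything needs work: start with retrieval"
    if low_recall:
        # Assumption: in mixed cases retrieval is fixed first.
        return "Retrieval not surfacing enough information"
    if low_faith:
        return "Generation hallucinating despite good context"
    if low_relevance:
        return "Prompt causing drift from the actual question"
    return "All three metrics are above their floors"


# Illustrative scores, not measurements from any project.
print(diagnose(0.71, 0.90, 0.85))  # Generation hallucinating despite good context
print(diagnose(0.95, 0.60, 0.85))  # Retrieval not surfacing enough information
```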

The LLM-as-judge approach means you do not need a massive human-labelled dataset to get started. We typically use gpt-4o-mini as the judge model for cost reasons. A 100-question eval set runs under $5 per evaluation pass. We schedule this as a weekly GitHub Actions cron against the production pipeline and alert on any metric that drops more than five percentage points week over week.
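
The week-over-week alert reduces to a comparison of two score dictionaries. The sketch below assumes the previous run's scores are stored somewhere the job can read them; the `regressions` helper and the sample numbers are hypothetical, and in a CI job a non-empty result would typically be turned into a failing exit code or a notification.

```python
def regressions(
    previous: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.05,  # five percentage points on a 0-1 scale
) -> dict[str, float]:
    """Return the metrics whose score fell by more than max_drop since the last run."""
    drops: dict[str, float] = {}
    for name, prev_score in previous.items():
        if name not in current:
            continue  # metric not computed this week; nothing to compare
        drop = prev_score - current[name]
        if drop > max_drop:
            drops[name] = round(drop, 4)
    return drops


# Illustrative scores only.
last_week = {"faithfulness": 0.93, "context_recall": 0.87, "answer_relevancy": 0.84}
this_week = {"faithfulness": 0.92, "context_recall": 0.79, "answer_relevancy": 0.84}
print(regressions(last_week, this_week))  # {'context_recall': 0.08}
```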

How we approach this on every project

The minimum viable evaluation setup we ship before any RAG system goes to production:

Before launch. Build a ground-truth eval set of at least 80 questions. Include long-tail queries from the alpha, edge cases the domain expert flagged, and any question type where the answer changes frequently (regulations, pricing, policies). For each question, write a ground-truth answer. It does not need to be perfect — trends matter more than absolute scores.

At launch. Wire up the real-time faithfulness gate. This is the single highest-leverage addition to the response path. Generate the answer, score it, return the fallback if it is below threshold. Every project gets this before the first user touches it.

After launch. Run the full three-metric eval weekly. Track the trend lines, not just the current snapshot. A two-week downward trend in context recall usually signals that new documents added to the corpus have broken the chunking assumptions, or that the query distribution has shifted.

For the bank project, the eval setup took three days. The faithfulness gate took one afternoon. The context recall fix (sentence-window retrieval) took a sprint. At the three-month mark, faithfulness was at 0.93, context recall at 0.87, and answer relevance at 0.84. The compliance expert who had originally raised the alarm became the system's most vocal internal advocate.

The difference between shipping a demo and shipping a defensible system is these three numbers. Get them before the first real user does.
