SapotaCorp

LLM-as-Judge in RAG Pipelines: How to Automate Quality Evaluation Without Ground Truth

Most RAG teams rely on gut feel and ad-hoc spot-checks to assess quality — until a user complaint forces a real audit. LLM-as-Judge lets you automate quality evaluation continuously, without needing a labelled ground-truth dataset, by using one model to score another's output across faithfulness, relevance, and completeness dimensions.

Key takeaways

  • LLM-as-Judge lets you score RAG outputs continuously without hand-labelled ground truth, making weekly or even per-request evaluation feasible at low cost.
  • Decompose quality into three separate dimensions — context relevance, faithfulness, and answer completeness — because each maps to a different failure mode with a different fix.
  • A small judge model (GPT-4o-mini, Claude Haiku) is accurate enough for scoring and costs roughly 50–100x less than using the production model for evaluation.
  • Prompt the judge to reason step-by-step before scoring; this chain-of-thought step significantly reduces inconsistent or arbitrary scores.
  • LLM-as-Judge scores drift when the judge model is updated; version-pin your judge and run a calibration check whenever you upgrade it.

Six months into a production deployment for a Vietnamese fintech company, our team got an escalation from the product owner. Users were complaining that the internal knowledge-base assistant gave confident answers that turned out to be wrong or incomplete. The system had been "passing testing" for months. When we asked for the test results, we got a Notion doc with forty queries colour-coded green and red by a junior engineer who had checked each one manually over a weekend.

That was the evaluation process. One person, one weekend, four weeks before launch, never run again.

We needed a way to score quality continuously — ideally on every request, certainly on a representative sample — without blocking the team on building another labelled dataset every sprint. That is when we built our LLM-as-Judge setup. A year later, it is part of every RAG project we deliver.

Why ground truth is the wrong starting point

The standard RAG evaluation playbook says: build a set of question-answer pairs, run your system against them, and compute metrics such as Answer Correctness against the gold answers. This is a solid approach for a stable, well-defined domain. For a knowledge base that keeps changing, it is almost impossible to sustain in practice.

Ground truth datasets are expensive to build, become stale as the knowledge base evolves, and never cover the long tail of real user queries. The fintech client's knowledge base was updated every two weeks with new product documentation. A ground-truth set built before launch was partially wrong by month two.

LLM-as-Judge sidesteps the ground-truth problem by scoring outputs against the retrieved context, not against a pre-baked answer. The judge does not need to know the right answer — it only needs to evaluate whether the generated response is consistent with what was retrieved, whether the retrieved chunks are relevant to the question, and whether the response covers what the question asked. These are questions a capable model can answer from the inputs alone.

The three dimensions we judge separately

We made the mistake early on of asking the judge for a single "quality score." The number was meaningless. A response could score 0.7 because the retrieval was good but the generation hallucinated, or because the retrieval was poor but the generation faithfully repeated the partial information it had. The number looked identical; the fix was completely different.

We now score three dimensions independently.

Context Relevance. Does the retrieved context actually address the question? This catches retrieval failures — the wrong chunks surfaced, the right chunks buried below the top-k cutoff, or the query not matching the vocabulary of the source documents.

Faithfulness. Does the generated response make claims that are supported by the retrieved context? This catches generation hallucinations — the model adding facts from training data or confabulating plausible details that are not in the source.

Answer Completeness. Does the response address what the question actually asked, given what was available in the context? This is a softer dimension but catches truncated or deflecting answers where the system had the relevant information but failed to use it.

Each score is a number between 0 and 1. Each score has its own alert threshold. A drop in Context Relevance triggers an investigation into chunking and retrieval. A drop in Faithfulness triggers a review of the generation prompt or temperature. A drop in Answer Completeness usually points to a prompt that is too conservative about what it will say.
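
The routing from a low score to its investigation can be sketched as a small lookup. This is a minimal illustration, not our production code: the names `JudgeScores`, `ALERT_THRESHOLDS`, `NEXT_STEP`, and `triage` are hypothetical, and the threshold values are placeholders to tune per project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScores:
    context_relevance: float
    faithfulness: float
    answer_completeness: float


# Hypothetical per-dimension alert thresholds (placeholders, tune per project).
ALERT_THRESHOLDS = {
    "context_relevance": 0.75,
    "faithfulness": 0.82,
    "answer_completeness": 0.70,
}

# Where to look first when a dimension falls below its threshold.
NEXT_STEP = {
    "context_relevance": "investigate chunking and retrieval",
    "faithfulness": "review the generation prompt or temperature",
    "answer_completeness": "check whether the prompt is too conservative",
}


def triage(scores: JudgeScores) -> list[str]:
    """Return one suggested investigation per dimension below its threshold."""
    actions = []
    for name, threshold in ALERT_THRESHOLDS.items():
        if getattr(scores, name) < threshold:
            actions.append(f"{name}: {NEXT_STEP[name]}")
    return actions
```

Keeping the thresholds and the follow-up actions in separate tables means a threshold can be re-anchored without touching the routing.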

The judge prompt that actually works

The biggest mistake we see teams make when implementing LLM-as-Judge is asking the judge for a score without asking it to reason first. A bare "rate this response from 0 to 1" prompt produces inconsistent results that vary with rephrasing and are impossible to debug.

The pattern that works is to ask the judge to reason through the evaluation before producing a score:

FAITHFULNESS_JUDGE_PROMPT = """
You are evaluating whether an AI assistant's response is grounded in the provided context.

QUESTION:
{question}

RETRIEVED CONTEXT:
{context}

ASSISTANT RESPONSE:
{response}

Step 1: List every factual claim made in the ASSISTANT RESPONSE.
Step 2: For each claim, determine whether it is explicitly supported by the RETRIEVED CONTEXT.
  - Mark claims as SUPPORTED or UNSUPPORTED.
Step 3: Calculate a faithfulness score as: supported_claims / total_claims.
  - If the response makes no factual claims, score is 1.0.

Output your reasoning in Steps 1-2, then output:
SCORE: <number between 0 and 1>
"""

The chain-of-thought in Steps 1 and 2 forces the judge to enumerate claims before scoring rather than pattern-matching on the surface feel of the response. In our calibration tests, adding this step reduced variance in repeated scoring of the same input from ±0.18 to ±0.06.
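
Because the judge writes its reasoning before the score, the caller has to pull the number out of free text. The gate function later in this article calls a `parse_score` helper that is not shown; the version below is one possible implementation, written as an assumption about how such a helper could behave. It takes the last `SCORE:` line and rejects values outside the 0 to 1 range.

```python
import re

# Matches a line that starts with "SCORE:" followed by a number such as 0.75 or 1.
_SCORE_RE = re.compile(r"^\s*SCORE:\s*([0-9]*\.?[0-9]+)", re.MULTILINE)


def parse_score(judge_output: str) -> float:
    """Extract the final 'SCORE: x' value from the judge's reply.

    Raises ValueError if no score line is present or the value is outside [0, 1].
    """
    matches = _SCORE_RE.findall(judge_output)
    if not matches:
        raise ValueError("judge output contains no 'SCORE:' line")
    score = float(matches[-1])  # use the last match, after the reasoning steps
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [0, 1]")
    return score
```

Raising on a missing or out-of-range score, rather than returning a default, keeps malformed judge replies from silently passing or failing a gate.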

For Context Relevance, the equivalent step is asking the judge to identify what the question is actually seeking, then check which sentences in the retrieved context address it:

CONTEXT_RELEVANCE_JUDGE_PROMPT = """
You are evaluating whether retrieved context is relevant to a question.

QUESTION:
{question}

RETRIEVED CONTEXT:
{context}

Step 1: Identify the core information need in the QUESTION (what must be known to answer it).
Step 2: For each sentence in RETRIEVED CONTEXT, mark it as RELEVANT or IRRELEVANT to the core information need.
Step 3: Calculate a relevance score as: relevant_sentences / total_sentences.

Output your reasoning in Steps 1-2, then output:
SCORE: <number between 0 and 1>
"""

Wiring it into the pipeline

Our production setup runs the judge at two points: inline on every response, and in a daily batch job.

The inline check runs Faithfulness only, because it is the highest-stakes dimension and the cheapest to compute. We use a small judge model — currently GPT-4o-mini — to keep latency under 300ms and cost under $0.001 per request. If a response scores below 0.82, we do not return it to the user. The system either retries with a rephrased query, or returns a fallback message.

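# Assumes `judge_llm` (an async client for the judge model) and `parse_score`
# (which extracts the number after "SCORE:") are defined elsewhere.
async def generate_with_faithfulness_gate(
    question: str,
    context: str,
    response: str,
    threshold: float = 0.82,
) -> dict:
    """Score an already-generated response and withhold it if it fails the gate."""
    # Ask the judge to enumerate claims and score faithfulness.
    judge_response = await judge_llm.complete(
        FAITHFULNESS_JUDGE_PROMPT.format(
            question=question,
            context=context,
            response=response,
        )
    )
    score = parse_score(judge_response)
    passed = score >= threshold
    return {
        # None signals the caller to retry or return a fallback message.
        "response": response if passed else None,
        "faithfulness_score": score,
        "passed_gate": passed,
    }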
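
The retry-or-fallback behaviour around the gate might look like the following sketch. Everything here is illustrative: `answer_with_gate`, the injected `generate`, `judge`, and `rephrase` callables, the fallback text, and the attempt limit are hypothetical names and values, chosen so the example runs without a real LLM client.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical fallback text returned when no attempt clears the gate.
FALLBACK_MESSAGE = "I could not find a reliable answer to that in the knowledge base."


async def answer_with_gate(
    question: str,
    generate: Callable[[str], Awaitable[tuple[str, str]]],  # query -> (context, response)
    judge: Callable[[str, str, str], Awaitable[float]],  # (question, context, response) -> score
    rephrase: Callable[[str], str],
    threshold: float = 0.82,
    max_attempts: int = 2,
) -> dict:
    """Return the first response that clears the gate, else a fallback message."""
    query = question
    score = 0.0
    for attempt in range(1, max_attempts + 1):
        context, response = await generate(query)
        # Always judge against the user's original question, not the rephrased query.
        score = await judge(question, context, response)
        if score >= threshold:
            return {"response": response, "faithfulness_score": score,
                    "passed_gate": True, "attempts": attempt}
        query = rephrase(query)  # retry retrieval with a rephrased query
    return {"response": FALLBACK_MESSAGE, "faithfulness_score": score,
            "passed_gate": False, "attempts": max_attempts}
```

Passing the generator and judge in as callables keeps the gate logic testable with stubs, for example `asyncio.run(answer_with_gate("fees", my_generate, my_judge, my_rephrase))` where the three callables are your own.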

The daily batch job runs all three dimensions on a random 5% sample of the previous day's traffic. Scores are written to a Postgres table. A GitHub Actions cron job picks up the table, computes 7-day rolling averages, and sends a Slack alert if any metric drops more than 0.05 from the prior week.
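
The alert rule can be sketched in a few lines. This is an assumption about one reasonable reading of "drops more than 0.05 from the prior week": the mean of the latest seven daily averages is compared with the mean of the seven days before that. The function name `rolling_drop_alerts` and its parameters are hypothetical, and reading from Postgres and posting to Slack are left out.

```python
from statistics import mean


def rolling_drop_alerts(
    daily_scores: dict[str, list[float]],
    window: int = 7,
    max_drop: float = 0.05,
) -> dict[str, float]:
    """Compare the latest `window`-day mean with the preceding window.

    `daily_scores` maps a metric name to its daily averages, oldest first.
    Returns {metric: drop} for every metric whose mean fell by more than `max_drop`.
    """
    alerts = {}
    for metric, series in daily_scores.items():
        if len(series) < 2 * window:
            continue  # not enough history for two full windows
        current = mean(series[-window:])
        previous = mean(series[-2 * window:-window])
        drop = previous - current
        if drop > max_drop:
            alerts[metric] = round(drop, 3)
    return alerts
```

Metrics with fewer than fourteen days of history are skipped rather than alerted on, which avoids noisy alerts in the first two weeks after launch.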

This is the setup in its entirety. No external LLMOps platform. About 200 lines of Python, a cron job, and a dashboard we built in 90 minutes on Retool's free tier.

What we found when we turned it on

For the fintech client, we ran the judge retroactively against two weeks of logged traffic from the previous month. The results were instructive.

Context Relevance averaged 0.71. This was the first hard evidence that the retrieval layer was the bottleneck, not the generation. The team had spent two sprints tweaking the generation prompt and changing models. The judge told us that 29% of retrieved context, on average, was not actually relevant to the question being asked. Hybrid search and metadata filtering brought that to 0.84 within one sprint.

Faithfulness averaged 0.88, which was acceptable but masked a tail. About 7% of responses scored below 0.75 — these were the responses users had been complaining about. After turning on the inline faithfulness gate, complaint volume dropped by roughly half in the first two weeks.

Answer Completeness averaged 0.79 across all queries, but for queries with multiple sub-questions embedded (common in the fintech domain: "what is the fee structure and how does it change for premium accounts?"), it dropped to 0.61. The generation prompt told the model to be concise, and the model was over-applying that instruction to multi-part questions. One prompt change fixed it.

The calibration problem you have to manage

LLM-as-Judge is not a set-and-forget system. The judge model's behaviour changes when the model is updated, even in patch versions. We discovered this when GPT-4o-mini received a minor update and our average Faithfulness score jumped from 0.88 to 0.94 overnight. The RAG system had not improved; the judge had become more lenient.

Our approach: version-pin the judge model, run a calibration check of 50 human-rated examples every time we upgrade it, and re-anchor the thresholds if the calibration delta is more than 0.05.

The 50 human-rated examples are not a full ground-truth dataset. They are a small anchor set used only to check whether the judge has drifted, not to evaluate the RAG system. Building 50 is achievable in a few hours; maintaining 50 is straightforward. The distinction matters: do not let the perfect (a full ground-truth dataset) be the enemy of the practical (a drift-detection anchor set).
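
A drift check against the anchor set can be as small as the sketch below. The article does not define "calibration delta" precisely, so this assumes one possible definition: the mean signed difference between the judge's scores and the human ratings on the same examples. The names `calibration_bias` and `needs_reanchoring` are hypothetical.

```python
from statistics import mean


def calibration_bias(human: list[float], judge: list[float]) -> float:
    """Mean signed difference (judge minus human) over the anchor set.

    A positive value means the judge is more lenient than the human raters.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    return mean(j - h for h, j in zip(human, judge))


def needs_reanchoring(
    human: list[float], judge: list[float], max_delta: float = 0.05
) -> bool:
    """True when the judge has drifted from the human ratings by more than `max_delta`."""
    return abs(calibration_bias(human, judge)) > max_delta
```

A signed bias has the side benefit of telling you which direction to move the thresholds; a mean absolute difference would catch noisier judges but loses that direction.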

How we approach this on every project

Before any RAG system goes to production, we wire in three things: an inline faithfulness gate at the response layer, a nightly batch scorer for context relevance and answer completeness on a traffic sample, and a weekly trend report. The gate runs against a small judge model. The batch job runs on the prior day's traffic sample. The trend report is a Slack message with three numbers and a sparkline.

The total engineering time to add this to an existing RAG system is two to three days. The cost per day of operation for a system handling a few thousand daily queries is under five dollars. The alternative — discovering quality issues from user complaints — costs far more in trust and in engineering time spent doing post-hoc forensics without instrumentation.
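
As a rough check on the daily cost figure, the arithmetic can be written out. This is a back-of-the-envelope sketch under stated assumptions: every judge call costs the $0.001 ceiling quoted for the inline check, the batch job scores three dimensions on a 5% sample, and `daily_judge_cost` is a hypothetical helper name.

```python
def daily_judge_cost(
    daily_queries: int,
    cost_per_call: float = 0.001,  # assumed: the inline cost ceiling, applied to all calls
    sample_rate: float = 0.05,  # batch sample size
    batch_dimensions: int = 3,  # dimensions scored in the batch job
) -> float:
    """Estimated daily judge spend in dollars: inline gate plus batch scoring."""
    inline = daily_queries * cost_per_call
    batch = daily_queries * sample_rate * batch_dimensions * cost_per_call
    return inline + batch
```

For 3,000 daily queries this comes to about $3.45 (3.00 inline plus 0.45 for the batch), which fits under the five-dollar figure; actual spend depends on prompt length and the judge model's pricing.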

The judge is not a replacement for a proper evaluation dataset. When we have the time and the domain stability to build one, we build it. But on every project, regardless of whether a full eval set exists, the judge runs. It is the minimum viable quality signal that lets the team manage a production system rather than just hope it works.

If your team is running a RAG system in production without automated quality scoring, you are not just flying blind. You are flying blind with the altimeter packed in the trunk.
