SapotaCorp

Re-ranking in RAG: Why Cross-Encoders Beat Bi-Encoders for Final Relevance Scoring

When we deployed our first RAG system for a Vietnamese bank's internal knowledge base, retrieval precision was sitting at 61% — good enough to demo, nowhere near production-ready. Adding a cross-encoder re-ranker in the final stage pushed that number to 84% without touching the embedding model or chunking strategy. This post explains why that works, and how we structure re-ranking in every RAG pipeline we ship.

Key takeaways

  • Bi-encoders encode query and document independently, making them fast but blind to query-document interaction — that gap is exactly where cross-encoders win.
  • Cross-encoders process query and document together in a single forward pass, capturing token-level attention across both, which is why their scores separate the truly relevant chunk from topically adjacent ones far more reliably.
  • Re-ranking is a two-stage strategy: retrieve broadly with a bi-encoder (top-50 to top-100), then re-score narrowly with a cross-encoder (top-5 to top-10) — you get accuracy without paying full cross-encoder latency on the whole corpus.
  • In production, cross-encoder latency of 20–80 ms per pair is acceptable when you cap candidate set size; batching and async calls keep p95 under 200 ms for typical RAG flows.
  • Off-the-shelf models like BAAI/bge-reranker-large or Cohere Rerank API are strong starting points, but domain-specific fine-tuning on real user queries delivers the biggest precision jump.

The Problem That Sent Us Back to the Drawing Board

We were six weeks into a document Q&A system for a Vietnamese bank — internal policy docs, compliance guidelines, product manuals. The embedding-based retrieval looked fine in unit tests. In production review with the client's compliance team, they'd ask a question like "What are the conditions for early loan termination without penalty?" and the top result would be a paragraph about standard termination procedures that technically mentioned "penalty" but had nothing to do with the conditions. The answer the LLM generated was plausible-sounding and wrong.

The root cause wasn't the LLM. It wasn't the chunking. It was retrieval surfacing documents that were topically adjacent but not query-relevant. The embedding similarity was high enough to pass the threshold — the semantic neighbourhood was right, the specific answer was not.

That's the structural limitation of bi-encoders, and it's the reason re-ranking exists.

What Bi-Encoders Actually Do (and Why That's a Problem)

Bi-encoders — the embedding models you use to build your vector index — are trained to map text into a dense vector space where similar texts land close together. At query time, you embed the query, run an approximate nearest-neighbour search, and return the closest document chunks.

The critical detail: the query and each document are encoded independently. The model never sees them together. Similarity is computed post-hoc as a dot product or cosine score between two separate embeddings.

This independence is exactly what makes bi-encoders fast and scalable. You precompute all document embeddings offline. At query time you only embed the query (one forward pass) and do a vector search. For a corpus of millions of chunks, this is the only practical approach.

But independence has a cost. The model can't attend to how specific tokens in the query relate to specific tokens in the document. "Conditions for early termination without penalty" and "standard termination procedures" share a lot of vocabulary. Their embeddings will be close. But a model that can read both simultaneously and compare them word by word would immediately see the query is asking about a narrow conditional case the second document doesn't address.
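
To make the independence concrete, here is a minimal sketch. `toy_embed` is a hypothetical stand-in (a bag-of-words count vector, not a real embedding model) so the example runs without downloading anything, and the document strings are made up for illustration. The point is only the shape of the computation: documents are vectorised once with no knowledge of the query, and similarity is computed afterwards between two separate vectors.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a bi-encoder: bag-of-words counts, not a real model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "standard termination procedures and penalty fees for early termination",
    "early termination is permitted without penalty when conditions are met",
    "opening hours for branch offices",
]

# Offline: every document is embedded once, before any query exists.
doc_vectors = [toy_embed(d) for d in docs]

def search(query: str, top_k: int = 2) -> list[tuple[float, int]]:
    """Embed the query on its own, then compare it against the stored vectors."""
    q = toy_embed(query)
    scored = [(cosine(q, v), i) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

results = search("conditions for early termination without penalty")
for score, index in results:
    print(f"{score:.2f}  {docs[index]}")
```

In this toy setup the two termination chunks score close together because they share most of the query's vocabulary, even though only one of them answers the question. A real embedding model is far better than word counts, but the structural limitation is the same: nothing in the pipeline ever reads the query and the document side by side.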

How Cross-Encoders Work

A cross-encoder takes the query and a candidate document as a single concatenated input and runs one forward pass through a transformer to produce a relevance score. The architecture is usually a BERT-style model with a classification head:

from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-large")

query = "Conditions for early loan termination without penalty"
candidates = [
    "Standard termination procedures require 30 days notice...",
    "Early termination is permitted without penalty when the borrower has maintained...",
    "Penalty fees apply to all accounts closed within the first 12 months..."
]

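# predict() scores each (query, document) pair in its own joint forward pass
scores = model.predict([(query, doc) for doc in candidates])
# Sort on the score only, highest first
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)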

Because query and document tokens attend to each other inside the transformer, the model captures relevance signals that independent embeddings miss: negation, conditional structure, entity overlap, semantic specificity. The score it produces tracks query-specific relevance far more closely than cosine similarity between independent embeddings.

The trade-off is obvious: you can't precompute anything. Every query-document pair requires a fresh inference. Running a cross-encoder over a million documents at query time is not viable. Which is why nobody does that.

The Two-Stage Architecture

The solution is to use bi-encoders and cross-encoders at different stages:

User query
    │
    ▼
[Stage 1: Bi-encoder retrieval]
Vector search → top-50 to top-100 candidates
    │
    ▼
[Stage 2: Cross-encoder re-ranking]
Score each candidate → sort → top-5 to top-10
    │
    ▼
LLM generation with re-ranked context

Stage 1 is cheap and fast: typically milliseconds even on a large corpus, because the search runs against a precomputed approximate nearest-neighbour index. You're casting a wide net, prioritising recall over precision. Getting the right document into the candidate set is the goal; ranking it accurately is not.

Stage 2 is expensive per pair, but the candidate set is small. Running a cross-encoder on 50 candidates at ~2 ms per pair takes about 100 ms, which is entirely acceptable for a user-facing system. You get the accuracy of a cross-encoder without paying per-pair inference across the whole corpus.
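
The arithmetic behind that budget can be sketched as a back-of-envelope estimate. It assumes the ~2 ms per pair figure quoted above and a cost that scales linearly with the number of pairs, ignoring any speed-up from batching. `rerank_latency_ms` is an illustrative helper, not part of any library.

```python
def rerank_latency_ms(n_candidates: int, ms_per_pair: float = 2.0) -> float:
    """Sequential cost estimate: one cross-encoder forward pass per candidate."""
    return n_candidates * ms_per_pair

two_stage = rerank_latency_ms(50)            # 50 candidates handed over by Stage 1
full_corpus = rerank_latency_ms(1_000_000)   # scoring every chunk in a 1M-chunk corpus

print(f"two-stage re-rank:   {two_stage:.0f} ms")
print(f"full-corpus scoring: {full_corpus / 60_000:.1f} minutes")
```

Under those assumptions the two-stage path costs 100 ms per query, while scoring a million chunks directly would take roughly half an hour per query.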

Here's how we wire this up in practice:

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
cross_encoder = CrossEncoder("BAAI/bge-reranker-large")

def retrieve_and_rerank(query: str, corpus_chunks: list[str], top_k_retrieve: int = 50, top_k_rerank: int = 5):
    # Stage 1: bi-encoder retrieval
    query_embedding = bi_encoder.encode(query, normalize_embeddings=True)
    # For brevity the chunks are embedded on every call; in production, compute
    # these once offline and query a vector index instead.
    chunk_embeddings = bi_encoder.encode(corpus_chunks, normalize_embeddings=True, batch_size=64)
    # Embeddings are normalised, so the dot product equals cosine similarity
    scores = np.dot(chunk_embeddings, query_embedding)
    top_indices = np.argsort(scores)[::-1][:top_k_retrieve]
    candidates = [corpus_chunks[i] for i in top_indices]

    # Stage 2: cross-encoder re-ranking
    pairs = [(query, doc) for doc in candidates]
    rerank_scores = cross_encoder.predict(pairs, batch_size=16)
    # Sort on the cross-encoder score only, highest first
    reranked = sorted(zip(rerank_scores, candidates), key=lambda pair: pair[0], reverse=True)

    return [doc for _, doc in reranked[:top_k_rerank]]

Production Considerations We've Learned the Hard Way

Candidate set size matters more than you think. We initially set top_k_retrieve to 20 to keep re-ranking fast. We were dropping relevant documents before the cross-encoder ever saw them. Moving to 50–100 candidates improved recall at Stage 1 and gave the cross-encoder more to work with. The latency increase was ~60 ms — worth it.
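
One way to check whether the candidate set is too small is to measure Stage 1 recall at several cut-offs on labelled queries. The sketch below assumes you have, per query, the chunk ids in bi-encoder order and a set of ids labelled relevant. The ids and rank positions are hypothetical, and `recall_at_k` is an illustrative helper.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the first k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical Stage 1 output for one query: unique chunk ids in bi-encoder order.
stage1_ranking = [f"chunk-{i}" for i in range(100)]
# Hypothetical labels: the relevant chunks sit at positions 4 and 38 (1-based).
relevant = {"chunk-3", "chunk-37"}

recalls = {k: recall_at_k(stage1_ranking, relevant, k) for k in (20, 50, 100)}
for k, value in recalls.items():
    print(f"recall@{k} = {value:.2f}")
```

In this made-up case a cut-off of 20 loses one of the two relevant chunks before the cross-encoder ever sees it, and a cut-off of 50 recovers it. Averaging the same number over a real query set shows where recall stops improving.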

Batching cross-encoder inference. Cross-encoders are transformer models; they benefit from batched inputs. Don't iterate pairs one by one. Pass the full list to predict() with an appropriate batch size. On a single A10G GPU we process 64 pairs in roughly the same time as 8 pairs sequentially.
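
In an async service, the batched call can also be moved off the event loop so other requests are not blocked while the model runs. The following is a sketch under assumptions: `score_pairs` is a hypothetical stand-in for `cross_encoder.predict(pairs, batch_size=...)` that scores by word overlap so the example runs without a model, and `rerank_async` is an illustrative name.

```python
import asyncio

def score_pairs(pairs: list[tuple[str, str]]) -> list[float]:
    """Hypothetical stand-in for cross_encoder.predict(); scores by word overlap."""
    scores = []
    for query, doc in pairs:
        q_words, d_words = set(query.lower().split()), set(doc.lower().split())
        scores.append(len(q_words & d_words) / max(len(q_words), 1))
    return scores

async def rerank_async(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score all pairs in one batched call, run in a worker thread."""
    pairs = [(query, doc) for doc in candidates]
    # asyncio.to_thread (Python 3.9+) keeps the blocking call off the event loop.
    scores = await asyncio.to_thread(score_pairs, pairs)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]

top = asyncio.run(rerank_async(
    "early termination without penalty",
    [
        "penalty fees apply to all accounts",
        "early termination without penalty is permitted",
        "branch opening hours",
    ],
    top_k=2,
))
print(top)
```

The full candidate list goes to the scorer in a single call, so the model can batch internally. Only the waiting is asynchronous.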

Hosted re-ranking APIs for getting to production fast. When we need to ship quickly without managing GPU infrastructure, Cohere's Rerank API is our go-to:

import cohere

co = cohere.Client(api_key="...")

results = co.rerank(
    query=query,
    documents=candidates,
    model="rerank-multilingual-v3.0",
    top_n=5
)
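# Map results back through r.index; depending on SDK version, the response only
# carries the document text when return_documents=True is passed.
ranked_docs = [candidates[r.index] for r in results.results]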

The multilingual model handles Vietnamese documents reasonably well. For a recent fintech project we used this in staging and only moved to a self-hosted model once query volume justified the GPU cost.

Domain fine-tuning is the real unlock. Off-the-shelf rerankers are trained on general web data. They perform well on broad queries but can miss domain-specific relevance patterns. For the banking client, "margin call" and "collateral top-up" are near-synonyms in context — a general model doesn't know that. We collected 400 query-document pairs from real user sessions (labeled relevant/not-relevant by the client's team) and fine-tuned bge-reranker-base for three epochs. That single fine-tuning step added another 7 percentage points of precision on the client's eval set.
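
Before fine-tuning on labelled pairs like these, it helps to hold out some queries for evaluation so the precision gain is measured on queries the model has not seen. The post does not describe how the split was done, so the sketch below is one possible approach and not the method used on the project. `split_by_query` and the sample rows are hypothetical.

```python
import hashlib

# (query, document, label) where label 1 = relevant and 0 = not relevant
Pair = tuple[str, str, int]

def split_by_query(pairs: list[Pair], eval_percent: int = 20) -> tuple[list[Pair], list[Pair]]:
    """Send every pair belonging to a given query to the same side of the split."""
    train, held_out = [], []
    for query, doc, label in pairs:
        # Stable hash of the query text, so the split is reproducible across runs.
        bucket = int(hashlib.sha256(query.encode("utf-8")).hexdigest(), 16) % 100
        target = held_out if bucket < eval_percent else train
        target.append((query, doc, label))
    return train, held_out

# Hypothetical labelled rows for illustration only.
labelled = [
    ("margin call rules", "A collateral top-up is required when the ratio falls", 1),
    ("margin call rules", "Branch opening hours are listed below", 0),
    ("early termination penalty", "Early termination is permitted without penalty when", 1),
]
train, held_out = split_by_query(labelled)
print(len(train), len(held_out))
```

Splitting by query, not by individual pair, avoids leaking a query's positive example into training while its negatives sit in the evaluation set.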

Score calibration between stages. Don't try to combine bi-encoder scores and cross-encoder scores into a single ranking formula — they live in incompatible spaces. Treat Stage 1 purely as a filter and Stage 2 as the authoritative ranking. The cross-encoder score is what you sort on.

When Re-ranking Doesn't Help

Re-ranking fixes ranking errors, not retrieval gaps. If the right document was never indexed, or your chunking strategy splits a critical passage across two chunks such that neither chunk is coherent on its own, the cross-encoder can't save you. We've seen teams add re-ranking and see no improvement — almost always because the underlying retrieval recall was below 40%. Fix chunking and retrieval first. Re-ranking is a precision layer, not a recall fix.

Also: for latency-critical applications (sub-100 ms SLA), cross-encoder re-ranking on the critical path may not be viable. In those cases, consider ColBERT-style late interaction as a middle ground — it encodes query and document separately but retains token-level vectors for a richer similarity computation than a single embedding. We covered ColBERT in a previous post in this series.
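
For readers who have not seen late interaction, the scoring rule can be sketched in a few lines: each query token takes the similarity of its best-matching document token, and those maxima are summed. The token vectors below are toy 2-dimensional values chosen by hand; a real ColBERT-style model produces them with a transformer. `maxsim_score` is an illustrative name and assumes both inputs are non-empty.

```python
def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late interaction: sum, over query tokens, of the best dot product with any doc token."""
    def dot(u: list[float], v: list[float]) -> float:
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token vectors (hypothetical values, one vector per token).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a match for both query tokens
doc_b_tokens = [[1.0, 0.0], [0.9, 0.1]]              # matches only the first query token

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
print(score_a, score_b)
```

Document token vectors can still be precomputed offline, which is what keeps this cheaper than a cross-encoder. The per-token comparison is what makes it richer than a single dot product.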

How We Approach This on Every Project

Our default RAG architecture now always includes a re-ranking stage. The specific setup depends on the project:

  • Fast prototypes: Cohere Rerank API, top_k_retrieve=30, top_k_rerank=5. Ship in a day.
  • Production with budget: Self-hosted bge-reranker-large on a shared GPU instance, async inference, candidate set of 50–80.
  • Production with domain specificity: Fine-tuned reranker on client-labeled data, evaluated monthly against a held-out query set using precision@5 and NDCG@5.
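
The two metrics named in the last setup can be computed with a few lines of standard-library Python. This is a minimal sketch assuming binary relevant/not-relevant labels, as in the labelled pairs described earlier. The function names and the sample ids are illustrative.

```python
import math

def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Share of the top-k results that are labelled relevant."""
    return sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant_ids) / k

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Binary-relevance NDCG: DCG of this ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(ranked_ids[:k], start=1)
        if chunk_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical re-ranked output for one query, with two chunks labelled relevant.
reranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c"}

p5 = precision_at_k(reranked, relevant)
n5 = ndcg_at_k(reranked, relevant)
print(f"precision@5 = {p5:.2f}, NDCG@5 = {n5:.3f}")
```

Precision@5 only counts how many relevant chunks made the cut, while NDCG@5 also rewards putting them near the top. Tracking both shows whether a change improves what reaches the LLM or only the order it arrives in.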

The 23-point precision improvement we saw on the banking project came primarily from adding re-ranking — not from switching embedding models, not from changing chunk size. It's the highest-leverage single addition we've made to RAG pipelines, and it's one of the first things we reach for when a client says "the answers aren't quite right."

Bi-encoders are how you search at scale. Cross-encoders are how you get the answer right. You need both.
