The Problem That Sent Us Back to the Drawing Board
We were six weeks into a document Q&A system for a Vietnamese bank — internal policy docs, compliance guidelines, product manuals. The embedding-based retrieval looked fine in unit tests. In production review with the client's compliance team, they'd ask a question like "What are the conditions for early loan termination without penalty?" and the top result would be a paragraph about standard termination procedures that technically mentioned "penalty" but had nothing to do with the conditions. The answer the LLM generated was plausible-sounding and wrong.
The root cause wasn't the LLM. It wasn't the chunking. It was retrieval surfacing documents that were topically adjacent but not query-relevant. The embedding similarity was high enough to pass the threshold — the semantic neighbourhood was right, the specific answer was not.
That's the structural limitation of bi-encoders, and it's the reason re-ranking exists.
What Bi-Encoders Actually Do (and Why That's a Problem)
Bi-encoders — the embedding models you use to build your vector index — are trained to map text into a dense vector space where similar texts land close together. At query time, you embed the query, run an approximate nearest-neighbour search, and return the closest document chunks.
The critical detail: the query and each document are encoded independently. The model never sees them together. Similarity is computed post-hoc as a dot product or cosine score between two separate embeddings.
This independence is exactly what makes bi-encoders fast and scalable. You precompute all document embeddings offline. At query time you only embed the query (one forward pass) and do a vector search. For a corpus of millions of chunks, this is the only practical approach.
But independence has a cost. The model can't attend to how specific tokens in the query relate to specific tokens in the document. "Conditions for early termination without penalty" and "standard termination procedures" share a lot of vocabulary. Their embeddings will be close. But a model that can read both simultaneously and compare them word by word would immediately see the query is asking about a narrow conditional case the second document doesn't address.
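To make the independence concrete, here is a minimal sketch that uses a bag-of-words counter as a stand-in for a bi-encoder. The `toy_embed` and `cosine` helpers and the two sample documents are assumptions for illustration, not the embedding model we used. The only property it demonstrates is that each text is encoded alone and compared afterwards.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a bi-encoder: encodes ONE text with no access to any other text."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Offline: document vectors are computed once, before any query exists.
docs = [
    "standard termination procedures and the penalty for early loan termination",
    "branch opening hours and holiday schedule",
]
doc_vectors = [toy_embed(d) for d in docs]

# Query time: only the query is encoded; similarity is computed post hoc.
query = "conditions for early loan termination without penalty"
query_vector = toy_embed(query)
scores = [cosine(query_vector, v) for v in doc_vectors]
```

The first document scores well on shared vocabulary even though it never states the conditions the query asks about. That is the failure mode described above, reproduced in miniature.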
How Cross-Encoders Work
A cross-encoder takes the query and a candidate document as a single concatenated input and runs one forward pass through a transformer to produce a relevance score. The architecture is usually a BERT-style model with a classification head:
from sentence_transformers import CrossEncoder

# Load a pretrained re-ranker; each (query, document) pair is scored jointly.
model = CrossEncoder("BAAI/bge-reranker-large")

query = "Conditions for early loan termination without penalty"
candidates = [
    "Standard termination procedures require 30 days notice...",
    "Early termination is permitted without penalty when the borrower has maintained...",
    "Penalty fees apply to all accounts closed within the first 12 months...",
]

# One relevance score per pair; higher means more relevant.
scores = model.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
Because query and document tokens attend to each other inside the transformer, the model captures relevance signals that independent embeddings miss: negation, conditional structure, entity overlap, semantic specificity. The score it produces separates relevant passages from merely related ones far better than cosine similarity between independent embeddings.
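The "single concatenated input" is easier to picture with a sketch. The helper below is hypothetical: it builds a BERT-style pair sequence by hand, using whitespace tokenization as a simplifying assumption. Real models use subword tokenizers, and the exact special tokens vary by model family.

```python
def build_pair_input(query: str, document: str) -> tuple[list[str], list[int]]:
    """Concatenate query and document into one BERT-style sequence."""
    # Whitespace tokenization is an assumption for readability only.
    query_tokens = query.lower().split()
    doc_tokens = document.lower().split()
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment ids: 0 for [CLS], the query and the first [SEP]; 1 for the document and final [SEP].
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    "early termination without penalty",
    "Early termination is permitted without penalty",
)
```

Every token in this sequence can attend to every other token, which is what lets the model relate "without penalty" in the query to the same phrase in the document.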
The trade-off is obvious: you can't precompute anything. Every query-document pair requires a fresh inference. Running a cross-encoder over a million documents at query time is not viable. Which is why nobody does that.
The Two-Stage Architecture
The solution is to use bi-encoders and cross-encoders at different stages:
User query
    │
    ▼
[Stage 1: Bi-encoder retrieval]
    Vector search → top-50 to top-100 candidates
    │
    ▼
[Stage 2: Cross-encoder re-ranking]
    Score each candidate → sort → top-5 to top-10
    │
    ▼
LLM generation with re-ranked context
Stage 1 is cheap and fast: with an approximate nearest-neighbour index, lookup typically takes milliseconds even on large corpora. You're casting a wide net, prioritising recall over precision. Getting the right document into the candidate set is the goal; ranking it accurately is not.
Stage 2 is expensive per pair, but the candidate set is small. Running a cross-encoder on 50 candidates at ~2 ms per pair takes about 100 ms, which is entirely acceptable for a user-facing system. You get the accuracy of a cross-encoder without paying its per-pair cost across the whole corpus.
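The arithmetic behind that budget, as a back-of-the-envelope sketch. The ~2 ms per pair figure comes from the text above; the function name is ours, and the linear model assumes pairs are scored one after another with no batching speed-up.

```python
def rerank_latency_ms(num_pairs: int, ms_per_pair: float = 2.0) -> float:
    """Estimated cross-encoder latency, assuming cost grows linearly with the number of pairs."""
    return num_pairs * ms_per_pair

candidate_latency = rerank_latency_ms(50)       # Stage 2 over 50 candidates: 100.0 ms
corpus_latency = rerank_latency_ms(1_000_000)   # the same model over a million chunks
corpus_minutes = corpus_latency / 1000 / 60     # roughly 33 minutes per query
```

The same per-pair cost that is negligible at 50 candidates becomes half an hour per query at corpus scale, which is why the cross-encoder only ever sees the Stage 1 shortlist.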
Here's how we wire this up in practice:
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
cross_encoder = CrossEncoder("BAAI/bge-reranker-large")

def retrieve_and_rerank(query: str, corpus_chunks: list[str], top_k_retrieve: int = 50, top_k_rerank: int = 5) -> list[str]:
    # Stage 1: bi-encoder retrieval
    query_embedding = bi_encoder.encode(query, normalize_embeddings=True)
    # Encoded inline to keep the example self-contained; in production these
    # embeddings are precomputed offline and served from a vector index.
    chunk_embeddings = bi_encoder.encode(corpus_chunks, normalize_embeddings=True, batch_size=64)
    # Embeddings are normalized, so the dot product equals cosine similarity.
    scores = np.dot(chunk_embeddings, query_embedding)
    top_indices = np.argsort(scores)[::-1][:top_k_retrieve]
    candidates = [corpus_chunks[i] for i in top_indices]

    # Stage 2: cross-encoder re-ranking
    pairs = [(query, doc) for doc in candidates]
    rerank_scores = cross_encoder.predict(pairs, batch_size=16)
    # Sort on the score only, so tied scores never fall back to comparing strings.
    reranked = sorted(zip(rerank_scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in reranked[:top_k_rerank]]
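When we want to unit-test the control flow without loading any models, a dependency-free version of the same pipeline is handy. Everything here is an illustrative assumption: `two_stage_rank` takes the two scorers as plain functions, `word_overlap` stands in for the bi-encoder, and `bigram_overlap` stands in for a cross-encoder only in the sense that it looks at query and document together.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def two_stage_rank(query: str, corpus: list[str], cheap_score: Scorer, joint_score: Scorer,
                   top_k_retrieve: int = 50, top_k_rerank: int = 5) -> list[str]:
    # Stage 1: cheap filter over the whole corpus (recall-oriented).
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:top_k_retrieve]
    # Stage 2: costly scorer over the survivors only; its ordering is final.
    reranked = sorted(candidates, key=lambda doc: joint_score(query, doc), reverse=True)
    return reranked[:top_k_rerank]

def word_overlap(query: str, doc: str) -> float:
    """Toy Stage 1 scorer: fraction of query words that appear anywhere in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def bigram_overlap(query: str, doc: str) -> float:
    """Toy Stage 2 scorer: number of adjacent query word pairs also adjacent in the document."""
    def bigrams(text: str) -> set[tuple[str, str]]:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(doc)))

corpus = [
    "standard early termination procedures without notice incur a penalty",
    "early termination is permitted without penalty when the borrower has maintained payments",
    "branch opening hours and holiday schedule",
]
query = "early termination without penalty"

stage1_top = max(corpus, key=lambda doc: word_overlap(query, doc))
final = two_stage_rank(query, corpus, word_overlap, bigram_overlap, top_k_retrieve=2, top_k_rerank=1)
```

Both termination documents contain all four query words, so Stage 1 cannot separate them and the first one wins the tie. Stage 2 sees that only the second document keeps "without penalty" together, and promotes it.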
Production Considerations We've Learned the Hard Way
Candidate set size matters more than you think. We initially set top_k_retrieve to 20 to keep re-ranking fast. We were dropping relevant documents before the cross-encoder ever saw them. Moving to 50–100 candidates improved recall at Stage 1 and gave the cross-encoder more to work with. The latency increase was ~60 ms — worth it.
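One way to check whether the candidate set is too small is to measure Stage 1 recall at several cut-offs against a handful of labeled queries. The sketch below uses made-up chunk ids and labels; `recall_at_k` is a hypothetical helper name, not something from our codebase.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved chunks."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical Stage 1 output for one query: 100 chunk ids in ranked order.
ranked_ids = [f"chunk-{i}" for i in range(100)]
# Hypothetical labels: one relevant chunk near the top, one at rank 35.
relevant_ids = {"chunk-3", "chunk-34"}

recall_by_k = {k: recall_at_k(ranked_ids, relevant_ids, k) for k in (20, 50, 100)}
```

In this toy case a cut-off of 20 drops half the relevant chunks before the cross-encoder sees them, while 50 keeps both. Averaging the same measurement over real labeled queries shows where the recall curve flattens.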
Batching cross-encoder inference. Cross-encoders are transformer models; they benefit from batched inputs. Don't iterate pairs one by one. Pass the full list to predict() with an appropriate batch size. On a single A10G GPU we process 64 pairs in roughly the same time as 8 pairs sequentially.
Hosted re-ranking APIs for getting to production fast. When we need to ship quickly without managing GPU infrastructure, Cohere's Rerank API is our go-to:
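import cohere

co = cohere.Client(api_key="...")
results = co.rerank(
    query=query,
    documents=candidates,
    model="rerank-multilingual-v3.0",
    top_n=5,
)
# Map results back through r.index rather than reading the echoed document text,
# since whether the response includes the text depends on SDK version and request options.
ranked_docs = [candidates[r.index] for r in results.results]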
The multilingual model handles Vietnamese documents reasonably well. For a recent fintech project we used this in staging and only moved to a self-hosted model once query volume justified the GPU cost.
Domain fine-tuning is the real unlock. Off-the-shelf rerankers are trained on general web data. They perform well on broad queries but can miss domain-specific relevance patterns. For the banking client, "margin call" and "collateral top-up" are near-synonyms in context — a general model doesn't know that. We collected 400 query-document pairs from real user sessions (labeled relevant/not-relevant by the client's team) and fine-tuned bge-reranker-base for three epochs. That single fine-tuning step added another 7 percentage points of precision on the client's eval set.
Score calibration between stages. Don't try to combine bi-encoder scores and cross-encoder scores into a single ranking formula — they live in incompatible spaces. Treat Stage 1 purely as a filter and Stage 2 as the authoritative ranking. The cross-encoder score is what you sort on.
When Re-ranking Doesn't Help
Re-ranking fixes ranking errors, not retrieval gaps. If the right document was never indexed, or your chunking strategy splits a critical passage across two chunks such that neither chunk is coherent on its own, the cross-encoder can't save you. We've seen teams add re-ranking and see no improvement — almost always because the underlying retrieval recall was below 40%. Fix chunking and retrieval first. Re-ranking is a precision layer, not a recall fix.
Also: for latency-critical applications (sub-100 ms SLA), cross-encoder re-ranking on the critical path may not be viable. In those cases, consider ColBERT-style late interaction as a middle ground — it encodes query and document separately but retains token-level vectors for a richer similarity computation than a single embedding. We covered ColBERT in a previous post in this series.
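For readers who missed that post, the core of late interaction fits in a few lines. This is a minimal sketch of MaxSim-style scoring on hand-written two-dimensional token vectors. The vectors and function names are assumptions for illustration; a real system would take them from a trained encoder and use an optimized index.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: each query token takes its best-matching document token; the maxima are summed."""
    # Assumes doc_vecs is non-empty; max() raises ValueError on an empty document.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token vectors: two query tokens, two document tokens.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.5, 0.5]]   # can be precomputed offline, like bi-encoder embeddings

score = maxsim_score(query_vecs, doc_vecs)   # 1.0 + 0.5 = 1.5
```

Document token vectors are still computed offline, so the query-time work is a set of dot products rather than a transformer forward pass per pair.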
How We Approach This on Every Project
Our default RAG architecture now always includes a re-ranking stage. The specific setup depends on the project:
- Fast prototypes: Cohere Rerank API, top_k_retrieve=30, top_k_rerank=5. Ship in a day.
- Production with budget: Self-hosted bge-reranker-large on a shared GPU instance, async inference, candidate set of 50–80.
- Production with domain specificity: Fine-tuned reranker on client-labeled data, evaluated monthly against a held-out query set using precision@5 and NDCG@5.
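For reference, the two metrics in that evaluation can be computed as follows. This is a sketch under assumptions: the function names and sample ids are ours, relevance labels are binary, and NDCG takes its ideal ordering from the full set of labeled documents for the query, which is one of several conventions in use.

```python
import math

def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Share of the top-k results that are labeled relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 5) -> float:
    """DCG of the top-k ranking divided by the DCG of the best possible ordering of the labels."""
    def dcg(gains: list[int]) -> float:
        # Position i (0-based) is discounted by log2(i + 2), so rank 1 has no discount.
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical re-ranked output and labels for one held-out query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
labels = {"d1": 1, "d2": 1}

p5 = precision_at_k(ranked, set(labels), k=5)   # 2 relevant out of 5
ndcg5 = ndcg_at_k(ranked, labels, k=5)
perfect = ndcg_at_k(["d1", "d2", "d3", "d7", "d9"], labels, k=5)
```

Precision@5 only counts how many relevant chunks made the cut; NDCG@5 also rewards placing them higher, which matters when the LLM weights earlier context more heavily.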
The 23-point precision improvement we saw on the banking project came primarily from adding re-ranking — not from switching embedding models, not from changing chunk size. It's the highest-leverage single addition we've made to RAG pipelines, and it's one of the first things we reach for when a client says "the answers aren't quite right."
Bi-encoders are how you search at scale. Cross-encoders are how you get the answer right. You need both.








