We were nine weeks into a document assistant project for a fintech client when a compliance officer asked a question that the system consistently got wrong. The document it needed was in the index. The query was semantically clear. Hybrid search was already enabled. We had spent two sprints tuning chunk size and reranking weights. The system still missed it.
The culprit turned out to be the embedding model. We had inherited text-embedding-ada-002 from the team's original prototype. When we swapped to text-embedding-3-large with the same chunks and the same retrieval configuration, the failing test case passed. A 100-question eval run confirmed the pattern: recall@5 moved from 0.71 to 0.87.
That experience is why embedding model selection is the first thing we audit when a RAG pipeline underperforms. Most teams treat it as a one-time decision made during the tutorial phase and never revisit it. In practice it is one of the highest-leverage variables in the system.
What the embedding model is actually doing
The embedding model converts text into a fixed-length vector. When a user asks a question, the query goes through the same model and becomes a vector. Retrieval is then a nearest-neighbor search in that vector space.
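As a minimal sketch of that nearest-neighbor step, the snippet below ranks chunks by cosine similarity to a query vector. The three-dimensional vectors and chunk names are made up for illustration; a real model emits hundreds or thousands of dimensions, and a vector database replaces the linear scan with an approximate index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=5):
    """Return the k chunk ids whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]

# Toy "embeddings" (hypothetical values, not output from any real model).
index = {
    "amortization_table": [0.9, 0.1, 0.0],
    "office_holiday_policy": [0.0, 0.2, 0.9],
    "loan_fee_schedule": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedded query "loan repayment schedule"

nearest = top_k(query, index, k=2)
print(nearest)
```

Everything retrieval can return is decided by where the model places these vectors; the search itself is only a sort by distance.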
The critical property is the model's learned notion of semantic distance. If "loan repayment schedule" and "amortization table" land close together in the model's space, retrieval works. If they do not, no amount of reranking fixes the problem, because the reranker only sees the top-k chunks the embedding model already surfaced.
This means the embedding model defines the ceiling for retrieval quality. Everything else in the pipeline — chunking, hybrid search, reranking, metadata filtering — operates within that ceiling. Raising the ceiling is worth the effort.
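The ceiling argument can be shown in a few lines. In this sketch the chunk ids and reranker scores are hypothetical; the point is only that a reranker reorders the candidate list it receives and cannot add a chunk that retrieval never surfaced.

```python
def rerank(candidates, scores):
    """Reorder candidates by a (hypothetical) reranker score, best first."""
    return sorted(candidates, key=lambda cid: scores.get(cid, 0.0), reverse=True)

# Top-5 surfaced by the embedding model; the relevant chunk "c42" is not among them.
retrieved = ["c07", "c13", "c21", "c30", "c55"]
relevant = "c42"

# Even a reranker that would score the relevant chunk perfectly cannot promote it,
# because it only sees what retrieval handed over.
reranker_scores = {"c42": 1.0, "c13": 0.8, "c07": 0.6, "c21": 0.4, "c30": 0.3, "c55": 0.1}
reranked = rerank(retrieved, reranker_scores)

hit_before = relevant in retrieved
hit_after = relevant in reranked
print(reranked, hit_before, hit_after)
```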
The three variables that actually move retrieval quality
1. Domain fit
General-purpose embedding models are trained on web text. A model that excels at matching a question about machine learning to a Wikipedia article may struggle to match a legal query like "force majeure clause scope" to a contract excerpt, because the semantic structure of legal language diverges from web prose.
We have found two reliable signals for domain mismatch:
- The model's MTEB leaderboard score is high but in-domain eval recall is below 0.75.
- The failing queries tend to use domain-specific phrasing that does not appear in general web text (regulatory codes, contract language, clinical terminology, internal product names).
When both signals appear, the options are fine-tuning a base model on domain pairs, or switching to a domain-specific model if one exists. For legal text, embeddings built on legal-bert often outperform general models. For code documentation, code-specific models such as the CodeBERT family keep precision on identifier-heavy queries that general models dilute; OpenAI's older code-search-babbage models served the same purpose but have since been deprecated.
Fine-tuning on 2,000 to 5,000 curated (query, relevant-chunk) pairs typically recovers 8 to 15 percentage points of recall versus an off-the-shelf model on a narrow domain. The cost is annotation time. The payoff is structural.
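The text does not say which training objective is used on those pairs. One common choice, assumed here, is an in-batch negatives loss (the idea behind `MultipleNegativesRankingLoss` in sentence-transformers): each query's paired chunk is the positive and every other chunk in the batch is a negative. The sketch below computes that loss on toy vectors in plain Python; the `scale` value and the vectors are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_negatives_loss(query_vecs, chunk_vecs, scale=20.0):
    """Mean cross-entropy where chunk i is the positive for query i
    and every other chunk in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * dot(q, c) for c in chunk_vecs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]

# Each query already sits next to its own chunk: loss is near zero.
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
# Each query sits next to the wrong chunk: loss is large, so training pulls pairs together.
misaligned = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, misaligned)
```

Minimizing this quantity over the curated pairs is what moves domain phrasing and its matching chunks closer together in the vector space.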
2. Language coverage
This is the variable that bit us on the fintech project. The corpus was primarily Vietnamese with some English headings. text-embedding-ada-002 was trained predominantly on English data. Queries in Vietnamese were often closer in the model's space to unrelated English chunks than to the correct Vietnamese document.
The practical rule: if more than 20% of your corpus or query distribution is in a non-English language, use a multilingual model. The models we reach for first:
- BGE-M3 (BAAI): supports over 100 languages, strong on Vietnamese, available via Hugging Face and as a managed API. On the MIRACL multilingual retrieval benchmark it outperforms multilingual-e5-large on Vietnamese and Thai by a noticeable margin.
- multilingual-e5-large (Microsoft): solid across European and East Asian languages, slightly smaller than BGE-M3, easier to self-host on a T4.
- text-embedding-3-large (OpenAI): improved multilingual support over ada-002 but still English-first. Adequate for light multilingual workloads but not the right tool when the corpus is predominantly non-English.
The temptation is to translate all queries to English at runtime and use a strong English model. We have tried this. The latency adds up (one extra LLM call per query), and translation introduces its own errors on domain-specific terminology. Multilingual models are the cleaner solution.
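The 20% rule is easy to check mechanically once each document and logged query carries a language tag. The sketch below assumes such tags already exist (from metadata or a language detector); the function names and the `"en"` tag convention are illustrative, not part of any library.

```python
from collections import Counter

def non_english_share(language_tags):
    """Fraction of items whose language tag is not 'en'."""
    counts = Counter(language_tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - counts.get("en", 0) / total

def needs_multilingual_model(corpus_tags, query_tags, threshold=0.20):
    """Apply the rule to whichever of corpus or queries is more non-English."""
    share = max(non_english_share(corpus_tags), non_english_share(query_tags))
    return share > threshold

# Hypothetical distributions for illustration.
vietnamese_heavy = needs_multilingual_model(["vi"] * 85 + ["en"] * 15, ["vi"] * 90 + ["en"] * 10)
mostly_english = needs_multilingual_model(["en"] * 100, ["en"] * 90 + ["vi"] * 10)
print(vietnamese_heavy, mostly_english)
```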
3. Embedding dimensions
Embedding dimensions control the resolution of the vector space. Higher dimensions mean finer-grained distinctions, which generally improves recall on semantically complex queries. The cost is memory and query latency, both of which scale linearly with dimension count.
Approximate numbers from our deployments:
| Dimensions | Memory per 1M chunks | p50 query latency (Qdrant) |
|---|---|---|
| 384 | ~1.5 GB | ~18ms |
| 768 | ~3 GB | ~35ms |
| 1536 | ~6 GB | ~70ms |
| 3072 | ~12 GB | ~140ms |
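The memory column follows from simple arithmetic, assuming float32 storage (4 bytes per dimension) and counting only the raw vectors. Index structures and payloads add overhead on top, so treat these as lower bounds.

```python
BYTES_PER_FLOAT32 = 4

def raw_vector_memory_gb(dimensions, num_chunks=1_000_000):
    """Raw float32 vector storage in GB (10^9 bytes), excluding index overhead."""
    return dimensions * BYTES_PER_FLOAT32 * num_chunks / 1e9

estimates = {d: raw_vector_memory_gb(d) for d in (384, 768, 1536, 3072)}
for dims, gb in estimates.items():
    print(f"{dims:>5} dims: {gb:.2f} GB per 1M chunks")
```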
For most production workloads under five million chunks, 768 dimensions is the sweet spot. The recall gain going from 768 to 1536 is real but small (typically 2 to 4 percentage points on our eval sets). The gain from 384 to 768 is larger (6 to 10 points) and worth the resource cost.
When a client has a large corpus and cost is a constraint, binary quantization bridges the gap. Converting 1536d float32 vectors to binary (one bit per dimension) compresses storage by 32x with under 5% recall loss when paired with a full-precision rerank on the top candidates. We have written about this separately — the short version is that binary quantization is the right tool for indexes above 10 million chunks, not for smaller workloads where the extra complexity is not worth it.
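A minimal sketch of the two-stage idea, assuming sign-based binarization (one bit per dimension, set when the value is positive) and a Hamming-distance shortlist that is then reranked with full-precision dot products. The vectors, ids, and `oversample` parameter are illustrative; a production vector database handles this internally rather than in Python.

```python
def binarize(vec):
    """Pack the sign of each dimension into one integer: bit i is 1 when vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, vectors, codes, k=1, oversample=3):
    q_code = binarize(query)
    # Stage 1: cheap shortlist by Hamming distance over the 1-bit codes.
    shortlist = sorted(codes, key=lambda cid: hamming(q_code, codes[cid]))[: k * oversample]
    # Stage 2: rerank only the shortlist with the original float vectors.
    return sorted(shortlist, key=lambda cid: dot(query, vectors[cid]), reverse=True)[:k]

# Toy 4-dimensional vectors (hypothetical values).
vectors = {
    "a": [0.9, 0.8, -0.1, -0.7],
    "b": [0.2, 0.1, -0.9, -0.3],
    "c": [-0.8, -0.6, 0.5, 0.9],
    "d": [0.7, -0.2, -0.4, -0.6],
}
codes = {cid: binarize(v) for cid, v in vectors.items()}
query = [1.0, 0.7, -0.2, -0.5]

best = search(query, vectors, codes, k=1, oversample=2)
print(best)
```

In this toy data "a" and "b" share the same binary code, so the Hamming stage cannot separate them; the full-precision rerank is what recovers the correct order, which is why the two stages are paired.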
How we benchmark before committing
The MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) is the starting point for shortlisting models, not the final answer. MTEB scores correlate well with general retrieval quality but correlate weakly with in-domain recall for specialized corpora.
Our evaluation process before committing to a model:
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import context_recall

    # Build a 100-question eval set: (question, ground_truth) pairs.
    # Run each candidate model through the same retrieval pipeline.
    # Compare context recall at k=5 across candidates.
    # eval_questions, ground_truths, and build_retrieval_pipeline are
    # project-specific and defined elsewhere.
    candidate_models = [
        "BAAI/bge-m3",
        "intfloat/multilingual-e5-large",
        "text-embedding-3-large",
    ]

    results = {}
    for model_name in candidate_models:
        pipeline = build_retrieval_pipeline(embedding_model=model_name)
        retrieved = pipeline.batch_retrieve(eval_questions, k=5)
        dataset = Dataset.from_list([
            {
                "question": q,
                "contexts": retrieved[i],
                "ground_truth": ground_truths[i],
            }
            for i, q in enumerate(eval_questions)
        ])
        score = evaluate(dataset, metrics=[context_recall])
        results[model_name] = score["context_recall"]

    # Pick the winner; the leaderboard rank may not match your corpus.
    best_model = max(results, key=results.get)
The eval set is 100 questions sampled from the actual production query distribution, not questions we write ourselves. For a new project where we do not have production queries yet, we seed the eval set by prompting GPT-4o to generate plausible questions from the corpus and then filtering for questions where we can manually verify the correct chunk.
The benchmark run takes two to three hours for a 100-question set on a 50,000-chunk corpus. We do it before the first deployment, and again whenever the corpus grows by more than 50% or the query distribution shifts measurably.
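When the eval set records the ids of the correct chunks, recall@k can also be computed directly, without an LLM judge. This is a minimal sketch under that assumption; the function names and example ids are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def mean_recall_at_k(runs, k=5):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per question."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Three hypothetical eval questions.
runs = [
    (["c1", "c2", "c3", "c4", "c5"], ["c3"]),        # hit inside the top 5
    (["c9", "c8", "c7", "c6", "c5", "c4"], ["c4"]),  # correct chunk ranked 6th: a miss at k=5
    (["c1", "c2"], ["c2", "c10"]),                   # one of two relevant chunks found
]
score = mean_recall_at_k(runs, k=5)
print(score)
```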
When switching models is not the answer
Embedding model swaps are not free. Re-indexing a 500,000-chunk corpus costs money (API calls) and time (hours to days depending on throughput). If the eval shows retrieval recall above 0.80 and the primary complaint is answer quality, the problem is downstream of retrieval and a model swap will not help.
The cases where switching models consistently helps:
- Recall@5 is below 0.75 on the eval set.
- The corpus has significant non-English content and the current model is English-first.
- The corpus is narrow-domain (legal, medical, financial instruments) and the current model is general-purpose.
- The failing queries are semantically clear but use domain terminology the model does not handle well.
The cases where it is not the answer:
- Faithfulness is low (the model is hallucinating from retrieved context — this is a generation problem, not retrieval).
- Context recall is reasonable but the chunks returned are too large or too small relative to the question (this is a chunking problem).
- Latency is the complaint (switching to a larger model makes this worse; binary quantization or a smaller model is the lever to pull).
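The two lists can be folded into a rough triage function. This is a sketch of the rules as stated above, using the 0.75 and 0.80 thresholds from the text; the complaint labels, return values, and the "inconclusive" fallback are assumptions added for illustration.

```python
def triage(recall_at_5, complaint, english_first_on_non_english=False):
    """Suggest which lever to pull first.

    complaint: "answer_quality", "latency", or "missed_documents" (hypothetical labels).
    """
    if complaint == "latency":
        # A larger embedding model makes latency worse, so look at cost levers.
        return "reduce_cost"
    if recall_at_5 < 0.75 or english_first_on_non_english:
        return "swap_model"
    if recall_at_5 > 0.80 and complaint == "answer_quality":
        # Retrieval is doing its job; the issue is generation or chunking.
        return "downstream"
    return "inconclusive"

print(triage(0.71, "missed_documents"))  # low recall
print(triage(0.87, "answer_quality"))    # healthy recall, poor answers
print(triage(0.90, "latency"))           # latency complaint
```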
How we approach this on every project
We standardize on three model slots across our RAG deployments:
English-only corpus, general domain: text-embedding-3-large at 1536d (shortened from the native 3072 with the API's dimensions parameter). Strong MTEB scores, reliable API, manageable cost.
Multilingual or Vietnamese-heavy corpus: BAAI/bge-m3 self-hosted on a GPU instance or via the BAAI API. The recall improvement over English-first models on Vietnamese content is not marginal — it is the difference between a working product and a broken one.
High-volume corpus where cost dominates: intfloat/multilingual-e5-large at 768d with binary quantization in Qdrant. Recall sits around 3 to 5 points below bge-m3 on our eval sets; the storage and query cost is less than a quarter.
The model swap is always paired with a before/after eval run using the same 100-question set. We have never shipped a model change without that number in hand. The one time we skipped the eval (deadline pressure, the "obvious" choice), recall dropped on a query class we had not covered, and we spent a sprint diagnosing a regression that the eval would have caught in two hours.
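One way to make the before/after comparison catch that kind of regression is to break recall down by query class rather than looking only at the aggregate. The sketch below assumes each eval question is tagged with a class; the class names, numbers, and tolerance are invented for illustration.

```python
def regressions(before, after, tolerance=0.02):
    """Return the query classes whose recall dropped by more than `tolerance`."""
    return sorted(
        cls for cls, old in before.items()
        if after.get(cls, 0.0) < old - tolerance
    )

# Hypothetical per-class recall@5 for the current and candidate models.
before = {"policy_lookup": 0.82, "fee_calculation": 0.74, "regulatory_code": 0.70}
after = {"policy_lookup": 0.90, "fee_calculation": 0.85, "regulatory_code": 0.61}

flagged = regressions(before, after)
print(flagged)
```

Here the aggregate improves while one class gets worse, which is the pattern an overall average hides.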
Pick your embedding model deliberately. Benchmark it on your domain. Revisit the choice when the corpus grows or the query distribution changes. The ceiling it sets is the ceiling the rest of your pipeline works within.








