The Decision We Got Wrong in Production
Last year our team was building a document Q&A system for a Vietnamese bank. The corpus was around 400,000 internal policy documents, regulatory circulars, and product manuals — all in Vietnamese with mixed formatting. We chose ChromaDB because it was quick to spin up locally and our prototype looked great in a Jupyter notebook.
Three weeks into staging, we hit the wall: query latency degraded past 800ms at 50 concurrent users, we had no clean path to horizontal scaling, and metadata filtering across department codes was running as a post-hoc Python loop over returned results. We refactored to Qdrant in two weeks. The lesson stuck.
Since then we have been systematic about vector database selection before any production RAG project starts. This post is our current decision framework, drawn from running these four databases across five different client engagements.
What Actually Matters in a Vector Database for RAG
Before benchmarking anything, we align on the retrieval characteristics the system actually needs:
- Query pattern: pure semantic search, hybrid (keyword + vector), or filtered vector search?
- Scale: how many vectors now, and in 18 months?
- Latency budget: is 50ms acceptable, or do we need sub-10ms p99?
- Ops capacity: does the team want managed infrastructure or are they comfortable running Kubernetes?
- Filtering complexity: is metadata filtering a first-class citizen or a nice-to-have?
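The latency budget question is easier to settle with measured percentiles than with averages. The following is a minimal sketch using the nearest-rank method; the function names `percentile` and `within_budget` are hypothetical helpers for illustration, not part of any database client.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile of latency samples given in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    # 1-based nearest rank; multiply before dividing to avoid float drift
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def within_budget(samples_ms, budget_ms, p=99):
    """True if the p-th percentile latency fits inside the budget."""
    return percentile(samples_ms, p) <= budget_ms


if __name__ == "__main__":
    samples = [5, 7, 9, 120]  # made-up measurements, in ms
    print(percentile(samples, 99), within_budget(samples, budget_ms=50))
```

A single slow outlier is enough to fail a p99 budget on a small sample, so it helps to collect samples at the concurrency level the system is expected to serve.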
With that framing, here is how each database lands.
ChromaDB — Prototyping Workhorse, Not a Production Engine
ChromaDB is the fastest path from idea to working retrieval. The API is minimal and the local-first design means zero infrastructure overhead during exploration.
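import chromadb

# In-memory client: nothing is persisted, which suits throwaway experiments
client = chromadb.Client()
collection = client.create_collection("policies")

# Store one document with a precomputed embedding and filterable metadata.
# The `...` stands in for the remaining embedding dimensions.
collection.add(
    documents=["Circular 13 defines capital adequacy ratios..."],
    embeddings=[[0.1, 0.2, ...]],
    ids=["doc-001"],
    metadatas=[{"department": "risk", "year": 2023}],
)

# Five nearest neighbours, restricted to the risk department
results = collection.query(
    query_embeddings=[[0.1, 0.2, ...]],
    n_results=5,
    where={"department": "risk"},
)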
We use ChromaDB on every project during the chunking and embedding experimentation phase. It lets us iterate on chunk size, overlap, and embedding model without any deployment friction.
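A chunking experiment of this kind can be sketched as a sliding window over the text. This is a minimal illustration that assumes character-based chunks; a real pipeline might count tokens or split on sentence boundaries instead, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "Circular 13 defines capital adequacy ratios for commercial banks."
    # Sweep a small grid of settings, as one would before re-embedding
    for size, overlap in [(20, 0), (20, 5), (40, 10)]:
        print(size, overlap, len(chunk_text(sample, size, overlap)))
```

Each setting in the grid produces a different set of chunks to embed and load into a fresh collection, which is where a zero-setup local database saves time.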
Where it falls short: ChromaDB's default HNSW index is held in memory, and the embedded mode runs inside a single process. At 500k+ vectors with concurrent queries, memory pressure becomes real. In our bank deployment, department filtering ended up running after retrieval rather than inside the index search, which is exactly the bottleneck we hit. We also found no clean path to horizontal scaling in the open-source version we ran. For anything beyond a single-node prototype or a low-traffic internal tool, you will outgrow it.
Our verdict: Use it until you need production SLAs. Then migrate.
Qdrant — Our Default for Filtered Vector Search
Qdrant has become our go-to for projects where metadata filtering is load-bearing. The key architectural difference is that Qdrant evaluates payload filters as part of the ANN search rather than on the results afterwards. This means filtering on department, document_type, or date_range does not inflate your result set and then cut it down: the ANN traversal is scoped from the start. It works best when the filtered fields have payload indexes.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Both conditions must hold; the filter travels with the vector query.
# Newer qdrant-client releases favour query_points for the same request.
results = client.search(
    collection_name="bank_policies",
    query_vector=[0.1, 0.2, ...],  # `...` stands in for the remaining dimensions
    query_filter=Filter(
        must=[
            FieldCondition(key="department", match=MatchValue(value="risk")),
            FieldCondition(key="year", match=MatchValue(value=2024)),
        ]
    ),
    limit=10,
)
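The difference between filtering after retrieval and filtering inside the search can be illustrated with a toy example. This sketch uses brute-force cosine similarity over four made-up points. It is not how Qdrant or ChromaDB are implemented internally, and the function names are hypothetical; it only shows the effect on the result set.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def post_filter_search(points, query, k, department):
    """Take the global top-k first, then drop points that fail the filter."""
    ranked = sorted(points, key=lambda p: cosine(p["vector"], query), reverse=True)
    return [p["id"] for p in ranked[:k] if p["department"] == department]


def pre_filter_search(points, query, k, department):
    """Restrict the candidates to matching points, then take the top-k."""
    candidates = [p for p in points if p["department"] == department]
    ranked = sorted(candidates, key=lambda p: cosine(p["vector"], query), reverse=True)
    return [p["id"] for p in ranked[:k]]


# Made-up data: the two closest points belong to the wrong department
POINTS = [
    {"id": "a", "vector": [1.0, 0.0], "department": "retail"},
    {"id": "b", "vector": [0.9, 0.1], "department": "retail"},
    {"id": "c", "vector": [0.8, 0.2], "department": "risk"},
    {"id": "d", "vector": [0.1, 0.9], "department": "risk"},
]

if __name__ == "__main__":
    query = [1.0, 0.0]
    print(post_filter_search(POINTS, query, k=2, department="risk"))
    print(pre_filter_search(POINTS, query, k=2, department="risk"))
```

With the post-filter variant, a selective filter can leave fewer than k results, or none, so callers tend to over-fetch and trim in application code. The filter-first variant always returns the best k matching points when that many exist.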
Qdrant also ships with binary quantization support out of the box, which we have used to cut vector memory footprint by roughly 32x on large corpora (one bit per dimension instead of a 32-bit float) with minimal recall loss. That was critical when running on constrained infrastructure for a fintech client who mandated on-premise deployment.
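The 32x figure follows from the storage arithmetic: a float32 component takes 32 bits and a binary-quantized component takes 1. Here is a minimal sketch of that calculation together with a toy sign-based binarization. The vector count and dimension are illustrative assumptions, and the helper names are hypothetical; this is not Qdrant's implementation.

```python
def vector_memory_bytes(num_vectors, dim, bits_per_dim):
    """Raw vector storage only; excludes the index graph and payloads."""
    return num_vectors * dim * bits_per_dim // 8


def binarize(vector):
    """Keep one bit per dimension: 1 for positive components, else 0."""
    return [1 if x > 0 else 0 for x in vector]


def hamming_distance(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Assumed sizes for illustration: 1M vectors of 1024 dimensions
    full = vector_memory_bytes(1_000_000, 1024, bits_per_dim=32)
    binary = vector_memory_bytes(1_000_000, 1024, bits_per_dim=1)
    print(full, binary, full // binary)
    print(hamming_distance(binarize([0.3, -0.1, 0.7]), binarize([0.2, 0.4, -0.5])))
```

The saving applies to the vectors held in memory for the first search pass. Setups that rescore candidates against the original vectors still need to keep those somewhere, typically on disk.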
Operationally, Qdrant runs as a single binary with a clean REST and gRPC API. Horizontal scaling via sharding and replication is available in the self-hosted version without a paid tier. It is written in Rust and has stayed fast under load in our deployments.
Where it falls short: the query DSL, while powerful, has a learning curve compared to ChromaDB's simplicity. Weaviate's built-in hybrid search is more polished if BM25 + vector is your primary retrieval pattern.
Our verdict: Default choice for production RAG with metadata filtering requirements. Strong fit for on-premise and regulated-industry deployments.
Weaviate — Best Built-in Hybrid Search
Weaviate takes a more opinionated, schema-driven approach. Every collection (called a "class" in older versions of the API) has a typed schema, and the database natively supports hybrid search, combining BM25 keyword matching with vector similarity in a single query without requiring a separate search layer.
import weaviate

client = weaviate.connect_to_local()
try:
    collection = client.collections.get("BankPolicy")
    results = collection.query.hybrid(
        query="capital adequacy ratio Basel III",
        alpha=0.5,  # 0 = pure BM25, 1 = pure vector
        limit=10,
    )
finally:
    client.close()  # release the connection when done
The alpha parameter gives us fine-grained control over the BM25/vector blend per query, which matters in domains like legal or regulatory text where exact terminology matching is as important as semantic similarity. We used this on a project for an insurance client where policy clauses had specific article numbers that needed exact matching — hybrid search handled this without bolting on a separate Elasticsearch cluster.
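To build intuition for what alpha does, here is a simplified sketch of a weighted blend over min-max normalized scores. Weaviate's own fusion algorithms differ in detail, so treat this as an illustration of the idea rather than its implementation. The function names and scores are made up.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_blend(bm25_scores, vector_scores, alpha):
    """Rank doc ids by alpha * vector + (1 - alpha) * BM25, best first."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    bm25 = min_max_normalize(bm25_scores)
    vec = min_max_normalize(vector_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * bm25.get(doc_id, 0.0)
        for doc_id in set(bm25) | set(vec)
    }
    # Sort by score descending, then by id so ties are deterministic
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Made-up scores: "article-12" matches the exact term, "overview" is closest semantically
BM25 = {"article-12": 9.0, "overview": 1.0}
VECTOR = {"article-12": 0.2, "overview": 0.8, "summary": 0.5}

if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, hybrid_blend(BM25, VECTOR, alpha))
```

Lower alpha values favour the document containing the exact article reference, while higher values favour the semantically closest one, which is the trade-off being tuned per query.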
Weaviate also has first-class multimodal support (text + image in the same collection) and integrates module-based vectorizers, so you can configure the embedding model at the schema level rather than managing it in application code.
Where it falls short: the schema requirement means more upfront design work. For fast-moving exploratory projects, this feels heavyweight. The GraphQL query interface, while powerful, is verbose compared to Qdrant's REST or Pinecone's SDK.
Our verdict: Strong choice when hybrid search is the primary retrieval pattern, or when you are building a knowledge base that mixes structured and unstructured data.
Pinecone — Managed Simplicity at a Price
Pinecone is the only option in this group that is offered purely as a managed service. There is no infrastructure to run, no index tuning, no replication configuration. You provision an index, upsert vectors, and query. For teams without dedicated MLOps capacity, that simplicity is real and valuable.
from pinecone import Pinecone

pc = Pinecone(api_key="your-key")
index = pc.Index("bank-policies")

index.upsert(vectors=[
    {"id": "doc-001", "values": [0.1, 0.2, ...], "metadata": {"department": "risk"}}
])

results = index.query(
    vector=[0.1, 0.2, ...],
    filter={"department": {"$eq": "risk"}},
    top_k=10,
    include_metadata=True
)
Pinecone's serverless tier has made entry-level costs much more predictable, but at scale — tens of millions of vectors with high query volume — the per-read-unit pricing adds up quickly. On one project for a regional e-commerce client, the monthly Pinecone bill at 60M vectors and 500k queries/day was significantly higher than running Qdrant on managed Kubernetes.
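Modeling the cost trajectory can start as simple arithmetic. The sketch below is a rough model in which every price and the read-units-per-query figure are placeholders, not Pinecone's actual rates. Real read-unit consumption depends on factors such as namespace size, so the inputs should come from the current pricing page and from measured usage.

```python
def monthly_read_cost(queries_per_day, read_units_per_query,
                      price_per_million_read_units, days=30):
    """Query cost per month under a per-read-unit pricing model."""
    read_units = queries_per_day * read_units_per_query * days
    return read_units / 1_000_000 * price_per_million_read_units


def monthly_storage_cost(num_vectors, dim, price_per_gb_month, bytes_per_value=4):
    """Storage cost per month for raw float32 vectors, ignoring metadata."""
    gigabytes = num_vectors * dim * bytes_per_value / 1e9
    return gigabytes * price_per_gb_month


if __name__ == "__main__":
    # PLACEHOLDER inputs for illustration only; substitute real quotes
    reads = monthly_read_cost(
        queries_per_day=500_000,
        read_units_per_query=10,           # placeholder, measure this
        price_per_million_read_units=2.0,  # placeholder price
    )
    storage = monthly_storage_cost(
        num_vectors=60_000_000,
        dim=1024,                          # assumed embedding dimension
        price_per_gb_month=0.5,            # placeholder price
    )
    print(round(reads, 2), round(storage, 2))
```

Running the same function with projected query volume and vector counts for the next 18 months gives a trajectory that can be compared against the cost of a self-hosted cluster.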
The namespace model is also a consideration for multi-tenant RAG: namespaces allow logical separation but they share underlying capacity, and the namespace-level isolation may not satisfy compliance requirements for regulated industries.
Where it falls short: cost at scale, limited control over indexing parameters, and data residency constraints for clients in regulated markets, where data must stay in specific cloud regions or on-premise.
Our verdict: Best for startups or teams that want zero infrastructure overhead and have predictable, moderate vector counts. Budget carefully before committing at scale.
How We Make the Decision on Every Project
After these engagements, our selection process takes about 30 minutes at the start of a project:
- Is metadata filtering load-bearing? If yes, Qdrant is the default. Its filter-aware search has held up best at scale in our projects.
- Is hybrid search (keyword + vector) a first-class requirement? If yes, Weaviate is the strongest built-in option.
- Does the team have zero ops capacity and a modest vector count? Pinecone removes infrastructure friction — just model the cost trajectory before signing up.
- Are we prototyping or running a low-traffic internal tool? ChromaDB. Fast to start, easy to swap later.
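The checklist above can be sketched as a small function. This assumes the questions are evaluated in the order listed, which is an interpretation rather than something the checklist states, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProjectRequirements:
    filtering_is_load_bearing: bool = False
    hybrid_search_is_primary: bool = False
    zero_ops_capacity: bool = False
    modest_vector_count: bool = False
    prototype_or_internal_tool: bool = False


def recommend_vector_db(req: ProjectRequirements) -> Optional[str]:
    """Walk the checklist top to bottom and return the first match."""
    if req.filtering_is_load_bearing:
        return "Qdrant"
    if req.hybrid_search_is_primary:
        return "Weaviate"
    if req.zero_ops_capacity and req.modest_vector_count:
        return "Pinecone"
    if req.prototype_or_internal_tool:
        return "ChromaDB"
    # No rule matched: the checklist gives no answer, so decide by benchmarking
    return None


if __name__ == "__main__":
    bank = ProjectRequirements(filtering_is_load_bearing=True)
    print(recommend_vector_db(bank))
```

Returning `None` when nothing matches is a deliberate choice in this sketch: it signals that the project needs a closer look instead of silently picking a default.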
One thing we always do regardless of choice: abstract the vector database behind a retrieval interface in the application code. We define a VectorRetriever protocol with upsert, search, and delete methods. When we migrated the bank project from ChromaDB to Qdrant, only the adapter implementation changed — the RAG pipeline, reranking logic, and LLM generation layer were untouched.
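An abstraction along these lines could look like the sketch below. Only the protocol name and its three method names come from the description above. The signatures, the `SearchHit` type, and the in-memory adapter are assumptions made for illustration; a real adapter would wrap a database client instead.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Protocol, Sequence


@dataclass
class SearchHit:
    id: str
    score: float
    metadata: Dict[str, Any]


class VectorRetriever(Protocol):
    """The interface the RAG pipeline depends on; adapters implement it."""

    def upsert(self, ids: Sequence[str], vectors: Sequence[Sequence[float]],
               metadatas: Sequence[Dict[str, Any]]) -> None: ...

    def search(self, vector: Sequence[float], limit: int,
               filters: Optional[Dict[str, Any]] = None) -> List[SearchHit]: ...

    def delete(self, ids: Sequence[str]) -> None: ...


class InMemoryRetriever:
    """Toy adapter using dot-product scoring, handy for unit tests."""

    def __init__(self) -> None:
        self._points: Dict[str, Any] = {}

    def upsert(self, ids, vectors, metadatas) -> None:
        for point_id, vector, metadata in zip(ids, vectors, metadatas):
            self._points[point_id] = (list(vector), dict(metadata))

    def search(self, vector, limit, filters=None) -> List[SearchHit]:
        hits = []
        for point_id, (stored, metadata) in self._points.items():
            # Skip points whose metadata does not match every filter key
            if filters and any(metadata.get(k) != v for k, v in filters.items()):
                continue
            score = sum(a * b for a, b in zip(vector, stored))
            hits.append(SearchHit(point_id, score, metadata))
        hits.sort(key=lambda hit: hit.score, reverse=True)
        return hits[:limit]

    def delete(self, ids) -> None:
        for point_id in ids:
            self._points.pop(point_id, None)


def retrieve_ids(retriever: VectorRetriever, query_vector, department: str) -> List[str]:
    """Pipeline code sees only the protocol, never a concrete client."""
    hits = retriever.search(query_vector, limit=5, filters={"department": department})
    return [hit.id for hit in hits]
```

Because `retrieve_ids` accepts anything that satisfies the protocol, swapping the backing database means writing a new adapter class and leaving the calling code alone.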
The vector database is infrastructure. Your retrieval strategy — hybrid search, metadata filtering, reranking, HyDE query expansion — is where the real quality gains live. Pick the database that removes friction for your specific query pattern, then focus your engineering time on the retrieval layer above it.








