The Query Problem We Kept Ignoring
A few months ago, we were running a RAG system for a fintech client — document corpus of loan product manuals, regulatory notices, and internal policy PDFs. The retrieval pipeline was solid: careful chunking, good embedding model, hybrid search with BM25 + dense vectors. But users kept getting irrelevant results for questions like "what happens to my interest rate if I miss a payment during the grace period?"
We ran diagnostics. The chunks were fine. The embeddings were fine. But when we embedded that raw query and ran a similarity search, we'd surface chunks that talked about payment schedules or grace periods in isolation — never the chunk that actually explained the conditional rate logic.
The problem wasn't the retrieval system. It was the query itself.
User questions are conversational, short, and often underspecified. Your document chunks are dense, precise, and assume domain context. The semantic gap between those two registers is where most retrieval failures live. Query transformation is how you close it.
We now apply three techniques depending on the situation: HyDE, step-back prompting, and sub-query decomposition. Here's how each one works and when we reach for it.
HyDE: Generate the Answer You Wish You Had
Hypothetical Document Embeddings (HyDE) flips the retrieval problem around. Instead of embedding the user's question and hoping it semantically matches your chunks, you ask an LLM to generate a hypothetical answer — a short passage that looks like it came from your corpus — and then embed that for retrieval.
The intuition is straightforward: the hypothetical answer will be written in the same register, vocabulary, and structure as your actual documents. When you embed it, you land much closer in vector space to the real answer chunks than the raw question would.
Here's what we use in production:
```python
def hyde_retrieve(query: str, retriever, llm, k: int = 5) -> list:
    # Step 1: generate a hypothetical answer
    hyde_prompt = f"""Write a short, factual paragraph that directly answers the following question,
as if you were extracting it from an official financial document.
Do not hedge or say you don't know. Write as if the document exists.
Question: {query}
Hypothetical answer:"""
    hypothetical_doc = llm.complete(hyde_prompt).text.strip()

    # Step 2: retrieve with the hypothetical answer, not the original query
    # (the retriever embeds whatever text it is given)
    results = retriever.retrieve(hypothetical_doc, top_k=k)
    return results
```
For our fintech client, this single change improved retrieval precision on complex policy questions by a measurable margin in our RAGAs evaluation runs. The hypothetical answer for the grace period question contained phrases like "outstanding balance", "penalty rate adjustment", and "consecutive missed payments" — exactly the vocabulary in the relevant chunks.
The obvious risk: if the LLM generates a hallucinated hypothetical that confidently describes a policy that doesn't exist, you'll retrieve adjacent-but-wrong chunks and confuse the generator downstream. We mitigate this by keeping the hypothetical short (one paragraph), and by always running reranking afterward to sanity-check relevance against the original query.
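The reranking guard can be sketched roughly as follows. This is an illustration rather than the production reranker: `rerank_against_query` and `token_overlap` are hypothetical names, chunks are assumed to be plain strings, and the toy lexical scorer stands in for whatever cross-encoder or reranking model you actually use. The point it shows is that scoring runs against the original query, never against the hypothetical answer.

```python
from typing import Callable, Sequence


def token_overlap(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query tokens that also appear in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)


def rerank_against_query(
    query: str,
    chunks: Sequence[str],
    score_fn: Callable[[str, str], float] = token_overlap,
    min_score: float = 0.0,
) -> list[str]:
    """Re-score HyDE results against the ORIGINAL query and drop weak matches."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    kept = [(score, chunk) for score, chunk in scored if score >= min_score]
    # Highest score first; ties keep their retrieval order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in kept]
```

Swapping `score_fn` for a real reranking model keeps the same shape: the hypothetical document only widens the candidate pool, and the original query decides what survives.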
Step-Back Prompting: Retrieve the Principle, Then Answer the Detail
Some questions fail retrieval not because they're too vague but because they're too specific. A question like "Can a sole proprietor in Hanoi with annual revenue under 500 million VND apply for the SME digital loan tier?" is so narrow that the exact answer probably isn't in any single chunk. But the underlying principle ("what are the eligibility criteria for the SME digital loan tier?") almost certainly is.
Step-back prompting works in three steps. First, you ask the LLM to abstract the question to its underlying concept or principle. Then you retrieve against that abstracted question. Finally, you pass both the original question and the retrieved context to the generator.
```python
def stepback_retrieve(query: str, retriever, llm, k: int = 5) -> tuple[str, list]:
    # Step 1: abstract the question
    stepback_prompt = f"""You are helping a retrieval system find relevant documents.
Given a specific user question, rephrase it as a more general, conceptual question
that captures the underlying topic or principle.
Specific question: {query}
General question:"""
    abstract_query = llm.complete(stepback_prompt).text.strip()

    # Step 2: retrieve against the abstracted question
    results = retriever.retrieve(abstract_query, top_k=k)
    return abstract_query, results


# At generation time, use both:
def generate_with_stepback(original_query, abstract_query, context_chunks, llm):
    # format_chunks joins the retrieved chunks into a single context string
    prompt = f"""Use the following context to answer the specific question below.
The context was retrieved using the general question: "{abstract_query}"
Context:
{format_chunks(context_chunks)}
Specific question: {original_query}
Answer:"""
    return llm.complete(prompt).text
```
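The generation step relies on `format_chunks`, whose body depends on what your retriever returns. A minimal sketch, assuming each chunk is either a plain string or an object exposing a `text` attribute (both are assumptions; adapt this to your retriever's actual return type):

```python
def format_chunks(chunks) -> str:
    """Join retrieved chunks into one numbered context string."""
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        # Assumption: a chunk is a str, or an object with a .text attribute.
        text = chunk if isinstance(chunk, str) else getattr(chunk, "text", str(chunk))
        lines.append(f"[{i}] {text.strip()}")
    return "\n\n".join(lines)
```

Numbering the chunks is a design choice, not a requirement: it makes it easier to ask the generator to cite which passage it relied on.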
We've found step-back particularly valuable for regulatory and compliance queries. Users ask hyper-specific questions; the documents contain general rules with worked examples. The abstraction step surfaces the rule, and the generator handles applying it to the specific case.
One thing to watch: the abstraction can sometimes over-generalize. "What's the late fee for credit card tier B?" might get abstracted to "what are the fees for credit products?" — retrieving too broad a set. We address this by tuning the step-back prompt to stay one level of abstraction above the original, not two.
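One way to encode "one level up, not two" is to say so explicitly in the prompt and include a worked example. The sketch below is illustrative: `build_stepback_prompt`, the instruction wording, and the few-shot pair are assumptions, not a tuned production prompt.

```python
def build_stepback_prompt(query: str) -> str:
    """Build a step-back prompt that asks for exactly one level of abstraction."""
    return (
        "You are helping a retrieval system find relevant documents.\n"
        "Rephrase the specific question as a slightly more general one.\n"
        "Keep the product or policy name; drop only case-specific details "
        "(amounts, locations, dates, customer attributes).\n\n"
        # Hypothetical few-shot pair showing the intended level of abstraction.
        "Example\n"
        "Specific question: What's the late fee for credit card tier B?\n"
        "General question: What are the late fees for credit card tiers?\n\n"
        f"Specific question: {query}\n"
        "General question:"
    )
```

The example pair does most of the work here: it shows the model that the product family stays in the question and only the case-specific detail is generalized.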
Sub-Query Decomposition: Divide and Retrieve
The third technique is the most powerful for multi-hop questions, and also the most expensive. When a user asks something like "compare the collateral requirements and processing times for the personal loan and the business expansion loan", that's not one question — it's at least four: collateral for loan A, collateral for loan B, processing time for loan A, processing time for loan B.
A single retrieval pass will surface chunks that partially address the question, and the generator will either miss dimensions or conflate information from different products. Decomposition breaks the compound question into atomic sub-queries, runs each retrieval independently, and then synthesizes all results into a coherent answer.
```python
import re


def decompose_and_retrieve(query: str, retriever, llm, k: int = 3) -> dict:
    # Step 1: decompose
    decompose_prompt = f"""Break the following question into simple, atomic sub-questions.
Each sub-question should be answerable from a single document passage.
Return a numbered list of sub-questions only.
Question: {query}
Sub-questions:"""
    raw_subqueries = llm.complete(decompose_prompt).text.strip()
    # Strip leading list markers such as "1. " or "2) " without touching
    # periods that appear inside the sub-question itself
    subqueries = [
        re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        for line in raw_subqueries.splitlines()
        if line.strip()
    ]

    # Step 2: retrieve for each sub-query
    results_by_subquery = {}
    for sq in subqueries:
        results_by_subquery[sq] = retriever.retrieve(sq, top_k=k)
    return results_by_subquery


def synthesize_from_subqueries(original_query, results_by_subquery, llm):
    # format_chunks joins the retrieved chunks into a single context string
    context_blocks = []
    for sq, chunks in results_by_subquery.items():
        context_blocks.append(f"--- Sub-question: {sq} ---\n{format_chunks(chunks)}")
    context = "\n".join(context_blocks)
    prompt = f"""Using the retrieved context below (organized by sub-question),
answer the original question comprehensively.
{context}
Original question: {original_query}
Answer:"""
    return llm.complete(prompt).text
```
The cost here is real: if a question decomposes into five sub-queries, you're running five retrieval passes plus two LLM calls (decomposition and synthesis) instead of one retrieval plus one generation. For our production systems, we gate this technique behind a complexity classifier — if the query contains comparison words, conjunctions linking distinct topics, or multiple entity references, we route it through decomposition. Simple factual questions get standard retrieval.
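A rule-based version of that gate might look like the sketch below. It is a stand-in, not the classifier itself: `needs_decomposition`, the cue lists, and the `known_entities` argument (for example, a list of product names from your catalog) are all hypothetical and would need tuning against real query logs.

```python
import re

# Hypothetical cue lists; tune them against your own query logs.
COMPARISON_CUES = re.compile(r"\b(compare|compared|versus|vs|difference between)\b", re.I)
CONJUNCTION_CUES = re.compile(r"\b(and|as well as|along with)\b", re.I)


def needs_decomposition(query: str, known_entities: list[str]) -> bool:
    """Return True when the query looks multi-part enough to justify decomposition."""
    lowered = query.lower()
    entity_hits = sum(1 for name in known_entities if name.lower() in lowered)
    has_comparison = bool(COMPARISON_CUES.search(query))
    has_conjunction = bool(CONJUNCTION_CUES.search(query))
    # Comparison wording or several named entities is a strong signal on its own;
    # a bare conjunction only counts when a known entity is also mentioned.
    return has_comparison or entity_hits >= 2 or (has_conjunction and entity_hits >= 1)
```

Because a false positive only costs latency while a false negative costs answer quality, it is reasonable to let a gate like this err toward routing borderline queries through decomposition.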
How We Combine These in Practice
These three techniques aren't mutually exclusive. Our current production router looks roughly like this:
| Query type | Technique |
|---|---|
| Short factual question, sparse vocabulary | HyDE |
| Highly specific, narrow question needing principles | Step-back |
| Multi-part, comparison, or multi-entity question | Sub-query decomposition |
| Ambiguous | HyDE + rerank |
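
The table above can be read as a dispatch map. The sketch below is one hedged way to express it; `QueryType`, `ROUTES`, and `route` are illustrative names, the stage labels are placeholders for real pipeline callables, and the step that assigns a query type is assumed to exist upstream.

```python
from enum import Enum


class QueryType(Enum):
    SHORT_FACTUAL = "short_factual"
    NARROW_SPECIFIC = "narrow_specific"
    MULTI_PART = "multi_part"
    AMBIGUOUS = "ambiguous"


# Mirrors the routing table; each list is the ordered set of stages to run.
ROUTES = {
    QueryType.SHORT_FACTUAL: ["hyde"],
    QueryType.NARROW_SPECIFIC: ["stepback"],
    QueryType.MULTI_PART: ["decompose"],
    QueryType.AMBIGUOUS: ["hyde", "rerank"],
}


def route(query_type: QueryType) -> list[str]:
    """Return the ordered pipeline stages for a classified query."""
    # Falling back to the ambiguous route is an assumption, not part of the table.
    return ROUTES.get(query_type, ROUTES[QueryType.AMBIGUOUS])
```

Keeping the mapping in data rather than in nested conditionals makes it straightforward to add a row or change a route without touching the dispatch code.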
We evaluate each technique using RAGAs — specifically context precision, context recall, and answer faithfulness — on a held-out question set before deploying any change. HyDE tends to help context recall most. Step-back tends to improve faithfulness on regulatory topics. Decomposition helps both precision and recall on complex queries but requires more careful latency budgeting.
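For a quick sanity check before a full evaluation run, a set-based approximation of retrieval precision and recall can be computed by hand. To be clear, this is not the RAGAs implementation, which relies on LLM judgments and rank-aware scoring; the sketch assumes a held-out set where each question has hand-labeled relevant chunk IDs, and `context_precision_recall` is a hypothetical helper name.

```python
def context_precision_recall(
    retrieved_ids: list[str], relevant_ids: set[str]
) -> tuple[float, float]:
    """Set-based precision and recall of retrieved chunk IDs against labeled relevant IDs."""
    retrieved = set(retrieved_ids)
    hits = len(retrieved & relevant_ids)
    # Precision: share of retrieved chunks that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant chunks that were retrieved.
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```

Running this per technique on the same question set gives a cheap first signal of whether a query transformation moved retrieval in the right direction before spending LLM calls on a full evaluation.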
The underlying lesson is one we now share with every team that asks us why their RAG system "isn't working": before you tune your chunks or swap your embedding model, look at what you're actually sending to retrieval. Garbage in, garbage out — but with RAG, even well-formed garbage in the wrong register will fail you. Transform the query to meet your corpus where it lives.
That's the habit we've built into every project from day one.








