SapotaCorp

RAG Query Transformation: HyDE, Step-Back Prompting, and Sub-Query Decomposition

When a fintech client's RAG chatbot kept returning irrelevant results for perfectly reasonable user questions, we traced the root cause not to poor chunking or weak embeddings but to the queries themselves. This post breaks down three query transformation techniques we now apply in production: HyDE, step-back prompting, and sub-query decomposition.

Key takeaways

  • Raw user queries are often too short or vague to match the dense, context-rich language inside your document chunks — transformation bridges that gap.
  • HyDE generates a hypothetical answer first, then uses that answer as the retrieval query, which tends to bring the query much closer to the language of the stored content.
  • Step-back prompting abstracts a specific question into its underlying principle, retrieving broader context before answering the narrow question.
  • Sub-query decomposition breaks complex multi-part questions into atomic retrievals, then synthesizes the results — preventing any single weak retrieval from collapsing the whole answer.
  • These three techniques stack well together and should be selected based on query type: factual gaps favor HyDE, concept-heavy questions favor step-back, and multi-hop questions demand decomposition.

The Query Problem We Kept Ignoring

A few months ago, we were running a RAG system for a fintech client whose document corpus consisted of loan product manuals, regulatory notices, and internal policy PDFs. The retrieval pipeline was solid: careful chunking, a good embedding model, and hybrid search combining BM25 with dense vectors. But users kept getting irrelevant results for questions like "what happens to my interest rate if I miss a payment during the grace period?"

We ran diagnostics. The chunks were fine. The embeddings were fine. But when we embedded that raw query and ran a similarity search, we'd surface chunks that talked about payment schedules or grace periods in isolation — never the chunk that actually explained the conditional rate logic.

The problem wasn't the retrieval system. It was the query itself.

User questions are conversational, short, and often underspecified. Your document chunks are dense, precise, and assume domain context. The semantic gap between those two registers is where most retrieval failures live. Query transformation is how you close it.

We now apply three techniques depending on the situation: HyDE, step-back prompting, and sub-query decomposition. Here's how each one works and when we reach for it.


HyDE: Generate the Answer You Wish You Had

Hypothetical Document Embeddings (HyDE) flips the retrieval problem around. Instead of embedding the user's question and hoping it semantically matches your chunks, you ask an LLM to generate a hypothetical answer — a short passage that looks like it came from your corpus — and then embed that for retrieval.

The intuition is straightforward: the hypothetical answer will be written in the same register, vocabulary, and structure as your actual documents. When you embed it, you end up in a much closer region of the vector space to the real answer chunks.

Here's what we use in production:

def hyde_retrieve(query: str, retriever, llm, k: int = 5) -> list:
    # Step 1: generate a hypothetical answer
    hyde_prompt = f"""Write a short, factual paragraph that directly answers the following question,
as if you were extracting it from an official financial document.
Do not hedge or say you don't know — write as if the document exists.

Question: {query}

Hypothetical answer:"""

    hypothetical_doc = llm.complete(hyde_prompt).text.strip()

    # Step 2: embed the hypothetical answer, not the original query
    results = retriever.retrieve(hypothetical_doc, top_k=k)
    return results

For our fintech client, this single change improved retrieval precision on complex policy questions in our RAGAs evaluation runs. The hypothetical answer for the grace period question contained phrases like "outstanding balance", "penalty rate adjustment", and "consecutive missed payments", which is exactly the vocabulary in the relevant chunks.

The obvious risk: if the LLM generates a hallucinated hypothetical that confidently describes a policy that doesn't exist, you'll retrieve adjacent-but-wrong chunks and confuse the generator downstream. We mitigate this by keeping the hypothetical short (one paragraph), and by always running reranking afterward to sanity-check relevance against the original query.
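
The rerank-after-HyDE mitigation can be sketched roughly as follows. This is a minimal illustration, not our production code: the function name and the `generate`, `retrieve`, and `score` callables are hypothetical stand-ins for your LLM call, your retriever, and whatever relevance scorer you use (a cross-encoder, for example), and chunks are assumed to be plain strings.

```python
from typing import Callable, Sequence


def hyde_retrieve_with_rerank(
    query: str,
    generate: Callable[[str], str],                 # hypothetical: prompt -> completion text
    retrieve: Callable[[str, int], Sequence[str]],  # hypothetical: (text, n) -> chunk texts
    score: Callable[[str, str], float],             # hypothetical: (query, chunk) -> relevance
    k: int = 5,
    candidates: int = 20,
) -> list[str]:
    """HyDE retrieval followed by a rerank against the ORIGINAL query."""
    prompt = (
        "Write one short, factual paragraph that answers the question below, "
        "as if quoting an official financial document.\n\n"
        f"Question: {query}\n\nHypothetical answer:"
    )
    hypothetical_doc = generate(prompt).strip()

    # Over-fetch using the hypothetical passage so the reranker has room to filter.
    pool = retrieve(hypothetical_doc, max(candidates, k))

    # Score every candidate against what the user actually asked, so chunks that
    # only match a hallucinated detail in the hypothetical passage sink.
    ranked = sorted(pool, key=lambda chunk: score(query, chunk), reverse=True)
    return list(ranked[:k])
```

Scoring against the original query rather than the hypothetical passage is the point of the design: the hypothetical text is only trusted for candidate generation, never for the final ordering.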


Step-Back Prompting: Retrieve the Principle, Then Answer the Detail

Some questions fail retrieval not because they're too vague but because they're too specific. A question like "Can a sole proprietor in Hanoi with annual revenue under 500 million VND apply for the SME digital loan tier?" is so narrow that the exact answer probably isn't in any single chunk. But the underlying principle, "what are the eligibility criteria for the SME digital loan tier?", almost certainly is.

Step-back prompting adds two steps before generation. First, you ask the LLM to abstract the question to its underlying concept or principle. Second, you retrieve against that abstracted question. At generation time, you pass both the original question and the retrieved context to the generator.

def stepback_retrieve(query: str, retriever, llm, k: int = 5) -> tuple[str, list]:
    # Step 1: abstract the question
    stepback_prompt = f"""You are helping a retrieval system find relevant documents.
Given a specific user question, rephrase it as a more general, conceptual question
that captures the underlying topic or principle.

Specific question: {query}

General question:"""

    abstract_query = llm.complete(stepback_prompt).text.strip()

    # Step 2: retrieve against the abstracted question
    results = retriever.retrieve(abstract_query, top_k=k)

    return abstract_query, results


# At generation time, use both.
# format_chunks is assumed to be a helper that renders retrieved chunks as one text block.
def generate_with_stepback(original_query, abstract_query, context_chunks, llm):
    prompt = f"""Use the following context to answer the specific question below.
The context was retrieved using the general question: "{abstract_query}"

Context:
{format_chunks(context_chunks)}

Specific question: {original_query}

Answer:"""
    return llm.complete(prompt).text

We've found step-back particularly valuable for regulatory and compliance queries. Users ask hyper-specific questions; the documents contain general rules with worked examples. The abstraction step surfaces the rule, and the generator handles applying it to the specific case.
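
The generation snippet above calls format_chunks without defining it. One possible version is sketched below. It assumes a chunk is either a plain string or an object exposing a `.text` attribute; the `max_chars` cap is a hypothetical parameter added here to keep prompts bounded, so adapt both to your retriever's actual return type.

```python
def format_chunks(chunks, max_chars: int = 1500) -> str:
    """Render retrieved chunks as numbered text blocks for use in a prompt."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        # Assumption: chunks are strings or objects with a .text attribute.
        text = chunk if isinstance(chunk, str) else getattr(chunk, "text", str(chunk))
        # Number each block so the generator can tell passages apart.
        blocks.append(f"[{i}] {text.strip()[:max_chars]}")
    return "\n\n".join(blocks)
```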

One thing to watch: the abstraction can sometimes over-generalize. "What's the late fee for credit card tier B?" might get abstracted to "what are the fees for credit products?" — retrieving too broad a set. We address this by tuning the step-back prompt to stay one level of abstraction above the original, not two.
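
A prompt tuned this way might look like the sketch below. The wording, the `STEPBACK_TEMPLATE` and `build_stepback_prompt` names, and the single worked example are illustrative assumptions, not our exact production prompt; the idea is simply to tell the model what to keep (the named product or tier) and what to drop (case-specific details).

```python
STEPBACK_TEMPLATE = """You are helping a retrieval system find relevant documents.
Rewrite the specific question as a question exactly ONE level more general.
Keep the product, tier, or regulation named in the original question.
Drop only case-specific details such as amounts, locations, or dates.

Example
Specific question: What's the late fee for credit card tier B if I pay 3 days late?
General question: What are the late fees for credit card tier B?

Specific question: {query}

General question:"""


def build_stepback_prompt(query: str) -> str:
    """Fill the one-level-up step-back template with the user's question."""
    return STEPBACK_TEMPLATE.format(query=query.strip())
```

Including one worked example in the prompt anchors the level of abstraction more reliably than the instruction alone, at the cost of a slightly longer prompt.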


Sub-Query Decomposition: Divide and Retrieve

The third technique is the most powerful for multi-hop questions, and also the most expensive. When a user asks something like "compare the collateral requirements and processing times for the personal loan and the business expansion loan", that's not one question — it's at least four: collateral for loan A, collateral for loan B, processing time for loan A, processing time for loan B.

A single retrieval pass will surface chunks that partially address the question, and the generator will either miss dimensions or conflate information from different products. Decomposition breaks the compound question into atomic sub-queries, runs each retrieval independently, and then synthesizes all results into a coherent answer.

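import re


def decompose_and_retrieve(query: str, retriever, llm, k: int = 3) -> dict:
    # Step 1: decompose
    decompose_prompt = f"""Break the following question into simple, atomic sub-questions.
Each sub-question should be answerable from a single document passage.
Return a numbered list of sub-questions only.

Question: {query}

Sub-questions:"""

    raw_subqueries = llm.complete(decompose_prompt).text.strip()
    # Strip list markers such as "1. ", "2) ", or "- ". Splitting on ". " alone
    # misses other markers and can truncate an unnumbered line that contains
    # a period mid-sentence. Duplicates are dropped to avoid repeat retrievals.
    subqueries = []
    for line in raw_subqueries.splitlines():
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned and cleaned not in subqueries:
            subqueries.append(cleaned)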

    # Step 2: retrieve for each sub-query
    results_by_subquery = {}
    for sq in subqueries:
        results_by_subquery[sq] = retriever.retrieve(sq, top_k=k)

    return results_by_subquery


def synthesize_from_subqueries(original_query, results_by_subquery, llm):
    # Build one labeled context block per sub-question so the generator
    # can keep the evidence for each part of the question separate.
    context_blocks = [
        f"--- Sub-question: {sq} ---\n{format_chunks(chunks)}"
        for sq, chunks in results_by_subquery.items()
    ]
    context = "\n".join(context_blocks)

    prompt = f"""Using the retrieved context below (organized by sub-question),
answer the original question comprehensively.

{context}

Original question: {original_query}

Answer:"""
    return llm.complete(prompt).text

The cost here is real: if a question decomposes into five sub-queries, you're running five retrieval passes plus two LLM calls (decomposition and synthesis) instead of one retrieval plus one generation. For our production systems, we gate this technique behind a complexity classifier — if the query contains comparison words, conjunctions linking distinct topics, or multiple entity references, we route it through decomposition. Simple factual questions get standard retrieval.
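
A crude rule-based version of that gate could look like the following. This is a sketch under stated assumptions, not the classifier we run: the patterns, the `needs_decomposition` name, and the `known_entities` list (for example, product names pulled from your catalog) are all hypothetical, and a trained classifier would likely replace the regexes in practice.

```python
import re

# Words that usually signal a comparison between two things.
COMPARISON_PATTERN = re.compile(
    r"\b(compare|comparison|versus|vs|difference between)\b", re.IGNORECASE
)
# Conjunctions that may link distinct topics inside one question.
CONJUNCTION_PATTERN = re.compile(r"\b(and|as well as|along with)\b", re.IGNORECASE)


def needs_decomposition(query: str, known_entities: list[str]) -> bool:
    """Return True when a query looks multi-part enough to justify decomposition."""
    text = query.lower()
    # Count how many known entities (e.g. product names) the query mentions.
    entity_hits = sum(1 for name in known_entities if name.lower() in text)
    has_comparison = bool(COMPARISON_PATTERN.search(query))
    has_conjunction = bool(CONJUNCTION_PATTERN.search(query))
    # Any one signal is enough: a comparison word, several entities, or a
    # conjunction in a question that names at least one entity.
    return has_comparison or entity_hits >= 2 or (has_conjunction and entity_hits >= 1)
```

Erring toward False keeps simple questions on the cheap path; a false negative costs some answer quality on one query, while a false positive multiplies retrieval and LLM cost.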


How We Combine These in Practice

These three techniques aren't mutually exclusive. Our current production router looks roughly like this:

Query type                                            | Technique
Short factual question, sparse vocabulary             | HyDE
Highly specific, narrow question needing principles   | Step-back
Multi-part, comparison, or multi-entity question      | Sub-query decomposition
Ambiguous                                             | HyDE + rerank
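
As a rough illustration, the table could translate into a routing function like the one below. Everything here is hypothetical: the `Technique` enum, the `route` function, the two boolean signals (which would come from upstream classifiers), the 12-word threshold for a "short" question, and the precedence order are assumptions made for the sketch, not a description of our actual router.

```python
from enum import Enum


class Technique(Enum):
    HYDE = "hyde"
    STEP_BACK = "step_back"
    DECOMPOSITION = "decomposition"
    HYDE_RERANK = "hyde_rerank"


def route(
    query: str,
    *,
    is_multi_part: bool,        # assumed to come from a complexity classifier
    is_highly_specific: bool,   # assumed to come from a specificity check
    short_query_max_words: int = 12,  # hypothetical threshold
) -> Technique:
    """Pick a query transformation based on the rows of the routing table."""
    if is_multi_part:
        return Technique.DECOMPOSITION
    if is_highly_specific:
        return Technique.STEP_BACK
    if len(query.split()) <= short_query_max_words:
        return Technique.HYDE
    # Nothing matched clearly: treat the query as ambiguous.
    return Technique.HYDE_RERANK
```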

We evaluate each technique using RAGAs — specifically context precision, context recall, and answer faithfulness — on a held-out question set before deploying any change. HyDE tends to help context recall most. Step-back tends to improve faithfulness on regulatory topics. Decomposition helps both precision and recall on complex queries but requires more careful latency budgeting.
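
One way to protect the latency budget for decomposition is to run the per-sub-query retrievals concurrently instead of in a loop. The sketch below assumes the retriever is I/O-bound and safe to call from multiple threads; the function name and the `retrieve` callable are hypothetical stand-ins for your own retriever.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def retrieve_subqueries_concurrently(
    subqueries: Sequence[str],
    retrieve: Callable[[str, int], list],  # hypothetical: (sub_query, k) -> chunks
    k: int = 3,
    max_workers: int = 4,
) -> dict:
    """Run one retrieval per unique sub-query in parallel threads."""
    # Drop blanks and duplicates while preserving the original order.
    unique = list(dict.fromkeys(sq.strip() for sq in subqueries if sq.strip()))
    if not unique:
        return {}

    with ThreadPoolExecutor(max_workers=min(max_workers, len(unique))) as pool:
        # pool.map yields results in the same order as its inputs.
        results = list(pool.map(lambda sq: retrieve(sq, k), unique))

    return dict(zip(unique, results))
```

With this shape, retrieval time for a decomposed question is closer to the slowest single retrieval than to the sum of all of them, though the two LLM calls still run sequentially.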

The underlying lesson is one we now share with every team that asks us why their RAG system "isn't working": before you tune your chunks or swap your embedding model, look at what you're actually sending to retrieval. Garbage in, garbage out — but with RAG, even well-formed garbage in the wrong register will fail you. Transform the query to meet your corpus where it lives.

That's the habit we've built into every project from day one.
