A SaaS founder asked us to second-opinion a roadmap his CTO had come back with after an agentic RAG vendor pitch. The plan was eighteen tools, a planner LLM, an executor LLM, a critic LLM, and a synthesis LLM, with autonomous multi-step reasoning over their support and CRM data. The vendor's slides showed accuracy lifts from 67% to 89%. The implied infrastructure bill was around 5x their current model spend.
The product was a real-time chat assistant for their internal sales team. The latency budget was three seconds end-to-end. The team had not measured what fraction of production queries actually required multi-step reasoning.
We told the founder to ship a router first and the agent second, only on the queries that needed it. That conversation cut the projected agentic budget by roughly 80% while preserving most of the accuracy lift. Here is the framework.
What agentic RAG actually does
Naive RAG is a single retrieval followed by a single generation: embed the query, fetch the top-k, prompt the LLM, return the answer. Total cost is roughly two model calls: one embedding call and one LLM generation. Total latency is the sum of the embedding, vector search, and generation.
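That whole pipeline fits in a few lines. A minimal sketch, where `embed`, `vector_search`, and `generate` are hypothetical stand-ins for your embedding model, vector store, and LLM client:

```python
def naive_rag(query: str, embed, vector_search, generate, k: int = 5) -> str:
    """One embedding call, one vector lookup, one LLM call. That's the whole pipeline."""
    query_vec = embed(query)                # embedding-model call
    chunks = vector_search(query_vec, k=k)  # vector-store lookup (not an LLM call)
    context = "\n\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                 # the single LLM generation
```

Everything agentic RAG adds is layered on top of this one function.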
Agentic RAG breaks the query into sub-tasks and runs the retrieve-and-reason loop multiple times. A typical pattern:
- Planner breaks the query into sub-questions.
- Retrieval runs once per sub-question.
- Reasoner synthesizes intermediate results.
- Executor decides whether to retrieve more, run a calculation, or call a tool.
- Critic checks the answer against the question and triggers a revision if needed.
- Synthesizer combines everything into the final response.
Each step is an LLM call. A typical agentic loop runs 3 to 8 LLM calls per query. Some patterns (ReAct with reflection, multi-agent collaboration) run 10 to 20.
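The call count is easiest to see in code. A sketch of the loop above, where `plan`, `retrieve`, `reason`, `critique`, and `synthesize` are hypothetical LLM-backed callables (retrieval itself is not an LLM call); the counter shows where the 3 to 8 calls per query come from:

```python
def agentic_rag(query, plan, retrieve, reason, critique, synthesize,
                max_revisions=1):
    calls = 0
    subqs = plan(query); calls += 1              # planner: 1 call
    notes = []
    for sq in subqs:
        docs = retrieve(sq)                      # retrieval: no LLM call
        notes.append(reason(sq, docs)); calls += 1  # reasoner: 1 call per sub-question
    answer = synthesize(query, notes); calls += 1   # synthesizer: 1 call
    for _ in range(max_revisions):
        ok = critique(query, answer); calls += 1    # critic: 1 call
        if ok:
            break
        answer = synthesize(query, notes); calls += 1  # revision: 1 more call
    return answer, calls
```

With three sub-questions and a critic that passes on the first check, that is six LLM calls for a query the naive pipeline would answer in one.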
The accuracy lift is real. On genuinely multi-step queries, agentic RAG outperforms naive RAG by 20 to 40 percentage points in published benchmarks and in the audits we have run. The cost lift is also real: 5x to 15x the LLM spend per query, plus 3x to 10x the latency.
The fraction-of-queries question
The question the vendor pitch never opens with: what fraction of production queries actually require multi-step reasoning?
For most B2B SaaS products we have audited, the breakdown looks roughly like this:
- 65 to 75% of queries are single-hop lookups. "What is the refund policy?" "How do I export reports?" "What does error code 502 mean?" These are textbook RAG questions. A naive pipeline answers them correctly at 85% recall and a fraction of the agentic cost.
- 15 to 25% of queries are multi-step. "Which enterprise customers complained about pricing in Q4?" "Compare the deployment guides for AWS and Azure and tell me which is faster to set up." These benefit from agentic patterns. The accuracy lift is real and the latency hit is acceptable because the queries are inherently slower-thinking.
- 5 to 10% of queries are not RAG questions at all. They are SQL queries disguised as natural language ("total MRR by region last quarter") or transactional requests ("create a ticket for me"). These should be routed to a dedicated agent or directly to the database, not to a RAG pipeline.
Running an agentic pipeline on the 65% to 75% of queries that are single-hop is paying 10x cost for no accuracy lift. The agent loop just does more steps to arrive at the same answer the naive pipeline produces.
What we ship instead
Sapota's default for any production RAG product where agentic capabilities are on the roadmap:
Step 1: build the naive RAG pipeline first. Hybrid search, reranker, faithfulness gate, evaluation. This is the floor. Everything else is layered on top.
Step 2: add a query router. A small fast LLM (Llama 3.1 8B is plenty) classifies each incoming query as single-hop, multi-hop, or structured. This costs one extra LLM call per query, around 50ms of latency. The router accuracy needs to be measured and tracked, because misroutes are expensive in both directions.
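The router itself is small. A sketch, where `small_llm` is a hypothetical client for the small fast model; the prompt and label set are our assumptions, not any vendor's API:

```python
ROUTER_PROMPT = """Classify the user query into exactly one label:
single_hop | multi_hop | structured
Query: {query}
Label:"""

VALID = {"single_hop", "multi_hop", "structured"}

def route(query: str, small_llm) -> str:
    label = small_llm(ROUTER_PROMPT.format(query=query)).strip().lower()
    # Misroutes are expensive in both directions. On unparseable output,
    # fall back to the cheap branch and count it as a misroute in your metrics.
    return label if label in VALID else "single_hop"
```

The fallback choice matters: defaulting to the naive branch means a confused router costs you accuracy on one query, not 10x spend on it.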
Step 3: dispatch by class. Single-hop queries go to the naive pipeline. Multi-hop queries go to the agentic pipeline. Structured queries go to a SQL agent or whatever non-RAG tool fits.
Step 4: build the agentic pipeline only for the queries that need it. This is where the vendor pitch starts. The point is that the agent is not the entire product. It is one branch the router dispatches to.
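Wired together, steps 3 and 4 amount to a dispatch table. The three handlers here are hypothetical; in the architecture above they are the naive pipeline, the agentic pipeline, and the SQL agent:

```python
def dispatch(query, label, naive_pipeline, agent_pipeline, sql_agent):
    handlers = {
        "single_hop": naive_pipeline,  # ~70% of traffic, cheapest path
        "multi_hop": agent_pipeline,   # ~25%, the only branch that pays agent cost
        "structured": sql_agent,       # ~5%, not a RAG question at all
    }
    return handlers[label](query)
```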
The cost math: if 70% of queries hit the naive pipeline at $0.005 per query and 25% hit the agent at $0.05 per query, the blended average is $0.016 per query. If everything hits the agent, it is $0.05. The router pays for itself many times over.
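The arithmetic, written out (prices and traffic shares are the illustrative numbers above, not measured figures; the 5% structured branch is dispatched outside the RAG pipelines and excluded here, as in the figures above):

```python
def blended_cost(mix):
    # mix: list of (fraction_of_traffic, cost_per_query_usd) pairs
    return sum(frac * cost for frac, cost in mix)

routed = blended_cost([(0.70, 0.005), (0.25, 0.05)])  # ~$0.016 per query
all_agent = blended_cost([(1.0, 0.05)])               # $0.05 per query
```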
When the agent is genuinely the product
Some products are built around multi-step reasoning by design. The agent is not an upgrade, it is the core value proposition. Cases we have seen and recommend full-agent for:
- Research assistants that take a complex question and produce a structured report (think: AI consultant for due diligence). The user expects 30-second responses. Multi-step reasoning is the value.
- Code agents that read a codebase, plan a refactor, and execute it across multiple files. The agent's planning and tool-use capabilities are the entire product.
- Customer support agents that need to read account history, query backend systems, generate responses, and escalate when uncertain. The decision-making is the work.
For these products, the question is not whether to use agents, but how to keep the agent loop efficient (fewer steps, smarter tools, better caching).
For everything else (most B2B SaaS RAG products, most internal knowledge assistants, most customer-facing chatbots), the router-then-agent pattern is the right architecture.
The latency conversation
Agentic RAG is slow. A 5-step loop with 2-second LLM calls is a 10-second response. For batch use cases (overnight report generation, async customer email triage), this is fine. For real-time chat, it is the end of the user experience.
Mitigation patterns we use:
- Streaming intermediate steps to the user so the perceived latency is the time to first token, not the time to final answer. Users will wait if they see progress.
- Smaller models in the loop. The planner and the critic do not need GPT-4. Llama 3.1 8B or Haiku is enough. Reserve the expensive model for the final synthesis.
- Aggressive caching of intermediate results. Sub-questions that recur across queries (which they do, in narrow domains) hit a cache instead of re-running.
- Concurrent tool calls where the steps are independent. Most agent frameworks support parallel tool dispatch. Many implementations do not use it.
- Hard step limits. Cap the loop at 5 steps. If the agent has not converged by then, return a partial answer with an "I could not fully answer this; here is what I found" caveat. Better than a 30-second timeout.
These are the techniques that make agentic patterns viable for products with sub-10-second latency budgets.
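Two of those mitigations, concurrent tool calls and the hard step cap, sketched together. `tools` is a hypothetical mapping of tool names to async callables, and `step_fn` stands in for one iteration of your agent loop:

```python
import asyncio

async def run_independent_tools(tools, calls):
    # calls: (tool_name, argument) pairs with no data dependencies between them.
    # Dispatching concurrently makes latency max(tool) instead of sum(tool).
    tasks = [tools[name](arg) for name, arg in calls]
    return await asyncio.gather(*tasks)

async def capped_loop(step_fn, max_steps=5):
    findings = []
    for _ in range(max_steps):
        result, done = await step_fn(findings)
        findings.append(result)
        if done:
            return findings, True
    # Hit the cap: return what we have with a caveat, not a 30-second timeout.
    return findings, False
```

The `False` flag is what triggers the "I could not fully answer this; here is what I found" response.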
What we sent the founder
The recommendation was three lines:
- Ship the naive pipeline first. Spend two weeks getting it to 85% recall on the eval set.
- Add the router as the second sprint. Measure misroute rate; aim for under 5%.
- Build the agent only for the multi-hop branch. Start with a 3-step ReAct loop. Add reflection only if the eval shows it is worth the cost.
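The misroute measurement from the second item is a one-liner once you have labels. `labeled` is a hypothetical hand-labeled sample of production queries as (query, true_class) pairs:

```python
def misroute_rate(labeled, router):
    """Fraction of labeled queries the router classifies incorrectly. Aim for under 0.05."""
    wrong = sum(1 for query, true_class in labeled if router(query) != true_class)
    return wrong / len(labeled)
```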
The vendor's eighteen-tool plan was a v3 conversation, not a v1. By the time the team has the routing and the agent for the multi-hop branch working, they will know which tools they actually need. Most of the eighteen will not survive contact with the production query log.
The founder's CTO sent the vendor a polite "let's revisit this in six months" email.
If a vendor is pitching agents at you
If your team is being told that agentic RAG is the next quarter's roadmap, the question to ask before signing is what fraction of your production queries actually need it. If the vendor cannot answer that with your data, they are pitching a product, not a solution.
Sapota offers a one-week query distribution audit that takes your production query log, classifies the queries by complexity, and produces a sized recommendation for which fraction needs agents and which is better served by naive RAG with a router. The output is a budget-and-architecture document the team can take to the vendor, the board, or both.
Reach out via the AI engineering page with your current RAG architecture and approximate query volume. The diagnostic conversation almost always reframes the question from "agents or not" to "how much of the product is agents and what is the rest."