The four building blocks of a production agent (and which one teams skip)

Most teams that "build an agent" build a model with tools bolted on and call it done. Then it fails in production and nobody can say why. A production agent is four distinct parts, and the part teams skip is always the same one. Here is the anatomy, and what each piece is actually responsible for.

Published Jun 09, 2026

Key takeaways

A production agent is four distinct building blocks: the model (reasoning), tools (actions), the orchestration loop (control), and the context layer (what the model sees on each step). Most teams build the model and tools, then wonder why the thing is unreliable.
The orchestration loop is where reliability lives. It decides when to call a tool, when to stop, what to do on error, and how many iterations are allowed. Skipping it means the model improvises all of that, differently every run.
The context layer is the most-skipped block. What you put in front of the model on each step, and what you leave out, determines accuracy more than the model choice does. A weaker model with disciplined context beats a stronger model drowning in irrelevant history.
Adding more tools is the most common false fix. When an agent underperforms, teams add tools, but unreliability almost always traces to a missing loop or a sloppy context layer, not a missing capability.

We get a particular kind of message a few times a month. A team has built an agent, it works in the demo, and it falls apart the moment real users touch it. They want to know what they did wrong. When we ask them to describe the architecture, the answer is almost always the same: "It's GPT with a few tools." That sentence is the problem. It describes two of the four parts of an agent and leaves out the two that determine whether it survives production.

An agent is not a model with tools attached. It is four distinct building blocks, each with a separate job. When teams skip a block, the model silently takes over that block's responsibility and does it badly, because improvising control flow and curating context are not what a language model is good at. Here is the full anatomy, and the one block that is missing almost every time.

Block 1: the model

The model is the reasoning engine. It reads the current situation and decides what to do next: call a tool, ask a question, or produce a final answer. This is the part everyone gets right because it is the part you cannot skip. You pick a model, you write a prompt, you get reasoning.

The trap here is over-indexing on it. Teams believe that if the agent is unreliable, they need a bigger model. Sometimes that helps at the margin. But model choice is rarely the bottleneck in a failing agent. We have watched teams jump to the most expensive model available and see the failure rate barely move, because the actual problem was in a block they had not built yet.

A good rule: the model is responsible for judgement, and only judgement. If your agent is failing at something that is not a judgement call (skipping steps, looping, losing track of state), a bigger model will not fix it, because the failure is not a reasoning failure.

Block 2: tools

Tools are how the agent acts on the world. A database query, an API call, a search, a calculation, a write to a ticketing system. Each tool is a function the model can choose to invoke, with a name, a description, and a schema for its inputs.
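A minimal sketch of what "a name, a description, and a schema for its inputs" can look like in code. The `Tool` class, the `lookup_ticket` example, and the simplified name-to-type schema are all assumptions made for illustration; real agent frameworks usually express the schema as JSON Schema instead.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    """One action the model may choose to invoke (hypothetical shape)."""
    name: str
    description: str
    input_schema: dict[str, type]  # parameter name -> expected Python type
    handler: Callable[..., Any]

    def invoke(self, **kwargs: Any) -> Any:
        # Validate the model's arguments before touching the outside world.
        missing = set(self.input_schema) - set(kwargs)
        unexpected = set(kwargs) - set(self.input_schema)
        if missing or unexpected:
            raise ValueError(
                f"{self.name}: missing={sorted(missing)} "
                f"unexpected={sorted(unexpected)}"
            )
        for key, expected in self.input_schema.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{self.name}: {key} must be {expected.__name__}")
        return self.handler(**kwargs)


def _lookup_ticket(ticket_id: int) -> dict[str, Any]:
    # Stand-in for a real ticketing-system call; returns canned data.
    return {"id": ticket_id, "status": "open"}


lookup_ticket = Tool(
    name="lookup_ticket",
    description="Fetch one support ticket by its numeric id.",
    input_schema={"ticket_id": int},
    handler=_lookup_ticket,
)

print(lookup_ticket.invoke(ticket_id=7))  # {'id': 7, 'status': 'open'}
```

Validating inputs inside `invoke` means a malformed call from the model fails loudly at the tool boundary instead of reaching the underlying system.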

Teams also generally get tools right, at least mechanically. Where they go wrong is quantity. The instinct, when an agent underperforms, is to add more tools. More capability should mean better results. It usually means worse ones, because every tool you add is another choice the model has to get right on every step, and another description competing for its attention.

We did an audit last quarter on an agent with nineteen tools. It was choosing the wrong one constantly. We cut it to six by merging overlapping tools and deleting ones that were used in under one percent of runs. Accuracy went up, not down, because the model's decision at each step got dramatically simpler. Tools are a capability, but each one is also a tax on every decision the agent makes.
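The "used in under one percent of runs" check is easy to approximate if you log which tools each run invoked. The sketch below assumes a hypothetical log format (one list of tool names per run) and hypothetical tool names; it is not the audit tooling described above, only one way to get the same kind of signal.

```python
from collections import Counter
from typing import Iterable


def rarely_used_tools(
    runs: Iterable[list[str]],
    all_tools: set[str],
    threshold: float = 0.01,
) -> list[str]:
    """Return tools invoked in fewer than `threshold` of runs, sorted by name."""
    runs = list(runs)
    if not runs:
        # With no data, every tool is unproven.
        return sorted(all_tools)
    runs_using = Counter()
    for tool_calls in runs:
        # Count each tool once per run, however many times it was called.
        runs_using.update(set(tool_calls))
    return sorted(
        tool for tool in all_tools if runs_using[tool] / len(runs) < threshold
    )


# Hypothetical log: 201 runs, "refund" appears in one, "export_csv" in none.
runs = [["search"], ["search", "lookup_ticket"]] * 100 + [["refund"]]
tools = {"search", "lookup_ticket", "refund", "export_csv"}
print(rarely_used_tools(runs, tools))  # ['export_csv', 'refund']
```

A low usage rate is a prompt to look at the tool, not an automatic verdict: a rarely used tool that handles a critical edge case may still earn its place.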

Block 3: the orchestration loop

This is the first block teams skip, and it is where reliability actually lives.

The orchestration loop is the control structure around the model. It decides: when does the agent call a tool versus answer? How many iterations is it allowed before we stop? What happens when a tool returns an error: retry, skip, escalate, abort? When is the task actually done? In what order do steps run, and can any run in parallel?

When you do not build this block, the model improvises all of it. And the model improvises it differently on every run, because that is what sampling from a probability distribution does. This is the source of the "works four times out of five" behavior we see constantly: there is no loop enforcing the structure, so the structure is whatever the model felt like that time.

A real orchestration loop is mostly code, not prompt. It is the difference between hoping the agent stops at the right time and writing a stopping condition. We wrote a whole piece on the most common version of this mistake (letting an agent improvise a process whose steps never change) because it is the single most frequent architectural error we see. The loop is also where the choice between ReAct and Planning actually gets made: those are two different shapes of orchestration loop, not two different models.
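To make "mostly code, not prompt" concrete, here is a minimal sketch of a loop with an explicit stopping condition, an iteration limit, and a retry-then-abort error policy. Everything named here is an assumption for illustration: `decide` stands in for the model call, `Step` is a hypothetical decision type, and retry-then-abort is just one of the possible error policies (skip and escalate would be others).

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    """What the model decided: call a tool, or finish with an answer."""
    kind: str  # "tool" or "final"
    tool: str = ""
    args: dict[str, Any] = field(default_factory=dict)
    answer: str = ""


class AgentAborted(Exception):
    """Raised when the loop, not the model, decides the run is over."""


def run_agent(
    decide: Callable[[list[str]], Step],
    tools: dict[str, Callable[..., Any]],
    max_iterations: int = 8,
    max_retries: int = 1,
) -> str:
    observations: list[str] = []
    for _ in range(max_iterations):  # iteration limit lives in code
        step = decide(observations)
        if step.kind == "final":  # explicit stopping condition
            return step.answer
        if step.tool not in tools:
            raise AgentAborted(f"unknown tool: {step.tool}")
        for attempt in range(max_retries + 1):
            try:
                result = tools[step.tool](**step.args)
                observations.append(f"{step.tool} -> {result}")
                break
            except Exception as exc:
                # Error policy: retry up to max_retries, then abort the run.
                if attempt == max_retries:
                    raise AgentAborted(f"{step.tool} failed: {exc}") from exc
    raise AgentAborted(f"no final answer after {max_iterations} iterations")


def scripted_model(observations: list[str]) -> Step:
    # Stand-in for a real model call: one tool call, then a final answer.
    if not observations:
        return Step(kind="tool", tool="add", args={"a": 2, "b": 3})
    return Step(kind="final", answer=observations[-1])


print(run_agent(scripted_model, {"add": lambda a, b: a + b}))  # add -> 5
```

The model still makes every judgement call inside `decide`, but when to stop, how often to retry, and what counts as failure are the same on every run.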

Block 4: the context layer

This is the block teams skip most often and understand least. It is also, in our experience, the one that determines accuracy more than any other.

The context layer is everything the model sees on a given step: the system prompt, the relevant history, the tool outputs so far, the retrieved knowledge, the current goal. It is not "the conversation." It is a deliberate, curated view assembled fresh for each model call, containing what the model needs to make this decision and, crucially, leaving out what it does not.

The reason this matters so much: a language model attends to everything in its context, including the irrelevant parts. Dump the entire conversation history and every tool output into every call, and the model drowns. The signal it needs for the current step is buried under ten steps of stale observations. Accuracy drops, latency rises, cost rises, and the failure looks random because it depends on exactly how much junk happened to accumulate.

We have repeatedly seen a weaker, cheaper model with a disciplined context layer outperform a top-tier model fed an undisciplined firehose. The skill is not "give the model more information." It is "give the model exactly the right information for this step and nothing else." That discipline is a real engineering layer with real decisions: what to summarize, what to drop, what to retrieve, what to keep verbatim. Teams that skip it are letting the context grow by accident and then blaming the model for the result. It is closely tied to how you handle memory in a long-running agent, which is really the context layer extended across time.
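One way to picture a context layer is as a function that builds the model's view from scratch on every call. The sketch below is illustrative only: `Observation`, `build_context`, and the parameter names are hypothetical, and plain truncation stands in for real summarization, which in practice would be its own design decision.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    step: int
    tool: str
    output: str


def build_context(
    system_prompt: str,
    goal: str,
    observations: list[Observation],
    retrieved: list[str],
    keep_verbatim: int = 2,
    summary_chars: int = 60,
) -> str:
    """Assemble a fresh, curated view for one model call."""
    if keep_verbatim > 0:
        older = observations[:-keep_verbatim]
        recent = observations[-keep_verbatim:]
    else:
        older, recent = observations, []
    lines = [system_prompt, f"Goal: {goal}"]
    if older:
        # Stale observations are cut down instead of replayed in full.
        lines.append("Earlier steps (truncated):")
        lines += [
            f"- step {o.step} {o.tool}: {o.output[:summary_chars]}" for o in older
        ]
    if recent:
        lines.append("Recent tool output (verbatim):")
        lines += [f"- step {o.step} {o.tool}: {o.output}" for o in recent]
    if retrieved:
        lines.append("Retrieved knowledge:")
        lines += [f"- {doc}" for doc in retrieved]
    return "\n".join(lines)


# Hypothetical run: four long tool outputs, only the last two kept in full.
history = [Observation(i, "search", "x" * 200) for i in range(1, 5)]
context = build_context(
    "You are a support agent.",
    "Find the refund policy",
    history,
    ["Refunds are accepted within 30 days."],
)
print(len(context))
```

Because the view is rebuilt on each step, its size stays roughly bounded no matter how long the run gets, and what the model saw at any step can be logged and inspected.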

How to tell which block is broken

When an agent misbehaves, the symptom usually points straight at the missing block:

Wrong final answers on clear questions → model or context layer. Check what the model actually saw before blaming the model.
Skips steps, loops, hits the iteration limit, "works sometimes" → orchestration loop. Nothing in code enforces the sequence, so the model improvises it on every run.
Picks the wrong tool, calls tools it should not → too many tools, or overlapping tool descriptions. Block 2.
Degrades as the conversation gets longer → context layer. You are accumulating junk and feeding it back in.
Cannot reproduce the failure → almost always the loop. Deterministic structure makes failures reproducible.

Notice how few of these are fixed by a bigger model. Most failures live in the two blocks teams do not build.

The order to build them in

If we were starting an agent from scratch with a team today, we would build the blocks in this order, which is roughly the reverse of how most teams approach it:

Context layer first. Decide exactly what the model sees on each step before you write a single tool. This forces clarity about what the agent actually needs to know.
Orchestration loop second. Write the control structure: stopping conditions, error handling, iteration limits. In code, not prompt.
Tools third. Add the minimum set of tools the task genuinely requires. Resist adding more.
Model last. Start with a mid-tier model. Only move up if you have evidence the bottleneck is genuinely reasoning, which it usually is not.

Most teams do this in the exact opposite order: model first, tools second, and the loop and context layer never, because the demo worked without them. Then production arrives and the two skipped blocks are exactly what production needed.

If your agent works in the demo but not in production

That gap is almost always two missing blocks, not a weak model. The demo worked because a human drove it down the happy path; production fails because real users find the paths your missing orchestration loop and context layer were supposed to handle.

Sapota runs a one-week agent architecture audit that maps your agent against these four blocks, identifies which ones the model is currently improvising, and ships the missing pieces as a working integration. We have done this for support agents, document processors, reporting agents, and internal copilots. The missing blocks are remarkably consistent across all of them.

Reach out via the AI engineering page with a description of what your agent does and where it breaks. We can usually tell you which block is missing within the first conversation.

Daniel Duong

Salesforce + AI Engineer
