A five-step workflow where each step succeeds 95% of the time has only about a 77% chance of completing end to end. Run it twenty times a day and you are looking at four or five failed runs every day. That is not a reliability problem you can fix by making individual tools better. Each tool is already doing its job. The failures live in the spaces between them -- in the handoffs, the context passing, the partial completions that leave your system in a state no single tool was designed to handle.
This is the composition gap, and it is where most teams building agent automation get stuck. They spend months packaging excellent tools -- good schemas, typed errors, idempotency keys, the whole discipline covered earlier in this series -- and then discover that wiring five excellent tools together into a repeatable workflow is a different engineering problem entirely. The tools are components. The workflow is the product. And the layer that turns one into the other is where the actual difficulty lives.
This is precisely the problem that platforms like AgentPMT are built to address. Rather than forcing teams to hand-wire composition logic from scratch, AgentPMT's workflow builder provides a visual, drag-and-drop interface for defining multi-step tool chains -- sequencing, data flow, and failure handling included. The gap between a toolbox and a workflow is real, but it does not have to be a gap you bridge alone.
The Gap Between a Toolbox and a Workflow
Consider a concrete example. A senior operations engineer knows that assessing customer churn risk means: pull the usage data from the product analytics API, check the support ticket history, verify the payment status, run a risk-scoring model, and draft a retention email if the score exceeds a threshold. That is five tool calls. Any of them could fail. The output of each feeds the input of the next. And the order matters -- you do not draft a retention email before you know whether the customer is actually at risk.
This knowledge -- which tools, in what order, with what data flowing between them, under what conditions -- is the workflow. It exists in the senior engineer's head, which means it executes exactly once per senior engineer per unit of time. The entire promise of agent automation is encoding that knowledge so it runs without the engineer in the loop. But "encoding" is doing a lot of heavy lifting in that sentence.
The composition layer has to solve at least four problems that individual tools do not.
Sequencing and data flow. Which tool runs first, what context passes to the next step, and how do you transform the output of step two into a valid input for step three? If the usage data API returns a nested JSON object and the risk-scoring model expects a flat feature vector, something has to bridge that gap. The model can improvise this mapping -- it is, after all, remarkably good at data transformation -- but "improvise" is exactly the word you do not want in a production workflow. Improvisation means non-determinism. Non-determinism means the workflow works differently every Tuesday.
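A minimal sketch of what a deterministic bridge at that boundary can look like, with hypothetical field names standing in for the real usage payload and feature vector:

```python
from typing import Any

# Hypothetical top-level fields; adjust to the real usage-data API response.
REQUIRED_FIELDS = ("events", "account")

def to_feature_vector(usage: dict[str, Any]) -> dict[str, float]:
    """Deterministically flatten a nested usage payload into the flat feature
    dict the risk-scoring step expects. Raise instead of guessing, so a schema
    change fails loudly at the step boundary rather than silently downstream."""
    missing = [field for field in REQUIRED_FIELDS if field not in usage]
    if missing:
        raise ValueError(f"usage payload missing fields: {missing}")
    events = usage["events"]
    account = usage["account"]
    return {
        "logins_30d": float(events.get("logins_30d", 0)),
        "feature_adoption": float(events.get("features_used", 0)) / 10.0,
        "seats_active_ratio": float(account.get("active_seats", 0))
        / max(float(account.get("licensed_seats", 1)), 1.0),
    }
```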
Partial failure handling. Step three fails. Steps one and two succeeded and their results are sitting in memory. Do you retry step three? Do you roll back steps one and two? What if step two sent a webhook that cannot be unsent? The research on why multi-agent LLM systems fail is unambiguous on this point: in chained workflows, failures compound. A downstream agent carrying forward incorrect or missing inputs from an upstream failure does not just fail -- it fails in ways that are harder to diagnose because the root cause is two steps back.
Context window management. Every tool call consumes tokens -- for the request, the response, and the model's reasoning about what to do next. A five-step workflow can easily consume 10,000 tokens just on orchestration overhead. If you are passing the full output of each step to the next, your context window fills up and the model starts dropping information from earlier steps. This is not a theoretical concern. It is the most common source of degraded quality in long-running agent workflows, and it gets worse as you add steps.
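One common mitigation is to keep full step outputs outside the context window and pass forward only a bounded summary plus a reference. A rough sketch, where an in-memory dict stands in for whatever external artifact store you actually use:

```python
import json

MAX_CHARS_PER_STEP = 2_000  # rough per-step budget; tune to your model's context

def compact_step_output(step_name: str, output: dict, store: dict[str, dict]) -> str:
    """Stash the full step output in an external store and pass forward only a
    bounded summary plus a reference key, so earlier steps are not squeezed out
    of the context window as the workflow grows."""
    store[step_name] = output  # full payload lives outside the context window
    summary = json.dumps(output, default=str)
    if len(summary) > MAX_CHARS_PER_STEP:
        summary = summary[:MAX_CHARS_PER_STEP] + "...(truncated)"
    return f"[{step_name}] ref={step_name} summary={summary}"
```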
Cost accumulation. Each step in a workflow has its own cost profile: model tokens, tool fees, API charges. A workflow that costs $0.15 per run and executes 500 times a day is a $75 daily expense. If one step has a retry loop that averages 2.3 attempts, that step's contribution to cost is more than double what you budgeted. Budget caps on individual tools do not prevent runaway spend on the workflow as a whole unless you have a budget at the workflow level too.
These are not novel problems. Distributed systems engineers have been solving them for decades with patterns like the saga pattern, compensating transactions, and durable execution engines. What is new is applying them to workflows where the orchestrator is a language model that makes probabilistic decisions at each step.
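To make the saga idea concrete, here is a hand-rolled sketch, not a production implementation: each step registers a compensating action, and if a later step fails, the compensations run in reverse order. The step and undo functions in the usage comment are hypothetical.

```python
from typing import Callable

class Saga:
    """Minimal saga: each completed step registers a compensating action;
    on failure, compensations run in reverse order to undo completed work."""

    def __init__(self) -> None:
        self._compensations: list[Callable[[], None]] = []

    def run_step(self, action: Callable[[], object], compensate: Callable[[], None]) -> object:
        result = action()
        self._compensations.append(compensate)
        return result

    def abort(self) -> None:
        for undo in reversed(self._compensations):
            undo()  # e.g. release a hold or delete a draft; a webhook that cannot
                    # be unsent needs a follow-up "cancel" event instead

# Usage sketch (fetch_usage, create_draft_email, delete_draft are hypothetical):
# saga = Saga()
# try:
#     usage = saga.run_step(fetch_usage, lambda: None)          # read-only: nothing to undo
#     draft = saga.run_step(create_draft_email, delete_draft)   # side effect: register its undo
# except Exception:
#     saga.abort()
```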
MCP Prompts as Workflow Templates
The MCP specification includes three primitives: tools, resources, and prompts. Most teams focus on tools. Resources get some attention. Prompts are consistently underused, which is unfortunate because they are the closest thing the protocol has to a native workflow definition format.
An MCP prompt is a reusable, parameterized template that returns structured messages to guide an agent's behavior. Unlike tools, which perform actions, prompts encode instructions for how to use tools. They are, in effect, playbooks.
Here is what this looks like in practice. A server that exposes three tools -- list-bookmarks, save-bookmark, and delete-bookmark -- gives an agent capabilities. An agent with those three tools can do things, but it does not know what to do. A prompt called research_roundup changes that. As Zuplo's analysis of MCP prompts describes, the prompt can instruct the agent to fetch bookmarks, filter by a parameterized timeframe, group by tags to identify research themes, summarize findings, and suggest next steps. The prompt turns a bag of tools into a directed workflow.
This matters for composition because prompts can reference both tools and resources. A prompt can say "use the get_usage_data tool, then read the customer_profile resource, then call the score_risk tool with the combined context." The WorkOS guide to MCP features illustrates this with a multi-server coordination example: a prompt initiates a workflow, resources provide decision context, and tools execute coordinated actions across different systems. The prompt is the choreographer.
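As an illustration, a churn-risk playbook encoded as a parameterized prompt might look like the following sketch, written against the official MCP Python SDK's FastMCP helper. The tool and resource names and the 0.7 threshold are assumptions carried over from the example above, and the exact decorator behavior may vary by SDK version.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("churn-workflows")

@mcp.prompt()
def churn_risk_review(account_id: str, segment: str = "enterprise") -> str:
    """Playbook prompt: walks the agent through the churn-risk workflow."""
    return (
        f"Assess churn risk for account {account_id} in the {segment} segment.\n"
        "1. Call the get_usage_data tool for the last 30 days of activity.\n"
        "2. Read the customer_profile resource for this account.\n"
        "3. Call the score_risk tool with the combined context.\n"
        "4. If the score exceeds 0.7, call the draft_retention_email tool; "
        "otherwise, report the score and stop."
    )

if __name__ == "__main__":
    mcp.run()
```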
The practical advantages of encoding workflows as MCP prompts are significant. Prompts are versioned on the server -- you update the workflow definition once, and every client gets the new version without code changes. They are discoverable through the standard prompts/list endpoint, so agents can find available workflows the same way they find available tools. And they are parameterized, so the same workflow template can handle "churn risk for enterprise accounts" and "churn risk for SMB accounts" with different arguments. AgentPMT's skills builder takes this concept further, allowing teams to package prompt-driven workflow templates as reusable skills that can be shared, versioned, and composed into larger automations without rewriting orchestration logic each time.
The limitation is equally significant: MCP prompts are user-initiated. The spec is explicit that prompts are "designed to be user-controlled," meaning a human selects which prompt to invoke. They are not autonomous workflow triggers. For fully autonomous multi-step execution, you still need an orchestration layer above the prompt -- something that decides when to invoke the workflow, monitors its progress, and handles failures. The prompt gives you the playbook. It does not give you the player.
Reliability Compounds When You Chain Tools
Article 11 in this series covered the SRE patterns that make individual agent operations reliable: circuit breakers, idempotency, error classification, deterministic contracts. All of those patterns remain essential. But when you compose tools into workflows, their effects multiply in ways that are not obvious.
Return to the 95% per-step reliability example. That 77% end-to-end figure assumes failures are independent, which they never are. If step two fails because of a rate limit on a shared API, step four -- which calls the same API -- will probably fail too. Correlated failures drop your actual reliability below the independent model, sometimes dramatically. A 2025 study of multi-agent system failures found that communication breakdowns, context loss, and coordination failures are the dominant failure modes, and they all get worse as workflow length increases.
The circuit breaker pattern adapts to workflows by operating at the workflow level, not just the tool level. If step three has failed five times in the last hour, a workflow-level circuit breaker can stop initiating new workflow runs entirely instead of letting them proceed to step three and fail there. This prevents the accumulation of partial completions that clog your system and complicate recovery.
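A bare-bones sketch of that workflow-level breaker, with the threshold, window, and cooldown chosen arbitrarily for illustration:

```python
import time

class WorkflowCircuitBreaker:
    """Stop initiating new workflow runs when a step keeps failing, instead of
    letting every new run proceed to the same step and fail there."""

    def __init__(self, failure_threshold: int = 5, window_s: float = 3600.0,
                 cooldown_s: float = 900.0) -> None:
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._failures: list[float] = []
        self._opened_at: float | None = None

    def record_failure(self) -> None:
        now = time.time()
        self._failures = [t for t in self._failures if now - t < self.window_s]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now  # open the breaker at the workflow level

    def allow_new_run(self) -> bool:
        if self._opened_at is None:
            return True
        if time.time() - self._opened_at > self.cooldown_s:
            self._opened_at = None   # half-open: allow a probe run
            self._failures.clear()
            return True
        return False
```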
Idempotency becomes harder and more important. A single idempotent tool call is straightforward: same key, same result. A five-step workflow needs idempotency at the workflow level -- if the entire workflow is retried, you need to know which steps already completed and resume from the point of failure, not re-execute completed steps. This is exactly what durable execution platforms like Temporal provide. Temporal Workflows automatically capture state at every step, and if a failure occurs, execution resumes where it left off. No orphaned processes, no duplicate side effects, no manual recovery. AgentPMT's audit trails complement this by providing workflow step tracking -- a complete record of every tool call, every input, and every output at each stage of a workflow run, so when failures do occur, the diagnostic context is already captured.
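The checkpoint-and-resume idea can be illustrated without any particular platform. The following hand-rolled sketch is not Temporal's API; it simply persists completed step results under a run id so a retried workflow skips what already succeeded:

```python
import json
from pathlib import Path
from typing import Callable

def run_with_checkpoints(run_id: str,
                         steps: list[tuple[str, Callable[[dict], dict]]]) -> dict:
    """Workflow-level idempotency, illustrated: completed step results are
    persisted under the run id, so retrying the whole workflow resumes at the
    first incomplete step instead of re-executing finished ones."""
    checkpoint = Path(f"checkpoints/{run_id}.json")
    checkpoint.parent.mkdir(parents=True, exist_ok=True)
    state: dict = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for name, step in steps:
        if name in state:              # already completed on a previous attempt
            continue
        state[name] = step(state)      # execute the step against accumulated state
        checkpoint.write_text(json.dumps(state, default=str))  # persist before moving on
    return state
```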
Budget caps also compound. If each of five tools has a $0.50 per-call cap, a naive implementation might assume $2.50 in spend per workflow run. But every retry is another capped call: if step three fails and needs three retries before succeeding, that one step can cost up to $2.00 on its own, and the run can reach $4.00. Workflow-level budget enforcement -- separate from and in addition to tool-level caps -- is how you prevent a retry storm in one step from consuming the entire daily budget.
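A minimal sketch of an aggregate budget object that every tool attempt, retries included, charges against; the cap and step names are placeholders:

```python
class WorkflowBudget:
    """Aggregate cap across all steps and retries of a single workflow run,
    enforced in addition to any per-tool caps."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, step: str, cost_usd: float) -> None:
        if self.spent_usd + cost_usd > self.cap_usd:
            raise RuntimeError(
                f"workflow budget exceeded at step {step!r}: "
                f"{self.spent_usd + cost_usd:.2f} > cap {self.cap_usd:.2f}"
            )
        self.spent_usd += cost_usd

# budget = WorkflowBudget(cap_usd=3.00)
# budget.charge("score_risk", 0.50)  # call once per tool attempt, retries included
```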
Versioning and Testing Composed Workflows
Individual tools have a testing story that developers understand: unit tests for business logic, contract tests for the schema, integration tests for external dependencies. Workflows need all of that, plus testing for the interactions between steps.
The fundamental challenge is that changing one tool in a five-tool workflow can break the workflow without breaking the tool. If get_usage_data starts returning a new field and score_risk does not expect it, the tool-level contract tests both pass. The workflow fails at the boundary between them.
Contract testing -- specifically consumer-driven contract testing -- is the pattern that addresses this. In a workflow, each step is both a consumer of the previous step's output and a provider of the next step's input. A contract test verifies that the provider's actual output matches what the consumer expects. When you change a tool, the contract tests for every downstream consumer tell you whether the change is safe.
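A lightweight version of that check at one step boundary might look like this sketch, where the consumer contract and fixture path are hypothetical and the fixture is a provider output captured from a recent real run:

```python
import json
from pathlib import Path

# Consumer contract: the fields score_risk needs from get_usage_data's output
# (hypothetical field names, matching the earlier sketch).
SCORE_RISK_EXPECTS = {
    "logins_30d": (int, float),
    "feature_adoption": (int, float),
    "seats_active_ratio": (int, float),
}

def load_recorded_output(step: str) -> dict:
    # Hypothetical fixture path: a provider output recorded from a recent real run.
    return json.loads(Path(f"fixtures/{step}.json").read_text())

def test_usage_data_satisfies_score_risk_contract():
    provider_output = load_recorded_output("get_usage_data")
    for field, allowed_types in SCORE_RISK_EXPECTS.items():
        assert field in provider_output, f"provider output missing field: {field}"
        assert isinstance(provider_output[field], allowed_types), f"wrong type for {field}"
```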
The workflow testing pyramid looks different from the service testing pyramid. At the base: unit tests for individual tool logic. In the middle: contract tests between adjacent workflow steps, verifying that output schemas match input expectations. At the top: end-to-end workflow tests using golden runs.
Golden run testing is the workflow equivalent of snapshot testing. You capture one complete, successful execution of the workflow -- every input, every intermediate result, every output -- and store it as a reference. After any change to any tool or to the workflow definition itself, you replay the golden run and compare results. If the output changes, the test fails, and you decide whether the change is intentional or a regression.
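In test form, a golden run replay can be as simple as the following sketch; the fixture path and the run_workflow runner are hypothetical stand-ins for your own harness:

```python
import json
from pathlib import Path

def test_churn_workflow_matches_golden_run():
    """Replay the workflow against the captured golden inputs and compare every
    intermediate result to the stored reference."""
    golden = json.loads(Path("golden/churn_risk_v3.json").read_text())  # hypothetical fixture
    actual = run_workflow(golden["inputs"], record_intermediates=True)   # hypothetical runner
    for step, expected in golden["intermediates"].items():
        assert actual[step] == expected, f"step {step!r} diverged from the golden run"
```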
This approach has a specific limitation: golden runs only test the happy path. You also need failure injection tests that simulate partial failures at each step. What happens when step three times out? When step two returns an empty result set? When step four hits a budget cap? Each of these scenarios should be a test case with a defined expected behavior. If the expected behavior is "escalate to a human with the failure context," verify that the escalation actually happens with the right context.
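A sketch of what those failure-injection cases might look like as parameterized tests, again with a hypothetical runner, inputs, and outcome names:

```python
import pytest

# Each case forces one step to fail and asserts the workflow's defined fallback
# behaviour; run_workflow and SAMPLE_INPUTS are hypothetical harness pieces.
@pytest.mark.parametrize("failing_step, expected_outcome", [
    ("get_usage_data", "escalate_to_human"),
    ("score_risk", "retry_then_escalate"),
    ("draft_retention_email", "skip_and_report"),
])
def test_partial_failure_behaviour(failing_step, expected_outcome):
    result = run_workflow(SAMPLE_INPUTS, inject_failure_at=failing_step)
    assert result.outcome == expected_outcome
    if expected_outcome == "escalate_to_human":
        # The failure context must actually reach the human, not just an error code.
        assert result.escalation.context["failed_step"] == failing_step
```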
The cost of not testing workflows is not abstract. It is the 3 a.m. page about a churn-risk workflow that has been sending blank retention emails for six hours because a tool update changed the format of the risk score from a float to a string.
From Workflow to Product
There is a specific moment in a workflow's lifecycle where it crosses from "internal automation" to "product." It is the moment when someone outside your team says "can I use that?"
A workflow that reliably assesses churn risk, runs within budget, handles partial failures, and has been golden-run tested for three months is not just an automation. It is a capability that other teams want. Internal productization -- making the workflow available to other teams within your organization with documented inputs, outputs, and costs -- is the first step. External productization -- listing the workflow in a marketplace where other organizations can use it -- is the second.
The requirements for internal productization are documentation and access control. Other teams need to know what the workflow does, what it costs, what inputs it expects, and what guarantees it provides. They need to be able to run it under their own budget caps without affecting yours.
External productization adds discoverability and payment. The workflow needs to be findable by agents that are looking for churn risk assessment capabilities. It needs pricing metadata so agents can evaluate the cost before committing. And it needs the same reliability guarantees you would expect from any production service -- SLOs, error budgets, and graceful degradation under load. This is where AgentPMT's DynamicMCP provides the distribution layer: a workflow listed through a dynamic MCP server is discoverable, executable, and billable through the same protocol the agent already speaks. Combined with the AgentPMT marketplace, teams can publish their battle-tested workflows for other organizations to discover, evaluate, and deploy -- turning internal engineering investment into a shareable product.
The economic logic is straightforward. Building a five-tool workflow takes weeks of engineering effort -- designing the composition, handling edge cases, testing failure modes, tuning budgets. That effort has value. If the workflow works for your churn-risk use case, it will work for other companies' churn-risk use cases with different tool providers underneath. The workflow template -- the sequencing logic, the failure handling, the context management -- is the product. The individual tools are components.
Implications for Teams Building Agent Workflows
The composition gap is not a temporary growing pain that the ecosystem will solve on its own. It is a fundamental layer of engineering that teams must invest in deliberately. Several implications follow from this.
Treat workflow composition as its own discipline. The engineers who build excellent individual tools are not automatically the right people to compose those tools into workflows. Composition requires systems thinking -- understanding failure modes across boundaries, managing state across steps, and designing for partial success. Organizations that staff workflow composition as a distinct function, rather than assuming it falls out of tool development, will ship reliable automations faster.
Invest in observability before you invest in scale. A workflow you cannot observe is a workflow you cannot debug. Before scaling a workflow from ten runs per day to five hundred, ensure you have end-to-end tracing that shows exactly what happened at each step, what data flowed between steps, and where failures occurred. AgentPMT's audit trail capabilities -- tracking every tool call and its result at each workflow step -- provide this foundation without requiring teams to build custom instrumentation.
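For teams rolling their own instrumentation first, a minimal step-tracing wrapper along these lines captures the essentials; printing to stdout is a stand-in for whatever tracing backend you actually use:

```python
import json
import time
import uuid
from typing import Callable

def traced(step_name: str, fn: Callable[..., dict], run_id: str) -> Callable[..., dict]:
    """Wrap a workflow step so every call emits a structured trace record:
    which run, which step, what went in, what came out, and how long it took."""
    def wrapper(*args, **kwargs):
        record = {
            "run_id": run_id,
            "step": step_name,
            "span_id": uuid.uuid4().hex,
            "inputs": {"args": args, "kwargs": kwargs},
            "started_at": time.time(),
        }
        try:
            result = fn(*args, **kwargs)
            record.update(status="ok", output_keys=sorted(result.keys()))
            return result
        except Exception as exc:
            record.update(status="error", error=repr(exc))
            raise
        finally:
            record["duration_s"] = time.time() - record["started_at"]
            print(json.dumps(record, default=str))  # stand-in for a real tracing backend
    return wrapper
```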
Budget at the workflow level, not just the tool level. Tool-level budget caps are necessary but insufficient. A workflow that chains five tools with individual caps still needs an aggregate budget that accounts for retries, partial failures, and correlated cost spikes. Teams that discover this after a production incident learn it the expensive way.
Start with two-tool chains before building five-tool workflows. The composition problems -- data transformation between steps, partial failure handling, context management -- are all present in a two-step workflow. They are just easier to diagnose and fix. Teams that master two-tool composition and then extend to three, four, five steps build on proven patterns. Teams that jump straight to complex workflows spend most of their time debugging interactions rather than delivering value.
What to Watch
Three developments will shape how workflow composition evolves over the next twelve months.
The MCP specification's November 2025 update introduced the Tasks primitive, which allows servers to perform asynchronous, long-running operations. This moves MCP from synchronous tool calling toward native workflow execution -- a server can now accept a workflow request, execute it over minutes or hours, and report progress and results asynchronously. Watch for how the ecosystem adapts to this: workflow-aware MCP servers that treat multi-step execution as a first-class operation rather than something the client has to coordinate.
Durable execution platforms are converging with agent orchestration. Temporal's integration with OpenAI brings workflow durability directly into the agent layer, and similar integrations are appearing across the ecosystem. The pattern is clear: agent workflows need the same infrastructure that transaction processing systems have used for years. Expect "durable agent workflow" to become a standard capability rather than a custom build.
Contract testing for workflow boundaries is nascent but growing. As teams move from two-tool chains to ten-tool workflows, the need for automated compatibility verification between steps will drive tooling. Pact-style consumer-driven contract testing applied to workflow step boundaries is an obvious evolution. The teams that invest in this now will be the teams that can change tools without breaking workflows later.
Key Takeaways
- The composition layer between tools and workflows -- sequencing, data flow, partial failure handling, and cost accumulation -- is where most of the engineering effort actually lives. Excellent tools do not add up to a reliable workflow without deliberate composition design.
- MCP prompts are an underused primitive that can serve as versionable, discoverable workflow templates, encoding expert knowledge about which tools to call in what order. They are the playbook; you still need the orchestration engine to execute it.
- Workflow-level testing (contract tests between steps, golden run replays, failure injection) is not optional. A five-tool workflow that has only been tested at the tool level is a workflow that has not been tested.
Sources
- MCP Prompts Specification - modelcontextprotocol.io
- Add Reusable MCP Tool Workflows to AI with MCP Prompts - zuplo.com
- Understanding MCP Features: Tools, Resources, Prompts, Sampling, Roots, and Elicitation - workos.com
- MCP 2025-11-25 Specification Changelog - modelcontextprotocol.io
- Why Do Multi-Agent LLM Systems Fail? - arxiv.org
- Orchestrating Ambient Agents with Temporal - temporal.io
- Fault Tolerance in Distributed Systems - temporal.io
- Introduction to Contract Testing with Pact - docs.pact.io
- Contract Testing: Shifting Left with Confidence - tweag.io
- MCP Enterprise Readiness: How the 2025-11-25 Spec Closes the Production Gap - subramanya.ai
Ready to move from individual tools to production-grade workflows? Explore AgentPMT to see how the workflow builder, skills builder, and DynamicMCP can help your team bridge the composition gap.
