Two Agents Are a Distributed System

By Stephanie Goodman | January 20, 2026

The moment you add a second agent, every problem distributed computing spent forty years solving comes rushing back. Here is how to coordinate multiple agents without reinventing consensus protocols from scratch.

A single agent calling tools is a program. It has inputs, it has outputs, it has a control flow you can trace. You can instrument it, cap its budget, and replay its failures. We have covered how to do all of that -- workflow reliability, deterministic tool design, budget scoping. The engineering is hard but the conceptual model is familiar: one caller, many callees, linear accountability.

Add a second agent, and the model breaks. Now you have two callers that might invoke the same tool, update the same record, or spend from the same budget. You have handoffs where context must survive the transfer. You have the question of who is in charge when both agents disagree about what to do next. You have, in other words, a distributed system -- and distributed systems have been eating engineering teams alive since long before anyone attached a language model to an API.

The multi-agent future is arriving fast. Google launched the Agent2Agent protocol with more than fifty technology partners. Microsoft merged AutoGen and Semantic Kernel into a unified Agent Framework targeting general availability in early 2026. CrewAI, LangGraph, and OpenAI's Agents SDK all ship multi-agent primitives. The infrastructure is being built. The question is whether teams adopting it will repeat the mistakes that distributed systems engineers spent decades correcting, or whether they will learn from that history and skip the expensive parts. Infrastructure platforms like AgentPMT — with centralized tool governance through DynamicMCP, per-agent budget controls, and structured audit trails — provide the coordination layer that prevents teams from rebuilding distributed systems primitives from scratch.

This article is about the expensive parts: the contracts between agents, the coordination patterns that actually work, the budget allocation problem that multiplies when agents multiply, and why shared mutable state is the trap that looks like a shortcut.

Agent-to-Agent Contracts Are Not Tool Contracts

In our piece on deterministic tools, we covered the contract between an agent and a tool: strict input schemas, typed outputs, validation at the boundary. That contract works because the relationship is asymmetric. The agent calls. The tool executes. The tool has no opinions, no memory, no agenda. It is a function.

Agent-to-agent contracts are fundamentally different. Both parties have autonomy. Both can generate novel outputs. Both might interpret the same instruction differently. And both are non-deterministic -- meaning the contract between them must account for the fact that neither side will behave identically across runs.

A useful agent-to-agent contract specifies four things. First, the capability boundary: what this agent can do, what it cannot do, and what it will refuse to do. Google's A2A protocol captures this with "Agent Cards" -- JSON documents that declare an agent's skills, supported input types, and authentication requirements. This is not a courtesy. It is how a coordinator agent decides who gets which task without hallucinating capabilities that do not exist.
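A minimal sketch of what such a declaration can look like, written as a plain Python dict. The field names approximate the shape of an A2A Agent Card for illustration; they are not the exact A2A schema.

```python
# Illustrative capability declaration, loosely modeled on an A2A Agent Card.
# Field names are approximations, not the exact A2A spec.
research_agent_card = {
    "name": "research-agent",
    "description": "Gathers and summarizes sourced competitive intelligence.",
    "version": "0.1.0",
    "skills": [
        {
            "id": "competitive-research",
            "description": "Produce sourced claims about a named competitor.",
            "input_modes": ["text/plain"],
            "output_modes": ["application/json"],
        }
    ],
    # Declared refusals matter as much as declared skills: the coordinator
    # should never route these task types here.
    "refuses": ["send-email", "modify-production-data"],
    "auth": {"scheme": "bearer"},
}
```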

Second, the input/output schema. Not just data types, but semantic expectations. If a research agent hands off to a writing agent, the contract should define what "research output" looks like: a structured object with sourced claims, not a freeform paragraph that might or might not contain what the writer needs. The tighter this contract, the fewer retries and clarification loops you pay for.
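As a sketch of what "tighter" means in practice, here is a research-output contract written with Pydantic v2 (any schema library works); the model and field names are assumptions for illustration, not a standard.

```python
from pydantic import BaseModel, Field, HttpUrl


class SourcedClaim(BaseModel):
    """A single claim the writing agent can cite without re-verifying."""
    claim: str
    source_url: HttpUrl
    retrieved_at: str  # ISO 8601 timestamp
    confidence: float = Field(ge=0.0, le=1.0)


class ResearchOutput(BaseModel):
    """What the research agent must hand to the writing agent."""
    topic: str
    claims: list[SourcedClaim] = Field(min_length=1)  # at least one sourced claim
    open_questions: list[str] = []


# The writing agent validates at the boundary instead of trusting freeform prose.
handoff = ResearchOutput(
    topic="competitor pricing",
    claims=[
        SourcedClaim(
            claim="Competitor X raised list prices 8% in Q3.",
            source_url="https://example.com/pricing-note",
            retrieved_at="2026-01-15T09:30:00Z",
            confidence=0.8,
        )
    ],
)
```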

Third, a delegation protocol. Who initiates the handoff? What context transfers with it? What happens if the receiving agent rejects the task? OpenAI's original Swarm framework reduced this to a single primitive -- handoff functions that return the next agent -- and the simplicity was the point. A delegation protocol that requires a committee meeting is a delegation protocol that will be bypassed.

Fourth, and most overlooked: a completion signal. How does the delegating agent know the work is done? Does it poll? Does the receiving agent push a result? Is there a timeout? Without a defined completion protocol, you get agents waiting indefinitely, agents re-dispatching work that is already in progress, and the coordination overhead growing faster than the work itself.
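A minimal sketch of a handoff envelope that makes the delegation protocol and the completion signal explicit. The names here (TaskHandoff, TaskResult, the status values) are illustrative, not taken from any particular framework.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class TaskStatus(str, Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class TaskHandoff:
    """What the delegating agent sends: task, context, and a deadline."""
    task_id: str
    capability: str          # must match a skill on the receiver's card
    context: dict            # everything the receiver needs; no hidden state
    deadline_seconds: float  # after this, the delegator may re-dispatch


@dataclass
class TaskResult:
    """The completion signal. Without it, the delegator waits or re-dispatches blindly."""
    task_id: str
    status: TaskStatus
    output: dict | None = None
    reason: str | None = None  # populated on REJECTED or FAILED


handoff = TaskHandoff(
    task_id=str(uuid.uuid4()),
    capability="competitive-research",
    context={"competitor": "Example Corp", "depth": "summary"},
    deadline_seconds=300,
)
```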

Coordination Patterns: Three That Work, One That Looks Like It Does

The multi-agent frameworks converging in 2025 and 2026 have settled on a handful of coordination patterns. Understanding the tradeoffs between them is the difference between a system that scales and one that collapses under its own messaging overhead.

Pipeline coordination is the simplest. Agent A finishes, passes output to Agent B, who passes output to Agent C. This is sequential, easy to reason about, and easy to instrument. LangGraph implements this as a graph with linear edges. The failure modes are well-understood: if B fails, you know exactly where the chain broke and what needs to be retried. The limitation is equally obvious -- it is serial, so latency is the sum of all stages, and there is no parallelism to exploit.

Pipeline works best when the task has natural stages that cannot overlap. Document processing is the classic case: extract, transform, validate, load. Each stage needs the prior stage's output. Trying to parallelize a pipeline that has genuine data dependencies is not clever engineering. It is a race condition waiting to happen.
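A framework-independent sketch of the pattern: each stage is a function of the previous stage's output, so a failure pinpoints exactly which stage to retry. The stage functions here are placeholders standing in for agents or tool calls.

```python
def run_pipeline(document: str) -> dict:
    """Sequential stages; each consumes the prior stage's output.

    If a stage raises, the caller knows exactly where the chain broke
    and can retry from that stage instead of from the start.
    """
    extracted = extract_stage(document)       # e.g. a parsing agent
    transformed = transform_stage(extracted)  # e.g. a normalizing agent
    validate_stage(transformed)               # raises on contract violations
    return load_stage(transformed)            # e.g. a persistence tool call


# Placeholder stages so the sketch runs end to end.
def extract_stage(doc: str) -> dict:
    return {"text": doc}

def transform_stage(data: dict) -> dict:
    return {"text": data["text"].strip().lower()}

def validate_stage(data: dict) -> None:
    if not data["text"]:
        raise ValueError("transform produced empty output")

def load_stage(data: dict) -> dict:
    return {"status": "loaded", "chars": len(data["text"])}


print(run_pipeline("  Quarterly Report  "))
```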

Fan-out/fan-in is what you reach for when a coordinator agent can decompose a task into independent subtasks. The coordinator dispatches work to specialist agents in parallel, then aggregates their results. CrewAI's hierarchical process implements this pattern, with a manager agent distributing tasks and collecting outputs. LangGraph supports it with parallel branching and synchronization barriers.

The engineering challenge in fan-out/fan-in is the fan-in. Dispatching work is easy. Knowing when all results are back, handling partial failures (three of four specialists succeeded -- is that good enough?), and merging potentially conflicting outputs from independent agents -- that is where the complexity lives. This is not new. MapReduce solved the same problem. The difference is that MapReduce workers are deterministic functions, and your specialist agents are language models that might return structurally different responses to the same prompt.

Define your merge strategy before you fan out. If you cannot specify how conflicting results will be reconciled, you are not ready for this pattern.
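A sketch of fan-out/fan-in with asyncio, assuming each specialist is an async callable. The merge policy here -- require a quorum of successes, then take a naive union -- is one choice among many; the point is that it is written down before dispatch.

```python
import asyncio


async def fan_out_fan_in(task: str, specialists: list, quorum: int) -> dict:
    """Dispatch to all specialists in parallel, then merge the survivors."""
    results = await asyncio.gather(
        *(agent(task) for agent in specialists),
        return_exceptions=True,  # partial failure is expected, not fatal
    )
    successes = [r for r in results if not isinstance(r, Exception)]
    if len(successes) < quorum:
        raise RuntimeError(
            f"only {len(successes)}/{len(specialists)} specialists succeeded"
        )
    # Merge strategy decided up front: naive union, later results win on key
    # conflicts. A real system might route conflicts to a reconciliation agent.
    merged: dict = {}
    for result in successes:
        merged.update(result)
    return merged


# Placeholder specialists so the sketch runs.
async def pricing_agent(task: str) -> dict:
    return {"pricing": "competitor raised list prices 8%"}

async def hiring_agent(task: str) -> dict:
    return {"hiring": "competitor opened 40 new roles"}


print(asyncio.run(
    fan_out_fan_in("profile Example Corp", [pricing_agent, hiring_agent], quorum=2)
))
```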

Blackboard coordination is the oldest of the three -- Hayes-Roth described it in 1985 -- and it is making a comeback in multi-agent AI. In the blackboard pattern, agents share a common data structure. Each agent reads the current state, decides whether it can contribute, and writes its contribution back. There is no explicit delegation. Agents self-select based on the state of the problem.

Recent arXiv preprints report that blackboard architectures can deliver strong accuracy with significant token efficiency in LLM-based multi-agent systems. AWS's Strands framework uses a variant of this pattern for multi-agent collaboration. The appeal is flexibility: you do not need to predefine the workflow, and agents can contribute opportunistically.
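A minimal sketch of the control loop, with a plain dict as the blackboard and simple can_contribute/contribute methods standing in for real agents. Real implementations add contribution tracking and concurrency control, which is exactly where the warning in the next section applies.

```python
def run_blackboard(blackboard: dict, agents: list, max_rounds: int = 10) -> dict:
    """Agents self-select: each round, any agent that can contribute writes back."""
    for _ in range(max_rounds):
        progressed = False
        for agent in agents:
            if agent.can_contribute(blackboard):
                update = agent.contribute(blackboard)
                blackboard.update(update)  # shared mutable state -- see below
                progressed = True
        if not progressed or blackboard.get("done"):
            break
    return blackboard


class OutlineAgent:
    """Contributes an outline once a topic exists on the blackboard."""
    def can_contribute(self, bb): return "topic" in bb and "outline" not in bb
    def contribute(self, bb): return {"outline": ["intro", "body", "conclusion"]}


class DraftAgent:
    """Contributes a draft once an outline exists."""
    def can_contribute(self, bb): return "outline" in bb and "draft" not in bb
    def contribute(self, bb): return {"draft": "...", "done": True}


print(run_blackboard({"topic": "multi-agent coordination"}, [OutlineAgent(), DraftAgent()]))
```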

The danger is also flexibility. A blackboard is shared mutable state, and shared mutable state is where distributed systems go to create bugs that only appear under load, on Tuesdays, when the moon is in the right phase. Which brings us to the part that matters most.

Shared State Is a Solved Problem. The Solution Is Not "Let Everyone Write."

Here is the pattern that repeats itself every time a new paradigm discovers concurrency: multiple actors need to coordinate, someone builds shared mutable state because it is the obvious solution, and then six months later the team is debugging race conditions, lost updates, and ordering anomalies that are nearly impossible to reproduce.

Leslie Lamport published "Time, Clocks, and the Ordering of Events in a Distributed System" in 1978. The core insight -- that you cannot establish a global ordering of events in a distributed system without explicit coordination mechanisms -- is forty-seven years old and still catches people off guard. Raft and Paxos exist because this problem is hard enough that it needed formal consensus protocols, not ad hoc locking.

Multi-agent systems that share state are distributed systems that share state. The fact that the actors are language models instead of microservices does not change the fundamental problem. When two agents read the same blackboard, both decide to act, and both write back, you get a last-writer-wins conflict that silently discards one agent's work. When an agent reads stale state because another agent's write has not propagated yet, it makes decisions based on information that is already wrong.

The multi-agent frameworks are beginning to address this. Agent Blackboard on GitHub implements a shared knowledge repository with explicit contribution tracking. Confluent has published patterns for event-driven multi-agent systems that use event logs instead of mutable state, turning the coordination problem into an append-only log that preserves ordering.

The practical advice is straightforward: prefer event logs over mutable shared state. If agents must share state, use optimistic concurrency control with version checks -- the same pattern every database learned to implement thirty years ago. If you find yourself building a custom locking protocol for agent coordination, stop. You are reinventing Raft, and your version will have bugs that Raft does not.
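To make the version-check idea concrete, here is a sketch of optimistic concurrency against a shared record, using a toy in-memory store for illustration (a real system would use a database's conditional update). A write succeeds only if the version the agent read is still current; a losing writer gets an explicit conflict instead of silently clobbering the other agent's update.

```python
import threading


class VersionedStore:
    """Toy optimistic-concurrency store: every write carries the version it read."""

    def __init__(self):
        self._lock = threading.Lock()  # protects the store's internals, not the agents
        self._data: dict = {}
        self._version = 0

    def read(self) -> tuple[dict, int]:
        with self._lock:
            return dict(self._data), self._version

    def write(self, updates: dict, expected_version: int) -> bool:
        """Compare-and-set: fail loudly if someone else wrote since we read."""
        with self._lock:
            if self._version != expected_version:
                return False  # conflict -- caller must re-read and retry
            self._data.update(updates)
            self._version += 1
            return True


store = VersionedStore()
state, v = store.read()
assert store.write({"owner": "agent-a"}, expected_version=v)      # first writer wins
assert not store.write({"owner": "agent-b"}, expected_version=v)  # stale version rejected
```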

Budget Allocation Across Agent Pools

A single agent with a budget cap is manageable. We covered the scoping dimensions in our agent budgeting piece -- per-transaction, per-workflow, per-time-period. But multi-agent coordination introduces a new dimension: how do you allocate budget across a pool of agents that share objectives but operate independently?

The naive approach is a shared budget pool. Give five agents access to the same $100 daily allocation and let them draw from it as needed. This is the financial equivalent of shared mutable state, and it fails in the same way. Two agents check the remaining balance, both see $15, both initiate a $12 tool call, and now you are $9 over budget. This is not hypothetical. This is the exact concurrency problem that bank ATMs solved in the 1970s with reservation-based accounting.

The better approach is budget partitioning. Each agent gets an individual allocation. If an agent exhausts its partition, it stops or escalates -- it does not borrow from another agent's partition without explicit authorization. This is how cloud providers handle resource quotas across teams, and for the same reason: isolation prevents one runaway actor from starving the others.

Platforms like AgentPMT already implement per-agent budget controls with daily, weekly, and per-transaction caps. The coordination question is not whether to enforce limits -- it is how to structure them when agents operate as a team. The answer is usually hierarchical: a coordinator agent holds the team budget, partitions it across specialists, and can reallocate when priorities shift. The specialists operate within their partitions without needing to coordinate on spend. This mirrors how engineering managers allocate cloud budgets to teams: clear boundaries, explicit escalation for overages, no silent sharing.
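A sketch of hierarchical partitioning, as an illustrative model rather than any platform's API: the coordinator holds the team allocation, splits it into per-agent partitions, and every spend is checked against that agent's partition only. Exhaustion raises an escalation instead of silently borrowing.

```python
class BudgetExhausted(Exception):
    """Raised so the coordinator can reallocate or escalate explicitly."""


class TeamBudget:
    def __init__(self, total: float):
        self.total = total
        self.partitions: dict[str, float] = {}

    def partition(self, allocations: dict[str, float]) -> None:
        if sum(allocations.values()) > self.total:
            raise ValueError("partitions exceed the team budget")
        self.partitions = dict(allocations)

    def spend(self, agent_id: str, amount: float) -> None:
        """Each agent draws only from its own partition -- no silent sharing."""
        remaining = self.partitions.get(agent_id, 0.0)
        if amount > remaining:
            raise BudgetExhausted(f"{agent_id} needs {amount}, has {remaining}")
        self.partitions[agent_id] = remaining - amount

    def reallocate(self, from_agent: str, to_agent: str, amount: float) -> None:
        """Only the coordinator moves budget between partitions, explicitly."""
        self.spend(from_agent, amount)  # the debit is subject to the same check
        self.partitions[to_agent] = self.partitions.get(to_agent, 0.0) + amount


team = TeamBudget(total=100.0)
team.partition({"research": 40.0, "writing": 40.0, "review": 20.0})
team.spend("research", 12.0)   # fine
# team.spend("review", 25.0)   # would raise BudgetExhausted -> escalate, don't borrow
```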

Preventing Work Duplication

When multiple agents operate on overlapping domains, duplication is the default failure mode. Two research agents, both tasked with gathering competitive intelligence, will produce overlapping reports. Two coding agents, both responding to related bug reports, will make conflicting changes to the same file. Two outreach agents, both working a lead list, will email the same prospect twice.

Preventing duplication requires one of two strategies: domain partitioning or claim-based locking.

Domain partitioning assigns non-overlapping scopes to each agent. Agent A handles prospects with last names A-M. Agent B handles N-Z. This is crude but effective, and it eliminates coordination overhead entirely. The tradeoff is load imbalance and the rigidity of the partition scheme.

Claim-based locking is more flexible. Before an agent begins work on a task, it claims it in a shared registry. Other agents check the registry before starting, and skip tasks that are already claimed. This is a checkout protocol, and it works well when the registry has strong consistency guarantees. It works poorly when agents cache the registry locally and act on stale data -- which, again, is the distributed systems consistency problem wearing a different hat.
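A sketch of a claim registry with atomic check-and-claim semantics. The in-memory dict behind a lock stands in for a strongly consistent store; in production this would be a database row with a unique constraint or a conditional write, not a Python dict.

```python
import threading
import time


class ClaimRegistry:
    """Atomic check-and-claim; claims expire so a crashed agent cannot hold a task forever."""

    def __init__(self, ttl_seconds: float = 600):
        self._lock = threading.Lock()
        self._claims: dict[str, tuple[str, float]] = {}  # task_id -> (agent_id, claimed_at)
        self._ttl = ttl_seconds

    def try_claim(self, task_id: str, agent_id: str) -> bool:
        with self._lock:
            holder = self._claims.get(task_id)
            if holder and time.time() - holder[1] < self._ttl:
                return False  # someone else got there first
            self._claims[task_id] = (agent_id, time.time())
            return True

    def release(self, task_id: str, agent_id: str) -> None:
        with self._lock:
            if self._claims.get(task_id, (None, 0.0))[0] == agent_id:
                del self._claims[task_id]


registry = ClaimRegistry()
if registry.try_claim("lead-42", "outreach-agent-1"):
    pass  # do the work, then registry.release("lead-42", "outreach-agent-1")
```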

DynamicMCP's architecture points toward a third approach: centralizing task dispatch through a coordinator that maintains assignment state. Instead of agents self-selecting and hoping they do not collide, a coordinator maintains the work queue, assigns tasks, and tracks completion. The agents are workers; the coordinator is the scheduler. This is the pattern behind every reliable job queue from Celery to SQS, and it works for agent coordination for the same reasons it works for task queues: single-writer assignment eliminates the duplication problem at the source.
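A generic sketch of coordinator-owned dispatch (not DynamicMCP's implementation): one scheduler assigns tasks and records assignments, so duplication cannot happen because only one writer touches assignment state. The in-process queue.Queue stands in for a real job queue such as SQS or a Celery broker.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Single writer for assignment state: agents never self-select work."""
    work_queue: "queue.Queue[str]" = field(default_factory=queue.Queue)
    assignments: dict[str, str] = field(default_factory=dict)  # task_id -> agent_id
    completed: set = field(default_factory=set)

    def submit(self, task_id: str) -> None:
        self.work_queue.put(task_id)

    def request_work(self, agent_id: str) -> str | None:
        """Agents pull work; the coordinator records who holds what."""
        try:
            task_id = self.work_queue.get_nowait()
        except queue.Empty:
            return None
        self.assignments[task_id] = agent_id
        return task_id

    def report_done(self, task_id: str, agent_id: str) -> None:
        if self.assignments.get(task_id) == agent_id:
            self.completed.add(task_id)


coordinator = Coordinator()
coordinator.submit("bug-101")
coordinator.submit("bug-102")
print(coordinator.request_work("coder-a"))  # bug-101
print(coordinator.request_work("coder-b"))  # bug-102 -- never the same task twice
```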

What This Means for Teams Building Multi-Agent Systems

Multi-agent coordination is not a framework problem — it is a distributed systems problem wearing an AI hat. The teams that recognize this early will avoid the most expensive mistakes: shared mutable state without concurrency controls, budget pools without partitioning, and delegation protocols without completion signals.

AgentPMT's architecture addresses the coordination layer directly. DynamicMCP centralizes tool discovery and access, so adding or removing tools from the available set is a single operation that takes effect across all connected agents. The multi-budget system supports hierarchical allocation — organization-level caps, team-level partitions, and per-agent spending limits — eliminating the shared-budget-pool concurrency problem. The mobile app gives coordinators real-time visibility into agent activity across the entire fleet, with the ability to pause individual agents or adjust budgets without redeployment.

The window for getting multi-agent coordination right is while deployments are still small enough to restructure without a migration project. Build the coordination layer now, and every agent you add operates within a framework that handles the distributed systems problems your team should not have to solve from scratch.

What to Watch

Three developments will shape multi-agent coordination over the next year.

Protocol convergence is accelerating. Google's A2A protocol moved to the Linux Foundation and shipped version 0.3 with gRPC support and signed Agent Cards. Microsoft's Agent Framework is targeting stable APIs in early 2026. The window where every team had to build custom agent-to-agent communication is closing. Teams that adopt standardized protocols now will avoid a painful migration later when the standards solidify. The important thing to watch is not which protocol wins, but whether capability discovery -- knowing what another agent can actually do -- becomes standardized alongside messaging.

Budget and resource coordination will become a first-class framework concern. Today, most multi-agent frameworks treat budget as an afterthought -- if they address it at all. As agent teams grow from two or three to dozens, budget partitioning, spend attribution per agent, and cross-agent resource quotas will move from "nice to have" to "required for production." The frameworks that build this in will win adoption among teams that actually ship to production.

Event-driven coordination will replace shared-state coordination. The same transition happened in microservices: teams started with shared databases, discovered the pain, and moved to event-driven architectures. Multi-agent systems will follow the same arc. Expect to see more frameworks adopt event logs, message buses, and append-only shared structures instead of mutable blackboards. The distributed systems playbook is clear on this, and the agent ecosystem will follow it -- the only question is how many teams learn the lesson the hard way first.

Build on infrastructure that already solves the coordination problems, and focus your engineering on what your agents actually do, not on how they talk to each other.

AgentPMT provides the coordination infrastructure — centralized tool governance, hierarchical budget controls, and structured audit trails across every connected agent. See how it works.

Key Takeaways

  • Agent-to-agent contracts require four explicit components -- capability boundaries, input/output schemas, delegation protocols, and completion signals -- because unlike tool contracts, both parties are autonomous and non-deterministic. Skipping any of the four produces coordination failures that surface as duplicated work, lost context, or agents waiting for handoffs that never arrive.
  • Shared mutable state between agents is the same distributed consistency problem that Raft, Paxos, and every database's concurrency control system exists to solve. Prefer event logs and append-only structures over mutable shared state, and use optimistic concurrency control with version checks when shared state is unavoidable. If you are building a custom locking protocol, you are making a mistake that has already been made and corrected many times.
  • Budget allocation across agent pools requires partitioning, not sharing. A shared budget pool between multiple agents has the same concurrency failure modes as a shared bank account with multiple cardholders and no coordination. Partition budgets per agent, let a coordinator handle reallocation, and use per-agent spending caps as the isolation boundary.
