Designing Deterministic Tools for Non-Deterministic Agents

By Stephanie Goodman · February 12, 2026

The most expensive assumption in agent engineering is that the model will be careful. Here is why tool design -- not prompt engineering -- is the primary reliability lever for agent systems.

The most expensive assumption in agent engineering is that the model will be careful.

A team ships a tool that accepts a freeform string for a dollar amount. The schema says type: string. The model sends "$50.00" one time, "50" the next, and "fifty dollars" the third. Three calls, three formats, one confused downstream system. Nobody finds it until the reconciliation report does not reconcile. The fix takes four hours. The root cause was not the model. It was the tool designer who treated the input schema like a suggestion instead of a contract.

This happens constantly, and it will keep happening until tool builders accept a simple truth: in a system where the caller is non-deterministic, the callee must be ruthlessly deterministic. The boundary between agent and tool is where you convert chaos into order -- or where you let chaos through and pay for it later. It is the design philosophy behind AgentPMT's DynamicMCP — every tool in the marketplace defines strict input and output schemas, and the gateway validates requests at the boundary before any handler executes.

This article is for engineers who build and maintain the tools that agents call. Not the operators running agent fleets (we covered that in our workflow reliability piece), and not the prompt engineers tuning model behavior. This is about the tool interface itself -- the schema, the validation, the defaults, and the contract -- as the primary lever for system reliability.

Why Tool Design Is the Actual Reliability Lever

The industry has spent enormous energy on prompt engineering as a reliability strategy. Write better instructions. Add more examples. Refine the system message. This is not wrong, but it is brittle. A prompt is a request. A tool schema is a constraint. Requests can be misunderstood. Constraints cannot.

Consider the difference. You can write a prompt that says "always pass amounts as integers in cents." The model will comply most of the time. Or you can define the schema as { "amount_cents": { "type": "integer", "minimum": 1, "maximum": 100000 } } and reject anything that does not match. The first approach requires the model to remember and comply. The second approach makes non-compliance structurally impossible.
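
To make the contrast concrete, here is a minimal sketch in TypeScript using Ajv, a common JSON Schema validator. The schema and inputs are illustrative, not taken from any particular tool:

import Ajv from "ajv";
// The same constraint expressed as a schema rather than a prompt instruction.
const amountSchema = {
  type: "object",
  properties: { amount_cents: { type: "integer", minimum: 1, maximum: 100000 } },
  required: ["amount_cents"],
  additionalProperties: false,
};
const ajv = new Ajv(); // no type coercion by default: a string is never silently parsed
const validateAmount = ajv.compile(amountSchema);
validateAmount({ amount_cents: 5000 });     // returns true: well-formed call passes
validateAmount({ amount_cents: "$50.00" }); // returns false: wrong type is rejected, not coerced
validateAmount({ amount: 5000 });           // returns false: unknown field, required field missing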

This is not a new idea. Bertrand Meyer called it Design by Contract in the 1980s: preconditions that the caller must satisfy, postconditions that the function guarantees, invariants that hold throughout. The tool boundary in an agent system is exactly the right place to apply this thinking -- except your caller is a probabilistic language model that has never read your documentation and will cheerfully hallucinate parameter values if the schema lets it.

The implication is significant. Tool design determines system reliability more than prompt engineering does. A well-designed tool makes the wrong call impossible. A poorly-designed tool makes the wrong call inevitable, then blames the model.

The Philosophy: Make Carelessness Impossible

There is a saying in safety engineering: do not tell people to be careful; redesign the system so that carelessness does not matter. The same principle applies to agent tools with even more force, because your "person" is a statistical model that does not understand consequences.

Postel's Law -- "be conservative in what you send, be liberal in what you accept" -- was written for an internet of cooperating human-operated systems. It has been widely criticized even in that context. Martin Thomson and David Schinazi argued in an IETF draft that liberal acceptance actually degrades robustness by allowing ambiguity to propagate. For agent-facing tools, Postel's Law is actively dangerous. A liberal parser that accepts "true", "yes", "1", and "Y" as boolean values is not being helpful. It is creating a combinatorial explosion of possible inputs that will eventually produce a combination nobody tested.

The correct posture for agent tools is the opposite of Postel's Law: be strict in what you accept and explicit in what you return. If the input does not exactly match the schema, reject it with a clear, structured error. The model will fix it on the next call. That retry is dramatically cheaper than the debugging session that follows a silently accepted bad input.

Three design principles follow from this:

No implicit coercion. If the schema says integer, do not silently parse a string. If the schema says ISO 8601, do not accept "next Tuesday." Every silent coercion is a decision the tool is making on behalf of the agent, and the agent does not know it happened.

No optional fields where you actually need a value. If your tool requires a currency code to function correctly, do not make it optional with a default of "USD." Make it required. A model that forgets to pass a currency code should get a validation error, not a silent assumption that works in the US and fails everywhere else.

No open-ended string fields for structured data. If the field is an email address, validate it as an email address. If it is a country code, use an enum. If it is a monetary amount, use an integer in the smallest unit. Every string field that could be an enum or a constrained type is a place where ambiguity enters the system.
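
Taken together, these principles map directly onto schema keywords. A hypothetical fragment -- the field names and enum values are illustrative:

// Illustrative fragment: each constrained field below enforces one of the three principles.
const chargeCustomerInput = {
  type: "object",
  properties: {
    amount_cents: { type: "integer", minimum: 1 },                // integer, never a coerced string
    currency: { type: "string", enum: ["USD", "EUR", "GBP"] },    // required below, not defaulted to "USD"
    receipt_email: { type: "string", format: "email" },           // constrained format, not a freeform string
  },
  required: ["amount_cents", "currency", "receipt_email"],
  additionalProperties: false,
};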

What a Production-Grade Tool Contract Looks Like

A tool contract has four layers. Most tool builders ship the first one and skip the rest.

Layer 1: Input schema. This is where most people stop. Define every parameter with the tightest type possible. Use JSON Schema's additionalProperties: false so the model cannot invent fields. Use enums for closed sets. Use minimum and maximum for numeric bounds. Use pattern for string formats. The MCP specification uses JSON Schema for tool input definitions -- this is not extra work, it is the standard.

Layer 2: Output schema. Equally important, often ignored. Define the shape of your response so that downstream consumers can validate it. The MCP specification added Tool Output Schemas in its June 2025 revision precisely because tool consumers need to know what to expect. If your tool returns different shapes depending on internal state, you have created a non-deterministic interface on a tool that should be deterministic.
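
For instance, a customer-lookup tool might publish an output schema like the hypothetical one below. The field names are illustrative; the point is that the response shape is part of the contract:

// Hypothetical output schema for a customer-lookup tool: the response shape is part
// of the published contract, not an implementation detail that can drift per call.
const lookupCustomerOutput = {
  type: "object",
  properties: {
    customer_id: { type: "string", format: "uuid" },
    tier: { type: "string", enum: ["free", "pro", "enterprise"] },
    created_at: { type: "string", format: "date-time" },
  },
  required: ["customer_id", "tier", "created_at"],
  additionalProperties: false,
};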

Layer 3: Error contract. Define a finite set of error codes and what each one means. The model needs to know whether an error is retryable (rate limit, timeout) or terminal (invalid input, insufficient permissions). If your tool returns generic 500 errors with human-readable messages, the model will guess what to do next. It will guess wrong.

Here is the difference between a bad error and a good one:

// Bad: the model has to parse English to decide what to do
{ "error": "Something went wrong, please try again later" }

// Good: the model can branch on structured fields
{
  "error_code": "RATE_LIMITED",
  "retryable": true,
  "retry_after_seconds": 30,
  "message": "Rate limit exceeded for this endpoint"
}

Layer 4: Side-effect declaration. Does this tool read or write? Can it be safely retried? Does it require an idempotency key? These are not implementation details -- they are contract terms that the orchestration layer needs to make correct decisions. Stripe got this right years ago: every mutating endpoint accepts an Idempotency-Key header, and the server stores the result so that retries return the same response.
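
A minimal sketch of that pattern, assuming a caller-supplied idempotency_key; an in-memory map stands in for the durable, per-caller store a real tool would use:

// Sketch of replay-safe execution keyed on the caller-supplied idempotency key.
const completedCalls = new Map<string, unknown>();
async function executeOnce(
  idempotencyKey: string,
  handler: () => Promise<unknown>
): Promise<unknown> {
  if (completedCalls.has(idempotencyKey)) {
    // A retry with the same key returns the stored result; the side effect runs once.
    return completedCalls.get(idempotencyKey);
  }
  const result = await handler();
  completedCalls.set(idempotencyKey, result);
  return result;
}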

Good Interfaces vs. Bad Interfaces: Concrete Examples

Abstract principles are helpful. Concrete examples are more helpful.

Example 1: Sending an email

Bad interface:

{
  "name": "send_email",
  "parameters": {
    "type": "object",
    "properties": {
      "to": { "type": "string" },
      "subject": { "type": "string" },
      "body": { "type": "string" }
    }
  }
}

This accepts "to": "john" (not an email), "to": "john@example.com, jane@example.com" (two addresses in one string), and "to": "" (empty string). Every one of those will produce a different failure mode downstream, none of which the model can anticipate.

Good interface:

{
  "name": "send_email",
  "parameters": {
    "type": "object",
    "properties": {
      "to": {
        "type": "array",
        "items": { "type": "string", "format": "email" },
        "minItems": 1,
        "maxItems": 10
      },
      "subject": { "type": "string", "minLength": 1, "maxLength": 200 },
      "body_text": { "type": "string", "minLength": 1, "maxLength": 50000 },
      "idempotency_key": { "type": "string", "format": "uuid" }
    },
    "required": ["to", "subject", "body_text", "idempotency_key"],
    "additionalProperties": false
  }
}

Recipients are an array with bounds. The subject has a length constraint. The idempotency key is required because email is a side effect. additionalProperties: false prevents the model from inventing a "priority": "urgent" field that gets silently ignored.

Example 2: Creating a database record

Bad interface:

{
  "name": "create_record",
  "parameters": {
    "type": "object",
    "properties": {
      "data": { "type": "object" }
    }
  }
}

This accepts literally anything. The model could pass {"name": "Alice"} or {"x": [1,2,3], "y": null, "z": {"a": {"b": "c"}}}. You have moved all validation into the tool's implementation, where every edge case becomes a runtime surprise.

Good interface:

{
  "name": "create_customer_record",
  "parameters": {
    "type": "object",
    "properties": {
      "customer_name": { "type": "string", "minLength": 1, "maxLength": 200 },
      "email": { "type": "string", "format": "email" },
      "tier": { "type": "string", "enum": ["free", "pro", "enterprise"] },
      "idempotency_key": { "type": "string", "format": "uuid" }
    },
    "required": ["customer_name", "email", "tier", "idempotency_key"],
    "additionalProperties": false
  }
}

The tool name says what kind of record. Every field is typed and constrained. The enum prevents the model from inventing a "premium" tier that does not exist. The tool is narrow on purpose -- one tool per record type, not one tool for all records.

Validation at the Boundary, Not Inside the Handler

Where you validate matters as much as what you validate.

The worst pattern is "accept everything, validate deep inside the handler." By the time your handler discovers that the currency code is invalid, you have already deserialized the request, opened a database connection, started a transaction, and maybe acquired a lock. Now you have to unwind all of that, return an error, and hope the model figures out what went wrong.

The correct pattern is to validate at the boundary -- before the handler runs, before any resources are acquired, before any side effects begin. JSON Schema validation is deterministic, fast, and well-supported in every language. Run it first. If the input fails validation, return a structured error immediately. The handler never sees bad data.
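
Here is a sketch of that ordering in TypeScript, again using Ajv. The gateway function and error shape are illustrative -- this is not DynamicMCP's implementation -- but the structured rejection mirrors the error contract shown earlier:

import Ajv, { type Schema } from "ajv";
import addFormats from "ajv-formats";
const ajv = new Ajv({ allErrors: true });
addFormats(ajv); // enables "format": "email", "uuid", "date-time", and similar checks
// Gateway-style dispatch: the handler only runs on schema-conformant input.
async function callTool(
  inputSchema: Schema,
  input: unknown,
  handler: (validInput: unknown) => Promise<unknown>
): Promise<unknown> {
  const validate = ajv.compile(inputSchema);
  if (!validate(input)) {
    // Structured rejection before any resource is acquired or side effect begins.
    return {
      error_code: "INVALID_INPUT",
      retryable: false,
      message: ajv.errorsText(validate.errors),
    };
  }
  return handler(input);
}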

This is the same principle behind API gateways, request validation middleware, and input sanitization -- except in the agent context, it is even more important because your caller will send you inputs you never imagined. A human developer reads the docs and sends reasonable values. A language model reads the schema (or a summary of the schema) and sends whatever is statistically likely given its training data. That is a much wider distribution of inputs than any human caller would produce.

DynamicMCP, the MCP server we built at AgentPMT, enforces schema validation at the gateway layer before any tool handler executes. Every tool in the marketplace defines strict input and output schemas. When an agent sends a malformed request, it gets a structured rejection before the tool ever runs. This is not a feature we added for convenience -- it is the architectural foundation that makes a multi-tenant tool marketplace viable. Without boundary validation, one misbehaving agent could send garbage to every tool in the system.

Schema Design Principles That Actually Work

After building and reviewing hundreds of tool schemas, we have found a few principles that consistently produce more reliable agent behavior.

Name tools by outcome, not by mechanism. create_invoice is better than write_record. check_domain_availability is better than dns_lookup. The model selects tools partly based on the name. A descriptive name reduces the chance that the model picks the wrong tool or passes the wrong arguments because it misunderstood what the tool does.

Normalize inputs aggressively. If your tool accepts a date, require ISO 8601. If it accepts a phone number, require E.164. If it accepts a country, require ISO 3166-1 alpha-2. Do not accept "US", "USA", "United States", and "us" and normalize internally. Require the canonical form and reject everything else. The retry is cheap. The downstream bug from a normalization edge case is not.
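
In schema terms, that means constraining each field to its canonical representation. A hypothetical fragment:

// Hypothetical fragment: accept only the canonical representation, reject everything else.
const shipmentFields = {
  ship_date: { type: "string", format: "date" },                      // ISO 8601 calendar date
  contact_phone: { type: "string", pattern: "^\\+[1-9]\\d{1,14}$" },  // E.164
  destination_country: { type: "string", pattern: "^[A-Z]{2}$" },     // ISO 3166-1 alpha-2
};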

Use safe defaults that limit blast radius. If your tool fetches records, default to a small page size (10, not 1000). If your tool sends notifications, default to draft mode, not live. If your tool modifies data, default to dry-run. The model should have to explicitly opt into the dangerous mode, not accidentally arrive there by omitting a parameter.

Make every side effect idempotent. This bears repeating because it is the single most consequential design decision for agent-facing tools. Agents retry. Networks retry. Orchestrators retry. If your tool creates a resource on every call, you will get duplicate resources. Require an idempotency key. Store the result of the first successful execution. Return the stored result for subsequent calls with the same key. Stripe's API has worked this way since 2015. It is not novel; it is just not done often enough.

Keep the schema surface small. A tool with three required parameters and zero optional ones is easier for a model to call correctly than a tool with three required parameters and fifteen optional ones. If you need the complexity, split it into multiple tools. A model choosing between send_draft_email and send_live_email will make fewer mistakes than a model that has to remember to pass "mode": "draft" to a single send_email tool.

How This Changes the Economics

Bad tool design has a direct cost. Every malformed input that gets accepted and fails downstream costs you a retry cycle -- more tokens for the model to process the error, more time, more tool calls. In a system where agents make hundreds of calls per hour, sloppy schemas compound into real money.

AgentPMT's budget controls operate at the tool-call level precisely because tool calls are the unit of spend. When a tool rejects bad input at the boundary, the agent corrects and retries with a single additional model call. When a tool accepts bad input and fails three layers deep, the agent may retry the entire operation multiple times before succeeding -- or give up and escalate to a human. The difference between these two paths can be a 5x cost difference per task.

This is why we treat tool quality as a marketplace curation problem, not just a listing problem. A tool with a strict schema and structured errors costs less to use than a tool with a loose schema and generic errors, even if they do the same thing. The total cost includes the retries and failures, not just the sticker price.

What This Means for Tool Builders

Deterministic tool design is not a nice-to-have quality attribute. It is the primary reliability lever for any agent system, and it directly affects the economics of every workflow that calls your tool. A strict schema with structured errors costs less to use than a loose schema with generic errors — even at the same sticker price — because the total cost includes retries, debugging time, and silent failures.

AgentPMT's marketplace enforces this at the infrastructure level. DynamicMCP validates every request against the tool's published schema before the handler executes — malformed inputs get rejected with structured errors at the gateway, not deep inside application logic. Budget controls operate at the tool-call level, so cost attribution is precise and retry costs are visible. The audit trail captures every invocation with parameters, cost, and outcome, giving both tool builders and tool consumers the data they need to identify and fix reliability issues.

For tool builders considering the AgentPMT marketplace, the standard is strict schemas, typed errors, and idempotent side effects. Tools that meet this standard perform better, cost less to use, and attract more usage — because in a marketplace where agents choose tools programmatically, reliability is the feature that wins.

What to Watch

MCP specification evolution. The June 2025 MCP spec revision added tool output schemas, and the November 2025 revision added the Tasks primitive for async operations. As MCP matures, expect stricter requirements around error contracts and side-effect declarations. Tool builders who adopt these patterns now will be ahead of the specification, not scrambling to catch up with it.

Schema-first toolchains. Google released a Go JSON Schema package in January 2026 specifically to support MCP tool definitions and structured LLM interactions. The ecosystem is converging on JSON Schema as the standard for defining tool interfaces. Investment in schema quality today compounds as the tooling improves.

Validation as a service layer. As agent tool ecosystems grow, expect dedicated validation middleware to emerge -- gateway layers that enforce schemas, rate limits, and idempotency before requests reach tool handlers. This is the direction infrastructure needs to go, and it is the architectural bet we made with DynamicMCP.

The Argument in One Sentence

If you want deterministic outcomes from a non-deterministic caller, stop hoping the caller behaves and start making the interface refuse to cooperate with anything ambiguous.

The tools are the boundary. The schema is the contract. The validation is the enforcement. Get those right, and the model's non-determinism stops being your problem.

AgentPMT's DynamicMCP enforces strict tool contracts at the gateway layer — schema validation, structured errors, and budget controls on every call. Build tools that agents can rely on. List your tool on the marketplace.

Key Takeaways

  • Tool design is the primary reliability lever in agent systems -- a strict schema prevents more failures than a careful prompt, because constraints are structural and prompts are suggestions.
  • Validate at the boundary, not inside the handler. Reject malformed input before any resources are acquired or side effects begin. The retry from a clean rejection is orders of magnitude cheaper than debugging a silent failure.
  • Make every side-effecting tool idempotent by default, keep the schema surface small, and use safe defaults that limit blast radius. Design for the assumption that your caller will be creative in ways you did not anticipate.
