Every developer who has wrapped an API in a Python function and called it "done" has had the same experience six weeks later: the function works, but nobody else can use it. The parameters are named x and data. The errors come back as raw stack traces. There is no documentation beyond a comment that says # TODO: add docs. And when an AI agent tries to call it, the agent either hallucinates the input format or silently passes garbage and gets garbage back.
The Model Context Protocol has made it remarkably easy to expose a function as a tool. The Python SDK lets you add a @mcp.tool() decorator and get a working MCP server in under ten lines of code. The TypeScript SDK gives you Zod-validated schemas with server.registerTool(). But "working" and "production-grade" are separated by a canyon of missing metadata, untested edge cases, and undocumented side effects. MCP made the wiring trivial. The hard part was always the contract.
This article is about crossing that canyon -- turning a script that works on your laptop into something an autonomous agent can discover, understand, invoke correctly, handle errors from, and pay for. It is the kind of problem that platforms like AgentPMT are designed to solve at scale through their DynamicMCP marketplace, where tools must meet strict metadata and packaging standards before agents can discover and use them. The difference between a script and a product comes down to discipline.
What Separates a Script from an Agent-Grade Tool
A script works when a human runs it. A tool works when software runs it unsupervised. The distinction sounds pedantic until you watch an agent retry a failed call seventeen times because the error response was a string that said "something went wrong."
Agent-grade tools share a specific set of properties that scripts almost never have, and every one of these properties exists to eliminate a category of ambiguity that agents cannot resolve on their own.
Schema completeness. Every input parameter needs a type, a description, and constraints. The MCP specification uses JSON Schema for this, and both official SDKs generate schemas from type annotations. In Python, FastMCP infers the schema from function signatures and docstrings. In TypeScript, you define input schemas with Zod validators that enforce types at runtime. But auto-generation is a starting point, not a finish line. A parameter typed as string with no description is technically valid and practically useless. An agent looking at a tool called process with a parameter called input of type string has no basis for deciding what to pass. The description field is not documentation for humans -- it is the interface contract for the agent. As Merge's analysis of MCP tool descriptions puts it, descriptions answer "What does this do?" while schemas answer "How do I use it?" Both need to be precise.
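As a concrete sketch of what a complete parameter contract can look like in the Python SDK: FastMCP builds the JSON Schema from the function signature, and Pydantic Field annotations are one way to attach per-parameter descriptions and constraints. The server name, tool, and constraint values below are illustrative, not a prescribed pattern.

```python
from typing import Annotated

from pydantic import Field
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("contacts")  # illustrative server name

@mcp.tool()
async def search_contacts(
    query: Annotated[str, Field(
        description="Free-text search over contact names and email addresses",
        min_length=1,
    )],
    limit: Annotated[int, Field(
        description="Maximum number of results to return (1-100)",
        ge=1, le=100,
    )] = 25,
) -> str:
    """Search the contact directory and return matching contacts as a JSON string."""
    ...  # query the backing store here
```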
Error taxonomy. A production tool needs to distinguish between errors the caller caused (bad input), errors that are transient (timeout, rate limit), and errors that are permanent (resource not found, permission denied). This is not about HTTP status codes -- MCP tools return structured content, not HTTP responses. The distinction matters because agents use error categories to decide what to do next. A transient error means retry. A caller error means fix the input. A permanent error means stop trying. If your tool returns the same generic error string for all three cases, the agent will either retry forever or give up immediately. Neither is correct.
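The MCP specification does not mandate a particular error shape, so the three-way split has to live in your tool's own conventions. Here is one minimal sketch of an error envelope an agent can branch on; the category names and fields are illustrative, not part of the protocol.

```python
import json

def tool_error(category: str, message: str, retriable: bool) -> str:
    """Build an error payload the agent can act on. The `category` field is an
    illustrative convention: "caller_error", "transient", or "permanent"."""
    return json.dumps({
        "error": {
            "category": category,
            "message": message,
            "retriable": retriable,
        }
    })

# The three cases, spelled out:
# tool_error("caller_error", "latitude must be between -90 and 90", retriable=False)
# tool_error("transient", "upstream weather API timed out after 10s", retriable=True)
# tool_error("permanent", "station KXYZ does not exist", retriable=False)
```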
Idempotency. When an agent retries a tool call -- and it will, because networks fail and timeouts happen -- the second call should produce the same result as the first, not a duplicate side effect. If your tool creates a record, it needs an idempotency key so that a retry creates one record, not two. If your tool sends an email, it needs to check whether that email was already sent. This is not optional for tools that write data. The retry-safety of a tool is part of its contract, and agents need to know whether retries are safe before they attempt them.
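A minimal sketch of retry-safe creation, with a hypothetical in-memory store standing in for whatever durable storage the real tool would use:

```python
# `seen_keys` is a stand-in for durable storage (hypothetical); in production
# the key-to-record mapping must survive restarts.
seen_keys: dict[str, dict] = {}

async def create_invoice(customer_id: str, amount_usd: float, idempotency_key: str) -> dict:
    """Create an invoice. A retry with the same idempotency_key returns the
    original invoice instead of creating a duplicate."""
    if idempotency_key in seen_keys:
        return seen_keys[idempotency_key]   # replay: no second side effect
    invoice = {"id": f"inv_{idempotency_key}", "customer": customer_id, "amount_usd": amount_usd}
    seen_keys[idempotency_key] = invoice    # record before returning
    return invoice
```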
Pricing metadata. If a tool costs money to call -- and most useful tools do, whether through API fees, compute costs, or transaction charges -- the agent needs to know the pricing unit and the price before calling it. Is it priced per request? Per minute of compute? Per megabyte of data processed? An agent operating under a budget cannot make rational decisions about tool selection without this information. AgentPMT's budget controls address this directly -- operators set per-agent spending limits and the platform enforces them at the tool-call level, so agents can only spend what they are authorized to spend. A small manifest covers the tool side:
```json
{
  "pricing_unit": "request",
  "price_usd": 0.02,
  "side_effects": "none",
  "data_shared": ["email"],
  "retriable_errors": ["rate_limited", "timeout"]
}
```

This is not part of the MCP specification today, but it is the kind of metadata that separates a tool from a product. If you cannot state what your tool costs, what data it shares, and which errors are safe to retry, the tool is not ready for autonomous use.
Examples. Include at least one example input-output pair for every tool. Agents use examples to calibrate how to construct calls. Humans use examples to verify they understand the interface. Examples are the cheapest form of testing -- if your example does not produce the expected output, you have a bug. If you do not have examples, you have no way to detect drift between what the tool claims to do and what it actually does.
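One low-cost convention is to put the example pair directly in the docstring, since FastMCP surfaces the docstring as the tool description agents read. The values below are illustrative:

```python
async def get_forecast(latitude: float, longitude: float) -> str:
    """Get weather forecast for a location.

    Example:
        Input:  latitude=47.61, longitude=-122.33
        Output: "Tonight: light rain, low 48F. Tomorrow: cloudy, high 56F."
    """
    ...  # implementation elided in this sketch
```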
Building an MCP Tool with the SDKs
The two official MCP SDKs -- Python and TypeScript -- take different approaches to the same problem, but both aim to minimize the distance between "I have a function" and "I have a tool with a schema."
In Python, the FastMCP class uses type hints and docstrings to generate tool definitions automatically. A function like this:
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")

@mcp.tool()
async def get_forecast(latitude: float, longitude: float) -> str:
    """Get weather forecast for a location.

    Args:
        latitude: Latitude of the location
        longitude: Longitude of the location
    """
    ...  # fetch data from a weather API and format it here
```

...becomes a tool with a JSON Schema that specifies latitude and longitude as required number parameters, with descriptions drawn from the docstring. The decorator handles registration, and mcp.run(transport="stdio") starts the server. The Python SDK requires Python 3.10+ and supports structured output validation through Pydantic models, TypedDicts, and dataclasses.
In TypeScript, schemas are explicit. You use Zod to define input validation:
```typescript
server.registerTool(
  "get_forecast",
  {
    description: "Get weather forecast for a location",
    inputSchema: {
      latitude: z.number().min(-90).max(90)
        .describe("Latitude of the location"),
      longitude: z.number().min(-180).max(180)
        .describe("Longitude of the location"),
    },
  },
  async ({ latitude, longitude }) => {
    // implementation
  }
);
```

The Zod approach is more verbose but gives you runtime validation with constraints -- .min(-90).max(90) is a constraint that the Python version lacks unless you add custom validation. The TypeScript SDK uses Zod v4 as a required peer dependency, and the .describe() calls on each field are what agents actually see when they inspect the tool schema.
Both SDKs support stdio and HTTP transports. For local development and testing, stdio is simpler. For production deployment and remote access, HTTP (specifically the Streamable HTTP transport) is the direction the ecosystem is moving. One critical note for stdio servers that both SDKs emphasize: never write to stdout. print() in Python and console.log() in JavaScript will corrupt the JSON-RPC message stream and break your server. Use logging (Python) or console.error (JavaScript) for debug output.
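In Python, that looks like routing every diagnostic through the logging module configured to write to stderr; the logger name here is arbitrary:

```python
import logging
import sys

# For stdio servers, stdout carries the JSON-RPC stream; send diagnostics to stderr.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("weather-server")

logger.info("starting up")   # safe: goes to stderr
# print("starting up")       # unsafe: would corrupt the JSON-RPC stream on stdout
```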
The SDKs handle the protocol plumbing. What they do not handle is the quality of your tool's contract. A tool with a perfect schema and terrible error messages is still a bad tool.
Writing Tool Descriptions That Agents Can Use
This is where most developers underinvest, and it is where the difference between "works in demo" and "works in production" lives.
An agent deciding which tool to call looks at three things: the tool name, the tool description, and the input schema. That is the entire interface. There is no README. There is no Slack channel to ask questions in. There is no "just look at the source code." The tool description is the documentation, and it needs to work for a reader that takes everything literally.
Docker's MCP server team, after building and reviewing over 100 MCP servers for their MCP Catalog, identified a consistent pattern: developers think about the end user, but the actual consumer of the tool interface is the agent. As Ivan Pedrazas, Docker's Principal Software Engineer, wrote in their best practices guide, "When you're building an MCP server, you're not interfacing directly with users. You're building for the agent that acts on their behalf."
This has concrete implications for how you write descriptions and handle errors.
Tool names should be verb-noun pairs. get_forecast, create_invoice, search_contacts. Not process, run, handle. The name is the first filter an agent uses to decide relevance.
Descriptions should front-load the critical information. Agents may truncate long descriptions, so the first sentence needs to carry the weight. "Search for available flights between two airports on a specific date" is better than "This tool provides comprehensive flight search capabilities across multiple airline partners with support for various filtering options." The first one tells the agent what it does. The second one tells the agent nothing.
Be explicit about limitations. If a tool only works with US locations, say so in the description. If it returns a maximum of 100 results, say so. If it requires a prior call to another tool, specify the dependency: "Create a new contact. Required: call discover_required_fields('Contact') first." Agents cannot infer limitations. They discover them by failing. Every limitation you document is a failure you prevent.
Error messages should help the agent decide what to do next, not just describe what went wrong. "Access denied" tells the agent nothing actionable. "Access denied: the API_TOKEN environment variable is not configured or has expired. Reconfigure the MCP server with a valid token." tells the agent (and the human monitoring it) exactly what to fix. Docker's team found this to be one of the most consistent issues across MCP server submissions -- error messages designed for humans rather than for the software that actually receives them.
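In the Python SDK, an exception raised inside a tool handler is surfaced to the caller as an error result, so the exception message is the error message the agent sees. A short sketch of the difference, using a hypothetical token check:

```python
import os

def require_api_token() -> str:
    """Hypothetical helper a tool might call before hitting an upstream API."""
    token = os.environ.get("API_TOKEN")
    if not token:
        # Unhelpful: raise RuntimeError("Access denied")
        # Actionable: name the cause and the fix.
        raise RuntimeError(
            "Access denied: the API_TOKEN environment variable is not configured "
            "or has expired. Reconfigure the MCP server with a valid token."
        )
    return token
```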
Testing MCP Tools
The MCP project ships an Inspector tool that serves as the primary testing interface for MCP servers during development. It runs directly through npx with no installation required:
```bash
npx @modelcontextprotocol/inspector <command>
```
The Inspector provides a browser-based UI where you can connect to your server, list available tools, inspect schemas, and execute tool calls with custom inputs. It shows the raw JSON-RPC messages flowing between client and server, which is invaluable for debugging schema mismatches and transport issues.
But functional testing -- "does the tool return the right output for this input?" -- is the easy part. The harder testing categories are the ones most developers skip.
Contract testing. Your tool's schema is a contract. When you update your tool, does the schema change? If it does, do existing callers break? Contract tests verify that the schema remains stable across versions, that required fields stay required, that return types do not change, and that error codes remain consistent. Run these tests in CI against every commit. If a schema change is intentional, the contract test failure forces you to acknowledge it explicitly rather than shipping a breaking change by accident.
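A contract test can be as simple as comparing the schemas your server generates today against a snapshot checked into the repository. A sketch, assuming a hypothetical weather_server module that exposes its FastMCP instance as mcp and that the instance's async list_tools() method is available:

```python
# Contract test sketch: weather_server and tool_schemas.json are hypothetical.
import asyncio
import json
from pathlib import Path

from weather_server import mcp

def test_tool_schemas_match_snapshot():
    tools = asyncio.run(mcp.list_tools())
    current = {tool.name: tool.inputSchema for tool in tools}
    expected = json.loads(Path("tool_schemas.json").read_text())
    # A diff here is a contract change: update the snapshot and bump the
    # version deliberately instead of shipping a silent breaking change.
    assert current == expected
```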
Edge case testing. What happens when the input is valid but unusual? An empty string where a name is expected. A latitude of exactly 90.0. A date in the far future. A request with all optional fields omitted. These are the cases that expose assumptions baked into your implementation that are not reflected in your schema. If your schema says a field is a string with no constraints, and your implementation crashes on an empty string, your schema is lying.
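A sketch of what such a suite might cover for the forecast tool, assuming a hypothetical weather_server module with a fully implemented get_forecast:

```python
# Edge-case test sketch: each input is valid per the schema but unusual in practice.
import asyncio

import pytest

from weather_server import get_forecast  # hypothetical module

@pytest.mark.parametrize("latitude,longitude", [
    (90.0, 0.0),      # pole: exactly at the boundary
    (0.0, -180.0),    # antimeridian
    (0.0, 0.0),       # "null island": classic default-value bug territory
])
def test_unusual_but_valid_coordinates(latitude, longitude):
    result = asyncio.run(get_forecast(latitude=latitude, longitude=longitude))
    assert isinstance(result, str) and result  # must not crash or return empty
```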
Failure mode testing. Docker's best practices guide recommends a specific lifecycle test: can you connect to the server and list tools even when the server's backend dependencies (databases, APIs) are misconfigured? If your server establishes a database connection on startup and the database is down, the entire server fails to start, and agents cannot even discover what tools are available. The better pattern is to create connections per tool call, not per server lifecycle. You trade a small amount of latency for the ability to list tools and return meaningful errors even when backends are unavailable.
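A sketch of the per-call pattern, with a stub connection function standing in for a real database client:

```python
# Per-call connection sketch: `connect_to_database` is a hypothetical stand-in
# for whatever client library the real backend uses.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")

async def connect_to_database():
    """Stand-in for a real connection helper; always fails in this sketch."""
    raise ConnectionError("order database unavailable")

@mcp.tool()
async def get_order(order_id: str) -> str:
    """Look up an order by its ID."""
    try:
        db = await connect_to_database()  # opened per call, not at server startup
    except ConnectionError:
        # The server still starts and lists its tools when the backend is down;
        # the failure surfaces per call as an actionable, retriable error.
        return "Error (transient): order database is unreachable. Retry later."
    try:
        return await db.fetch_order(order_id)
    finally:
        await db.close()
```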
The ugly test. Force a timeout mid-operation and confirm that no side effect was duplicated. If your tool creates a payment, force a network failure after the payment is submitted but before the confirmation is returned. Does the retry create a second payment? This is the test that separates demos from production systems, and most developers never run it. Production environments that support credential isolation and audit trails -- like those offered through AgentPMT -- make failure mode testing more tractable, because you can trace exactly which credentials were used and what operations were attempted during a failed sequence.
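A sketch of that test against a toy payment function; the in-memory payments store and the fail_after_commit switch are illustrative stand-ins for a real backend and a real injected network failure:

```python
# "Ugly test" sketch: force a failure after the side effect is committed but
# before the confirmation returns, then retry with the same idempotency key.
import pytest

payments: dict[str, dict] = {}

def submit_payment(idempotency_key: str, amount_usd: float, fail_after_commit: bool = False) -> dict:
    if idempotency_key in payments:
        return payments[idempotency_key]
    payments[idempotency_key] = {"key": idempotency_key, "amount_usd": amount_usd}
    if fail_after_commit:
        raise TimeoutError("connection dropped before confirmation was returned")
    return payments[idempotency_key]

def test_retry_after_timeout_does_not_duplicate_payment():
    with pytest.raises(TimeoutError):
        submit_payment("key-123", 25.00, fail_after_commit=True)
    submit_payment("key-123", 25.00)   # the retry
    assert len(payments) == 1          # exactly one payment, not two
```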
Versioning and Distribution
Once a tool works and is tested, you need to get it to users. MCP tool distribution today happens through a few channels: npm packages, PyPI packages, Docker images, and direct configuration in client apps like Claude Desktop or VS Code.
Docker's position -- and they are not wrong -- is that containerization is the most portable packaging format. A Docker image bundles your server, its dependencies, and its runtime into a single artifact that works on any system that runs Docker. No Python version conflicts, no Node.js compatibility issues. Microsoft took a different approach with their Azure DevOps MCP Server, releasing it as an open-source repository with expanded documentation -- prioritizing transparency and community contribution over packaged distribution. Both approaches work. The choice depends on your audience and your operational constraints.
What matters more than the packaging format is the versioning discipline. MCP tools, once installed, become part of someone's workflow. They are dependencies in the truest sense. When you change a tool's schema, error format, or behavior, you are making a change that propagates to every agent using that tool.
Pin versions. Make upgrades explicit. Log capability diffs between versions so that operators can see exactly what changed. If you add a required parameter to a tool, that is a breaking change even if the tool still "works" -- every existing caller that does not provide the new parameter will fail. Treat tool version bumps with the same seriousness you treat API version bumps, because they carry the same consequences.
The MCP ecosystem does not yet have a universal registry or versioning standard. This is both a limitation and an opportunity. It means the teams that establish good versioning practices now -- semantic versioning, changelogs, deprecation notices -- will build trust with adopters, while the teams that ship breaking changes silently will lose users the moment an alternative exists. Platforms that provide vendor whitelisting give operators additional control here, allowing them to approve specific tool providers and versions before agents can access them.
From Tool to Product
Productizing a tool is the step that turns a useful function into something worth listing in a catalog or marketplace. It requires thinking about the tool from the perspective of someone who has never seen your code, does not know your API, and will evaluate whether to use your tool based entirely on its metadata.
A productized tool has a name that describes what it does, a description that an agent can parse, a schema that validates inputs before they reach your code, error messages that enable automated recovery, pricing metadata that enables budget-aware agents, examples that demonstrate correct usage, and version information that enables safe upgrades. Each of these properties removes a category of ambiguity. Ambiguity is the enemy of autonomous operation.
This is the problem that AgentPMT's DynamicMCP was built to address. When agents discover tools on demand through a centralized catalog, the quality of tool metadata is not just nice-to-have -- it is the mechanism by which discovery works. Tools with incomplete schemas or missing descriptions do not surface in search results because agents literally cannot determine what they do. The catalog enforces a minimum bar for tool metadata because without that bar, discovery degrades into guessing. When a tool also needs usage-based payment, x402Direct provides the payment challenge natively in the HTTP request, so the agent can budget and pay without a separate billing integration. Operators can manage all of this through the AgentPMT mobile app and workflow builder, giving them visibility into which tools their agents use and how much they spend.
Productizing a tool is not about marketing. It is about removing every piece of information that currently lives in a developer's head and encoding it in machine-readable metadata. The tool should be fully self-describing. If an agent cannot figure out how to use your tool from the tool's own metadata, the tool is not a product yet.
What This Means for Tool Builders
The shift from scripts to agent-grade products is not just a technical upgrade -- it is a change in who your user is. When the consumer of your tool is an autonomous agent rather than a human developer, every shortcut you took in documentation, error handling, and schema design becomes a point of failure. The bar is higher because the tolerance for ambiguity is zero.
For independent developers and small teams, this is an opportunity. The MCP ecosystem is still early enough that high-quality tools stand out. A well-packaged tool with complete schemas, clear error taxonomy, idempotency guarantees, and pricing metadata will earn trust and adoption faster than a technically superior tool with sloppy packaging. The tooling exists -- the SDKs handle the protocol, the Inspector handles testing, and platforms like AgentPMT handle discovery, payment, and distribution through DynamicMCP.
For enterprises building internal tools, the implications are about governance. When agents can discover and invoke tools autonomously, the organization needs controls over which tools are available, who pays for them, and what data they access. Credential isolation ensures that agents use scoped credentials rather than shared secrets. Audit trails provide the record of what was called, when, and with what result. These are not optional features for regulated industries -- they are prerequisites for deploying autonomous agents in production.
What to Watch
Three trends will shape MCP tool packaging over the next twelve months.
First, registry convergence. Docker's MCP Catalog, npm, PyPI, and various community registries all distribute MCP servers today, but there is no unified discovery mechanism. Expect consolidation around a smaller number of registries with standardized metadata requirements, similar to how container image distribution settled on a handful of registries with the OCI standard.
Second, schema enforcement tightening. Right now, you can ship an MCP tool with a parameter called data of type any and no description. That will not last. As agents become the primary consumers of tool metadata, catalogs and platforms will require complete schemas with descriptions, constraints, and examples as a condition of listing. The tools that already meet this bar will have an advantage.
Third, contract testing as a norm. The MCP specification will likely add formal mechanisms for tool versioning and breaking change detection. Teams that build contract testing into their CI pipelines now will be ready. Teams that do not will discover the hard way that an untested schema change at 2 AM on a Friday breaks every agent workflow that depends on their tool.
The opportunity for tool developers is real and immediate. Agents need tools. The protocol exists. The SDKs are mature. The gap is not in technology -- it is in the discipline of treating a function as a product. The developers who close that gap first will own the defaults in a market that is still being defined.
Key Takeaways
- The gap between a working script and an agent-grade MCP tool is defined by five properties: schema completeness, error taxonomy, idempotency, pricing metadata, and examples. Each one eliminates a category of ambiguity that agents cannot resolve on their own.
- Tool descriptions and error messages are the primary interface for agents -- they need to be written for software that takes everything literally, not for humans who can infer context. Front-load critical information, specify limitations, and make errors actionable.
- Contract testing, failure mode testing, and the "ugly test" (forcing a timeout mid-operation to verify no duplicated side effects) are what separate demo-quality tools from production-grade products. Build these into CI before shipping.
Ready to package your MCP tools for agent discovery and autonomous use? Explore AgentPMT to list your tools in the DynamicMCP marketplace, integrate x402Direct payments, and give operators the controls they need to deploy your tools in production agent workflows.
Sources
- MCP Docs - Build a Server - modelcontextprotocol.io
- MCP Docs - Inspector Tool - modelcontextprotocol.io
- MCP Docs - Server Concepts - modelcontextprotocol.io
- MCP Docs - Connect Remote Servers - modelcontextprotocol.io
- MCP Python SDK - github.com
- MCP TypeScript SDK - github.com
- Top 5 MCP Server Best Practices - docker.com
- MCP Tool Descriptions: Overview, Examples, and Best Practices - merge.dev
- InfoQ - Azure DevOps MCP Server GA - infoq.com
