Stop Grading Your Agents Like a Homework Assignment

By Stephanie Goodman
February 15, 2026

Accuracy-only evaluation is the most dangerous metric in your agent stack. A multi-dimensional eval framework covering cost, safety, reliability, speed, and correctness is essential to catch regressions before they reach production. This article covers how to build an eval harness, design test datasets including a "bad days corpus," assert on invariants rather than exact outputs, and establish the organizational practices that make evaluation stick.

Your agent got the right answer. Congratulations. It also burned four dollars in tool calls, triggered a rate limit on your enrichment API, nearly emailed a customer whose record was marked "do not contact," and took forty-seven seconds to do what the previous version did in twelve. But the answer was correct, so the eval passed.

This is the state of agent evaluation at most organizations: a binary pass/fail on correctness, maybe a vibe check from someone on the team who eyeballs a few outputs, and a general sense that things are probably fine. It is the testing equivalent of checking whether a bridge can hold a car while ignoring whether it sways in the wind, costs ten times the budget, or has a tendency to drop the guardrails on Tuesdays.

Correctness is not irrelevant. It is just radically insufficient. If the only question your eval suite answers is "did the agent produce an acceptable output," you are blind to the five other ways a workflow degrades in production. And those five other ways are the ones that actually get your agent program killed -- not by a spectacular failure, but by a slow bleed of rising costs, widening safety gaps, and accumulating unreliability that nobody notices until finance or legal comes knocking. This is why platforms like AgentPMT build cost tracking, budget enforcement, and audit trails directly into the tool execution layer — the operational dimensions that correctness-only evaluation completely misses.

A research paper from Carnegie Mellon and Salesforce quantified the problem precisely: optimizing for accuracy alone produces agents that are 4.4 to 10.8 times more expensive than cost-aware alternatives that deliver comparable results. Their proposed CLEAR framework -- Cost, Latency, Efficacy, Assurance, Reliability -- found that accuracy-only evaluation correlates with production success at just 0.41, while multi-dimensional evaluation reaches 0.83. That is not a marginal difference. That is the difference between a metric that works and a metric that lies to you with a straight face.

This article is about building the evaluation infrastructure that catches everything correctness misses. Not as a theoretical exercise, but as a practical engineering discipline that runs automatically, flags regressions before they reach production, and turns every bad day into a permanent test case.

The Five Dimensions, and Why You Cannot Skip Any of Them

Correctness is one axis. Here are the other four, and why each one has killed agent programs that were getting the right answers.

Cost. An agent workflow that produces correct results at three times the expected spend is a workflow with a bug. The bug just happens to live in the economics, not the logic. Cost regressions are insidious because they do not trigger errors. The workflow completes successfully. The output looks fine. And the bill is 40% higher than last week because a prompt change caused the model to make two extra tool calls per run, each hitting a paid API. If you are not asserting on cost in your eval suite, you will discover cost regressions the way most teams do: from a finance email three weeks after the fact.

Safety. Guardrails are only useful if they stay in place. A prompt change, a tool update, or a model swap can silently weaken a safety boundary that was working last week. The agent starts accessing tools outside its allow-list. It stops checking budget limits before write operations. It begins including PII in outputs that previously were clean. Safety regressions are the most dangerous kind because they feel like normal operation until they are not. You need tests that prove your guardrails fire when they should -- and tests that prove changes have not introduced new ways to bypass them.

Reliability. A workflow that succeeds 95% of the time in testing and 80% of the time in production has a reliability problem, not an accuracy problem. Retries are up. Timeouts are more frequent. Error handling that worked last month now falls through to an unhandled state. Reliability is about the workflow completing its intended path, not about whether the final output is correct when it does complete. You can have perfect accuracy and terrible reliability if your workflow only succeeds on the easy cases and fails silently on the hard ones.

Speed. Latency is a business constraint, not a vanity metric. A workflow that takes ninety seconds to process a request that a customer is waiting on is broken, regardless of whether the answer is perfect. Speed regressions compound: a slower workflow means more concurrent executions, more resource contention, and higher infrastructure costs. It also means a worse user experience for every human or system downstream of the agent.

The earlier article in this series on workflow reliability introduced five operational metrics -- completion rate, cost per outcome, retry rate, tool error rate, and guardrail activation rate. Those metrics tell you how a production system is performing right now. What we are building here is different: an evaluation system that tells you how a change will affect performance before it reaches production. The operational metrics are your thermometer. The eval harness is your immune system.

Why Vibes Fail at Scale

Every team that evaluates agents informally follows the same trajectory. At first, the team is small, the workflows are few, and the person who built the agent also reviews its outputs. They know what "good" looks like because they have context, intuition, and the muscle memory of having debugged every failure.

This works until it does not.

It stops working when the team grows and the new people do not have the original builder's intuition. It stops working when the workflow count goes from three to thirty and nobody has time to eyeball outputs. It stops working when a model update changes the agent's behavior in a way that is subtle enough to pass a casual review but significant enough to matter -- slightly different tool-call ordering, slightly more verbose outputs, slightly looser adherence to a formatting constraint.

The academic term for what most teams do is "spot-checking with confirmation bias." You look at a few outputs, they seem fine, you ship. The outputs you chose to look at were not representative. The ones you skipped had the regression. You find out in production.

Vibes-based evaluation also cannot be audited, cannot be reproduced, and cannot be run in CI. It is a process that depends on a specific person's judgment being available at the exact moment a change is being evaluated. That person goes on vacation, and the eval process goes with them.

The fix is not to remove human judgment. It is to encode the important parts of that judgment into automated assertions and reserve human review for the cases where automation genuinely cannot decide.

Designing the Eval Dataset: Four Categories That Cover the Terrain

A useful eval dataset is not a random sample of inputs. It is a deliberately constructed collection that exercises the dimensions you care about. Four categories, each serving a distinct purpose.

Golden runs. These are your known-good executions -- a recorded trace of a workflow that completed correctly, at acceptable cost, within safety boundaries, with reasonable latency. Golden runs are your positive controls. After any change, replay them and assert that the things that must remain stable actually remain stable. Not the exact output text -- agents are non-deterministic, and asserting on verbatim text is a losing game. Assert on the invariants: the right tools were called, the cost stayed within bounds, safety checks fired where they should have, the output contains required fields, and the output does not contain prohibited content.
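
To make that concrete, here is a minimal sketch of golden-run assertions, assuming the harness records each run as a trace with tool calls, total cost, guardrail events, and a structured output. The field names and thresholds are illustrative, not a prescribed schema.

```python
# Minimal sketch of invariant checks against a replayed golden run.
# The trace structure (tool_calls, total_cost_usd, guardrail_events, output)
# is hypothetical -- adapt the field names to whatever your harness records.

REQUIRED_TOOLS = {"crm_lookup", "send_draft"}
MAX_COST_USD = 0.75
REQUIRED_FIELDS = {"summary", "next_action"}
PROHIBITED_PATTERNS = ("ssn:", "do not contact")


def check_golden_run(trace: dict) -> list[str]:
    """Return a list of violated invariants (an empty list means the run passed)."""
    violations = []

    # The right tools were called.
    called = {call["tool"] for call in trace["tool_calls"]}
    if not REQUIRED_TOOLS <= called:
        violations.append(f"missing tool calls: {REQUIRED_TOOLS - called}")

    # Cost stayed within bounds.
    if trace["total_cost_usd"] > MAX_COST_USD:
        violations.append(f"cost {trace['total_cost_usd']:.2f} exceeds {MAX_COST_USD}")

    # Safety checks fired where they should have.
    if not trace["guardrail_events"]:
        violations.append("expected guardrail checks did not fire")

    # The output contains required fields and no prohibited content.
    missing = REQUIRED_FIELDS - trace["output"].keys()
    if missing:
        violations.append(f"output missing fields: {missing}")
    text = str(trace["output"]).lower()
    for pattern in PROHIBITED_PATTERNS:
        if pattern in text:
            violations.append(f"prohibited content: {pattern!r}")

    return violations
```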

Adversarial cases. These are inputs designed to break things on purpose. Prompt injection attempts. Inputs that try to trick the agent into calling tools outside its scope. Requests that should trigger a guardrail but are phrased to sound innocent. Edge-case formatting that has historically caused parsing failures. Adversarial cases are your safety and robustness regression suite. If a change causes any of these to start succeeding where they previously failed (or vice versa), you need to know before production does.

Edge cases. These are legitimate inputs that live at the boundaries of your workflow's design. The longest reasonable input. The shortest. Inputs in unexpected languages. Inputs with special characters. Inputs that require the maximum number of tool calls. Inputs that hit rate limits. Edge cases catch the failures that do not show up in the happy path but show up reliably in production traffic, because production traffic has a creativity that no test designer can match.

The bad days corpus. This is the most valuable category, and most teams never build it. Every incident, every production failure, every case where the agent did something wrong or unexpected -- these become permanent test cases. The workflow sent a duplicate email? That input, that tool state, and that failure mode become a regression test. The agent exceeded its budget on a specific type of request? That request goes into the corpus. Google's SRE practice of blameless postmortems emphasizes turning every failure into organizational learning. The bad days corpus is the evaluation equivalent: turning every failure into automated prevention.

Over time, the bad days corpus becomes your most accurate model of how your workflows actually fail. It is not theoretical. It is empirical. And it grows every time something goes wrong, which means your test coverage improves precisely when it matters most.
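
One lightweight way to operationalize the corpus is an append-only file of replayable cases, one per incident. The sketch below assumes a JSONL file and illustrative field names; adapt it to however your harness stores cases.

```python
# Sketch of turning a production incident into a permanent regression case.
# The file layout and field names are assumptions, not a prescribed format.
import json
from datetime import date
from pathlib import Path

CORPUS = Path("evals/bad_days_corpus.jsonl")


def add_bad_day(incident_id: str, input_payload: dict, failure_mode: str,
                assertion: dict) -> None:
    """Append one incident to the corpus as a replayable eval case."""
    case = {
        "incident_id": incident_id,
        "added": date.today().isoformat(),
        "input": input_payload,          # the exact input that triggered the failure
        "failure_mode": failure_mode,    # e.g. "duplicate email sent"
        "assertion": assertion,          # the invariant that must now hold
    }
    CORPUS.parent.mkdir(parents=True, exist_ok=True)
    with CORPUS.open("a") as f:
        f.write(json.dumps(case) + "\n")


# Example: a hypothetical duplicate-email incident becomes a test case.
add_bad_day(
    incident_id="INC-2026-014",
    input_payload={"customer_id": "c_123", "action": "send_renewal_notice"},
    failure_mode="duplicate email sent on retry",
    assertion={"type": "behavioral", "rule": "send_email called at most once"},
)
```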

Building the Eval Harness: Two Modes, One Pipeline

An eval harness that only runs when someone remembers to trigger it is not a harness. It is a suggestion. The harness needs to run automatically, in two modes, with different tradeoffs for speed and depth.

The fast suite runs on every change -- every prompt edit, every tool update, every policy modification. It should complete in under five minutes. That means a curated subset of your eval dataset: a handful of golden runs, your highest-priority adversarial cases, and a selection of bad-days regressions. The fast suite answers one question: "Did this change break something obvious?" It is your smoke test. It gates merges. If it fails, the change does not ship.
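
A minimal gate can be a script that replays the curated subset and exits nonzero on any violation, which CI treats like a failing unit test. The runner and checker below are caller-supplied stand-ins for your own agent execution and assertion code, not a specific framework's API.

```python
# Sketch of a fast-suite gate: run a curated subset of cases and fail the
# build on any violation. run_workflow and check_case are assumptions here,
# standing in for the team's own agent runner and assertion logic.
import json
from pathlib import Path
from typing import Callable

FAST_SUITE = Path("evals/fast_suite.jsonl")   # curated, small, under five minutes


def run_fast_suite(run_workflow: Callable[[dict], dict],
                   check_case: Callable[[dict, dict], list[str]]) -> int:
    failures = []
    for line in FAST_SUITE.read_text().splitlines():
        case = json.loads(line)
        trace = run_workflow(case)            # execute the agent on this input
        violations = check_case(case, trace)  # evaluate the case's assertions
        if violations:
            failures.append((case["id"], violations))

    for case_id, violations in failures:
        print(f"FAIL {case_id}: {violations}")
    print(f"{len(failures)} failing case(s) in the fast suite")
    return 1 if failures else 0               # nonzero exit blocks the merge in CI


# In CI, something like: raise SystemExit(run_fast_suite(my_runner, my_checker))
```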

The deep suite runs on a schedule -- nightly, or at minimum weekly. It runs the full eval dataset: every golden run, every adversarial case, every edge case, the entire bad days corpus. It also runs statistical evaluations that the fast suite cannot afford: multiple repetitions of the same input to measure output variance, cost distribution analysis across many runs, and latency percentile tracking. The deep suite answers a different question: "Is the system drifting?" Drift does not show up in a single run. It shows up in aggregates over time.
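
The statistical layer can be as simple as repeating each case and summarizing the spread, as in this sketch. The trace fields and repetition count are assumptions, and the runner is the same kind of caller-supplied function as in the fast-suite sketch.

```python
# Sketch of the deep suite's statistical pass: repeat each case and look at
# distributions rather than single runs. Trace field names are assumptions.
import json
import statistics
from typing import Callable


def profile_case(case: dict, run_workflow: Callable[[dict], dict],
                 repetitions: int = 10) -> dict:
    """Run one case several times and summarize cost, latency, and variance."""
    costs, latencies, outputs = [], [], []
    for _ in range(repetitions):
        trace = run_workflow(case)
        costs.append(trace["total_cost_usd"])
        latencies.append(trace["latency_s"])
        outputs.append(json.dumps(trace["output"], sort_keys=True))

    return {
        "case_id": case["id"],
        "cost_median": statistics.median(costs),
        "cost_p95": statistics.quantiles(costs, n=20)[-1],
        "latency_p95": statistics.quantiles(latencies, n=20)[-1],
        "distinct_outputs": len(set(outputs)),   # crude non-determinism signal
    }
```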

Both suites need to produce structured results, not just pass/fail. For each eval case, record: the input, the tool calls made, the cost incurred, the latency, any guardrail activations, and the final output. Store these as structured data, not log files. You will need to query them, compare them across versions, and chart them over time.
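
A plain dataclass is enough to keep these records structured and queryable. The fields below mirror the list above; the exact schema is up to you.

```python
# One possible schema for a structured eval record -- field names are
# illustrative, but the principle is queryable data, not log lines.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EvalRecord:
    case_id: str
    workflow_version: str
    input_payload: dict
    tool_calls: list[dict]          # e.g. [{"tool": ..., "cost_usd": ..., "latency_s": ...}]
    total_cost_usd: float
    latency_s: float
    guardrail_activations: list[str]
    output: dict
    passed: bool
    violations: list[str] = field(default_factory=list)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Records serialize cleanly for a table or warehouse via dataclasses.asdict(record).
```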

The practical tooling landscape for this has matured considerably. Frameworks like Braintrust offer CI/CD integration with GitHub Actions that post eval results directly on pull requests. Promptfoo provides YAML-based test definitions that work well for teams that want to start simple. DeepEval offers sixty-plus built-in metrics including safety-specific evaluations. The choice depends on your stack, but the principle is the same: eval results should be as visible and as blocking as unit test results. If your evals live in a notebook that someone runs manually, they are not evals. They are documentation.

Asserting on Invariants, Not Exact Outputs

The fundamental challenge of evaluating non-deterministic systems is that the same input will not produce the same output twice. The agent might phrase a summary differently. It might call tools in a slightly different order. It might take an extra reasoning step. None of this necessarily means the output is wrong. But it does mean that exact-match assertions -- the bread and butter of traditional testing -- are useless.

The solution is to assert on invariants: properties that must hold regardless of the specific output.

Structural invariants. The output must contain certain fields. The output must not exceed a maximum length. The output must be valid JSON, or valid markdown, or conform to a specific schema. These are cheap to check and catch a surprising number of regressions.

Behavioral invariants. The workflow must call specific tools (or must not call others). The total number of tool calls must stay within a range. Write operations must include idempotency keys. Budget checks must precede paid API calls. These are assertions on the trajectory of the workflow, not just the final output.
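
Here is a sketch of behavioral invariants asserted on the recorded trajectory, assuming each tool call in the trace carries flags like paid, write, and idempotency_key; those names are illustrative.

```python
# Sketch of behavioral invariants checked against the trajectory, not the
# final output. The per-call flags (paid, write, idempotency_key) are assumed.
def check_behavioral_invariants(trace: dict) -> list[str]:
    violations = []
    calls = trace["tool_calls"]

    # Every paid API call must be preceded by a budget check.
    budget_checked = False
    for c in calls:
        if c["tool"] == "budget_check":
            budget_checked = True
        elif c.get("paid") and not budget_checked:
            violations.append(f"paid call to {c['tool']} made before any budget check")

    # Write operations must carry idempotency keys.
    for c in calls:
        if c.get("write") and not c.get("idempotency_key"):
            violations.append(f"write via {c['tool']} has no idempotency key")

    # The total tool-call count must stay within the expected range.
    if not 2 <= len(calls) <= 8:
        violations.append(f"{len(calls)} tool calls is outside the expected range 2-8")

    return violations
```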

Content invariants. The output must not contain PII. The output must not include content from a prohibited category. The output must reference specific entities that were present in the input. For subjective quality judgments -- "is this summary accurate?" -- an LLM-as-judge approach works, but treat the judge's scores as distributions, not point values. Run the judge multiple times and assert on the range, not the exact score.

Economic invariants. The total cost of the run must not exceed a ceiling. The cost must not exceed the previous version's cost by more than a threshold (say, 15%). The token-to-tool-cost ratio should stay within historical bounds. This is cost regression testing, and it is the assertion that most teams skip and most teams need.

Safety invariants. Guardrails that fired in the baseline must still fire. Guardrails that did not fire must still not fire. No new tools should appear in the call trace that were not in the baseline. No escalation paths should be bypassed. This is safety regression testing, and it should be as mandatory as functional testing.
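
Economic and safety invariants are easiest to express as a comparison against a baseline trace, as in this sketch. The ceiling, tolerance, and field names are examples, not recommended values.

```python
# Sketch of economic and safety regression checks against a baseline trace.
# Thresholds and trace field names are illustrative.
COST_CEILING_USD = 1.00
COST_REGRESSION_TOLERANCE = 0.15   # allow at most +15% over the baseline


def check_against_baseline(trace: dict, baseline: dict) -> list[str]:
    violations = []

    # Economic: absolute ceiling plus a relative regression bound.
    cost, base_cost = trace["total_cost_usd"], baseline["total_cost_usd"]
    if cost > COST_CEILING_USD:
        violations.append(f"cost {cost:.2f} exceeds ceiling {COST_CEILING_USD:.2f}")
    if base_cost > 0 and (cost - base_cost) / base_cost > COST_REGRESSION_TOLERANCE:
        violations.append(f"cost regressed {cost / base_cost - 1:+.0%} vs baseline")

    # Safety: guardrails that fired in the baseline must still fire.
    dropped = set(baseline["guardrail_activations"]) - set(trace["guardrail_activations"])
    if dropped:
        violations.append(f"guardrails no longer firing: {dropped}")

    # Safety: no new tools may appear in the call trace.
    new_tools = ({c["tool"] for c in trace["tool_calls"]}
                 - {c["tool"] for c in baseline["tool_calls"]})
    if new_tools:
        violations.append(f"tools in trace but not in baseline: {new_tools}")

    return violations
```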

At AgentPMT, the budget controls built into DynamicMCP provide a natural enforcement layer for economic and safety invariants. When a workflow's eval run triggers a budget cap or an allow-list block, that is not a test failure -- that is a test passing, because the control did its job. The eval harness should distinguish between "the workflow failed because it hit a guardrail" and "the workflow failed because it produced bad output." The former is the system working. The latter is a regression.
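
In practice that means the harness classifies a stopped run before counting it as a failure. A small sketch, with a hypothetical stopped_by field standing in for whatever your execution layer actually records:

```python
# Sketch of the distinction the harness should make: a run stopped by a
# guardrail is the control working, not a regression. Field names are assumed.
def classify_failure(trace: dict) -> str:
    if trace.get("stopped_by") in {"budget_cap", "allow_list_block"}:
        return "guardrail_enforced"   # expected outcome for adversarial cases
    return "output_regression"        # genuine failure that needs investigation
```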

Cost Regression Testing: The Eval Nobody Runs But Everybody Needs

Here is a scenario that plays out at least once in every agent program's lifecycle. A developer improves the prompt. The agent's outputs get better -- more detailed, more nuanced, more thorough. Everyone is happy. The eval suite passes because it only checks correctness. Two weeks later, finance flags that tool spend for that workflow is up 35%. The improved prompt caused the agent to make additional enrichment calls "to be thorough." Each call was correct. Each call was also billable.

Cost regression testing prevents this. It is simple in concept: after every change, compare the cost profile of the new version against the baseline. Not just total cost, but the distribution: mean, median, p95, and max. A change that increases median cost by 5% might be acceptable. A change that increases p95 cost by 200% is a problem even if the median is flat, because it means the long tail got longer.

Implement cost regression testing by adding cost tracking to every eval run. Record token costs and tool costs separately -- as the earlier article in this series on token-versus-tool spend established, these behave differently and need different budgets. Set thresholds for acceptable regression. Make those thresholds part of your CI gate. A change that makes the workflow more expensive should require the same level of review as a change that makes it less correct.
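
A sketch of the comparison, assuming you have per-run costs collected for both the baseline and candidate versions; the tolerances are examples, not recommendations.

```python
# Sketch of cost regression testing across a batch of eval runs: compare the
# candidate version's cost distribution against the baseline version's.
import statistics


def cost_profile(costs: list[float]) -> dict:
    return {
        "mean": statistics.fmean(costs),
        "median": statistics.median(costs),
        "p95": statistics.quantiles(costs, n=20)[-1],
        "max": max(costs),
    }


def cost_regressions(baseline: list[float], candidate: list[float],
                     median_tol: float = 0.05, p95_tol: float = 0.25) -> list[str]:
    """Return human-readable regression findings; empty list means within tolerance."""
    base, cand = cost_profile(baseline), cost_profile(candidate)
    problems = []
    if cand["median"] > base["median"] * (1 + median_tol):
        problems.append(f"median cost up {cand['median'] / base['median'] - 1:+.0%}")
    if cand["p95"] > base["p95"] * (1 + p95_tol):
        problems.append(f"p95 cost up {cand['p95'] / base['p95'] - 1:+.0%}")
    return problems
```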

The x402Direct payment protocol makes this kind of tracking inherent rather than bolt-on. When tool calls carry payment proof as part of the request/response loop, cost attribution is not an afterthought -- it is a structural property of every interaction. Your eval harness can read cost data directly from the execution trace without any additional instrumentation.

The Eval Dashboard: Spotting Drift Before It Becomes Damage

An eval harness produces data. A dashboard makes that data legible. The dashboard serves two audiences: the engineers who need to understand a specific regression, and the managers who need to understand systemic trends.

For engineers, the dashboard should show per-eval-case results across versions: which cases passed, which failed, and what changed. Link each failure to the specific assertion that broke. Show the diff between the baseline trace and the current trace -- which tool calls changed, where cost diverged, which guardrails behaved differently.

For managers, the dashboard should show aggregate trends over time. Five charts, updated after every deep suite run:

Eval pass rate by dimension. Separate lines for correctness, cost, safety, reliability, and speed. A correctness line that stays flat while the cost pass rate trends down is a specific, actionable signal: the workflow is getting more expensive without getting less correct, which means something changed in the execution path.

Cost distribution over time. Median and p95 cost per run, charted weekly. This is where slow drift becomes visible. A 3% weekly increase in median cost is invisible in any single eval run but compounds to more than 40% over three months.

Safety regression count. How many adversarial and safety eval cases changed status (pass to fail, or fail to pass) in each eval run. This number should usually be zero. When it is not zero, it should be investigated immediately, regardless of the overall pass rate.

Bad days corpus growth. How many new cases were added to the bad days corpus over time, and how many of them would have been caught by the eval suite that existed before the incident. This is your meta-metric: it tells you whether your eval infrastructure is learning from failures or just accumulating test cases.

Flake rate. What percentage of eval cases produce different results on repeated runs with the same input and same system version. A rising flake rate indicates increasing non-determinism in your workflow, which may be acceptable or may indicate a tool or model change that introduced instability.
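
Flake rate is cheap to compute from the deep suite's repeated runs. A sketch, assuming each case's repeated pass/fail outcomes are collected for the same system version:

```python
# Sketch of a flake-rate calculation: re-run each case N times on the same
# version and count cases whose results disagree with themselves.
def flake_rate(results_by_case: dict[str, list[bool]]) -> float:
    """results_by_case maps case_id -> pass/fail outcomes from repeated runs."""
    if not results_by_case:
        return 0.0
    flaky = sum(1 for outcomes in results_by_case.values() if len(set(outcomes)) > 1)
    return flaky / len(results_by_case)


# Example: two of four cases disagree with themselves, so the flake rate is 0.5.
print(flake_rate({
    "golden_001": [True, True, True],
    "adv_007": [True, False, True],
    "edge_012": [False, False, False],
    "badday_003": [True, True, False],
}))
```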

The Organizational Practice That Makes Evals Stick

Tooling without practice is shelfware. Two organizational habits separate teams that maintain a useful eval suite from teams that build one, neglect it, and revert to vibes.

First: every change to a production workflow must include an eval result. Not a passing eval result -- just a result. Sometimes a change intentionally regresses on one dimension (costs more but is significantly more correct). That is a valid tradeoff. But it should be an explicit tradeoff, documented in the change review, not an accidental one discovered later.

Second: every incident must produce at least one new eval case. The postmortem is incomplete until the bad days corpus has grown. This is the habit that makes your eval suite get better over time instead of stale. It is also the habit that is hardest to maintain, because after an incident the team wants to fix the problem and move on. Building the test feels like extra work. It is extra work. It is also the only thing that prevents the same failure from recurring with a different hat on.

What This Means for Teams Shipping Agent Workflows

Multi-dimensional evaluation is not a luxury for mature teams — it is a prerequisite for any team running agent workflows in production. The cost of building an eval harness is measured in days. The cost of not having one is measured in incidents, budget overruns, and safety regressions that erode organizational confidence in the entire agent program.

AgentPMT's infrastructure layer provides natural enforcement points for several eval dimensions. Budget controls through DynamicMCP enforce economic invariants at the infrastructure level — cost caps are not assertions in a test suite but hard limits enforced server-side on every tool call. The structured audit trail captures every tool invocation with cost attribution, policy decisions, and timing data, giving eval harnesses the raw data they need without custom instrumentation. And the mobile app provides real-time visibility into the operational metrics that complement offline evaluation: cost trends, guardrail activations, and agent activity across the fleet.

The teams that build evaluation infrastructure now will compound their advantage as workflows multiply. Every new workflow inherits the eval dataset patterns. Every incident feeds the bad days corpus. The investment in eval discipline pays dividends on every change, every deployment, and every incident that gets caught before production.

What to Watch

Three developments will reshape how teams evaluate agent workflows over the next year.

Eval-as-infrastructure is becoming a product category. Tools like Braintrust, Promptfoo, and DeepEval are converging on a common pattern: structured eval datasets, CI/CD integration, multi-dimensional metrics, and dashboard visualization. The teams that adopt structured eval tooling now will have a significant operational advantage over teams that continue to roll their own.

Multi-dimensional evaluation is getting academic rigor. The CLEAR framework from Carnegie Mellon and Salesforce is one example, but there are others. Expect eval frameworks to adopt standardized dimensions beyond correctness, making it easier to compare agent performance across organizations and vendors.

Cost and safety regression testing will become table stakes. Just as security scanning became a standard CI gate, cost and safety assertions in eval suites will move from "nice to have" to "required for production deployment." The teams that have been doing this already will not need to scramble.

The uncomfortable truth about agent evaluation is that it is not technically hard. The assertions are straightforward. The tooling exists. The hard part is the discipline: building the dataset, maintaining the corpus, running the suite on every change, and treating eval failures with the same seriousness as test failures.

The teams that get this right will ship faster, because they will have confidence that changes are safe. The teams that skip it will ship faster too -- until the first incident that an eval suite would have caught.

Build the harness. Feed it your bad days. Run it on everything. That is the whole secret.

AgentPMT provides the infrastructure layer that makes multi-dimensional evaluation practical — cost attribution, budget enforcement, and structured audit trails across every connected agent. Start building on it.

Key Takeaways

  • Accuracy is one dimension out of five, and it is not the most important one. Cost, safety, reliability, and speed regressions kill agent programs more often than incorrect outputs do. Your eval suite must assert on all five dimensions, or it is testing the thing that matters least and ignoring the things that matter most.
  • Every incident that does not become a test case is a wasted failure. The bad days corpus -- built from real production incidents, adversarial discoveries, and edge-case surprises -- is the single most valuable component of your eval infrastructure. It is empirical, it grows over time, and it ensures that every failure mode you have seen is permanently prevented.
  • Run evals like tests: automatically, on every change, with results that gate deployment. A fast suite on every merge, a deep suite on a schedule, both producing structured results that track cost, safety, correctness, reliability, and speed over time. If your eval process depends on a human remembering to run it, it is not a process. It is a hope.

Sources

  • Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems - arXiv 2511.14136, Carnegie Mellon University and Salesforce AI Research, November 2025
  • Postmortem Culture: Learning from Failure - Google Site Reliability Engineering, O'Reilly Media
  • Best AI Evals Tools for CI/CD in 2025 - braintrust.dev
  • Testing for LLM Applications: A Practical Guide - langfuse.com, October 2025
  • Evaluations for the Agentic World - QuantumBlack / McKinsey, January 2026
  • Language Model Evaluation Harness - EleutherAI, GitHub
  • All DeepEval Alternatives, Compared - deepeval.com
  • A Pragmatic Guide to LLM Evals for Devs - The Pragmatic Engineer