The hardest part of building AI agents is not getting them to work. It is getting them to work reliably, and then proving it. Here is a practical methodology for end-to-end testing that uses one AI as a sparring partner for another.
We have spent the past several months building domain-specific agents — agents that write structured XML describing video, audio, and image generation pipelines. The agents compose media assets, orchestrate rendering steps, and produce output that downstream systems consume. It is a constrained domain with clear correctness criteria, which makes it an ideal laboratory for studying agent behavior.
Early on, we adopted a pattern that has proven remarkably effective: lint-like rules that run at execution time, giving the agent immediate feedback when it drifts outside the bounds of acceptable output. Think of it as guardrails with real-time correction. The agent proposes, the validator disposes, and the agent tries again within the same turn.
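As a minimal sketch of this propose-validate-retry loop: the checks below are illustrative, the `<pipeline>` root element is a hypothetical rule, and `generate` stands in for whatever produces the agent's output.

```python
import xml.etree.ElementTree as ET

def validate(output: str) -> list[str]:
    """Lint-like rules run at execution time (illustrative checks only)."""
    try:
        root = ET.fromstring(output)
    except ET.ParseError as e:
        return [f"not well-formed XML: {e}"]
    errors = []
    if root.tag != "pipeline":  # hypothetical schema rule for this sketch
        errors.append("root element must be <pipeline>")
    return errors

def run_with_correction(generate, max_retries: int = 3) -> str:
    """The agent proposes, the validator disposes, and the agent
    retries within the same turn using the error feedback."""
    feedback = None
    for _ in range(max_retries):
        output = generate(feedback)
        errors = validate(output)
        if not errors:
            return output
        feedback = "; ".join(errors)
    raise RuntimeError(f"validation still failing: {feedback}")
```

The point of the loop is that the validator's error messages are fed straight back to the agent, so correction happens inside the turn rather than in a separate review pass.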
This works. But it raised a harder question: how do you systematically improve the skeleton that the agent operates within — the prompts, the tool definitions, the validation rules themselves?
The Strandbeest Problem
I used the Strandbeest analogy to describe how software is the skeleton that channels an agent's search energy into coherent behavior. The wind (the model) provides intelligence; the skeleton (prompts, tools, constraints) determines the gait.
But the Strandbeest only works because Jansen spent decades refining each linkage ratio. In AI terms: prompt engineering is skeleton design. You try a phrasing, observe behavior, adjust, iterate. Effective but slow — a single engineer can hold maybe three or four prompt variants in their head at once.
What if the cost of designing and testing skeletons dropped to nearly zero?
Claude Code as Sparring Partner
The key insight is that Claude Code — or any sufficiently capable coding agent — can serve as an automated sparring partner for your domain-specific agent. Not replacing human judgment, but multiplying it.
The methodology is straightforward:
- Claude Code reads your agent's specification — the prompts, the tool schemas, the validation rules, the domain documentation.
- Claude Code writes E2E test cases — complete scenarios that exercise the agent's capabilities, including edge cases a human might not think to test.
- You manually verify the test cases — this is the crucial human-in-the-loop step. The tests need to be reasonable before you trust them.
- You launch parallel runs — dozens of variations, each with slightly different prompts or tool configurations, all graded by the same test harness.
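The parallel-run step above can be sketched as a small harness; `invoke_agent` and `grade` are placeholders for your own API client and grading function, not real library calls.

```python
from concurrent.futures import ThreadPoolExecutor

def run_variants(invoke_agent, grade, variants, test_cases):
    """Run every prompt variant against the same test cases in parallel
    and return a mean grade per variant."""
    def score(prompt):
        grades = [grade(invoke_agent(prompt, case), case)
                  for case in test_cases]
        return sum(grades) / len(grades)

    # Each variant is scored against the identical test suite,
    # so the resulting numbers are directly comparable.
    with ThreadPoolExecutor() as pool:
        return dict(zip(variants, pool.map(score, variants)))
```

Because every variant is graded by the same harness against the same cases, a higher mean grade is attributable to the variant itself rather than to a different test mix.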
The result is a tight feedback loop that crystallizes effective configurations through sheer volume of experimentation. It is like dynamic programming: you search the space of sub-problems ahead of time, record the optimal solutions, and compose them into a final answer.
The Scaffolding
What infrastructure does this actually require? Less than you might think. Here is the minimal set:
1. API access to your agent. You need to be able to invoke your agent programmatically and capture the full response, including any tool calls.
2. Execution trace observability. You need to see not just the final output but the intermediate steps — which tools were called, in what order, what was passed to each one, and how many tokens were consumed.
3. A response mechanism for interactive agents. When your agent pauses to ask a clarifying question, something needs to answer. This can be the sparring agent itself (acting as a simulated user) or a simple rule-based responder. Think of it as a ball machine in tennis practice — it does not need to be smart, it just needs to keep feeding balls.
4. A grading function. Something that takes the agent's output and produces a score or pass/fail judgment.
What the Grader Looks Like
The grader is where domain knowledge lives. It is the function that decides whether the agent's output is good. For our media pipeline use case, a grader checks both correctness (valid XML, schema compliance, required elements) and efficiency (token usage relative to budget). This dual focus is important. An agent that produces correct output but burns three times the expected token budget is not a good agent — it is an expensive one.
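A sketch of such a dual-focus grader follows. The required elements, weights, and pass threshold are hypothetical placeholders, not our production values.

```python
import xml.etree.ElementTree as ET

def grade(output: str, tokens_used: int, token_budget: int,
          required_elements=("video", "audio")) -> dict:
    """Score an agent's XML output on correctness and token efficiency."""
    try:
        root = ET.fromstring(output)
    except ET.ParseError:
        return {"pass": False, "score": 0.0, "reason": "malformed XML"}
    # Correctness: schema-level checks (illustrative: required descendants).
    missing = [e for e in required_elements if root.find(f".//{e}") is None]
    correctness = 1.0 if not missing else 0.0
    # Efficiency: penalize outputs that overrun the token budget.
    efficiency = min(1.0, token_budget / max(tokens_used, 1))
    score = 0.7 * correctness + 0.3 * efficiency  # arbitrary weighting
    return {
        "pass": correctness == 1.0 and tokens_used <= 2 * token_budget,
        "score": round(score, 3),
        "reason": f"missing: {missing}" if missing else "ok",
    }
```

Note that a correct-but-wasteful run still loses score through the efficiency term, which is exactly the "correct but expensive" failure mode described above.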
Comparing Across Dimensions
Once the harness is running, comparisons become trivial. Want to know whether Sonnet outperforms Haiku on your specific task? Run both with identical prompts and graders. Want to test whether a more verbose system prompt improves output quality? Run the variants in parallel and compare scores.
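The comparison step itself can be as simple as aggregating graded runs per variant; the 0.8 pass threshold here is an arbitrary choice for illustration.

```python
from statistics import mean

def compare(results: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Summarize graded runs per variant as (name, mean score, pass rate),
    best first. A run 'passes' at score >= 0.8 (arbitrary threshold)."""
    summary = [
        (variant, mean(scores), sum(s >= 0.8 for s in scores) / len(scores))
        for variant, scores in results.items()
    ]
    return sorted(summary, key=lambda row: row[1], reverse=True)
```

Reporting both mean score and pass rate matters: a variant can have a high average while still failing some cases outright, and the two numbers disagree in instructive ways.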
We found that the most valuable comparisons are often surprising. Small wording changes in tool descriptions sometimes matter more than switching models. Adding a single constraint to the system prompt — "always validate your XML before returning" — improved pass rates by 20% on one task, while a complete rewrite of the prompt had negligible effect on another.
You cannot discover these things through intuition alone. You need the volume of experimentation that automation provides.
The Meta Question: Testing the Tester
There is a recursive problem lurking here, and it is worth addressing head-on: how do you test the testing tool itself?
The danger is convergence toward overfitting. Each round of harness development risks building a test suite that is exquisitely tuned to one specific agent on one specific task — a mold rather than a lathe. The mold produces one shape. The lathe produces many.
In practice, we have observed two distinct modes of test development:
Mold-casting: You are building a specific product. During development, you crystallize a highly specific set of E2E tests — tests that know about your domain's edge cases, your users' common mistakes, your pipeline's failure modes. These tests are not reusable and are not meant to be. They are a mold, shaped to the exact contours of your project. This is fine. Most of the time, this is what you need.
Lathe-building: The more ambitious goal is to build a general-purpose test generation system — a tool that can rapidly produce molds for any new project. This is the meta-level. You are not testing an agent; you are building a system that can test any agent.
The honest answer is that most teams should start with mold-casting and stay there. The lathe is a research project. It is worth pursuing only if your cold-start cost is high enough to justify the investment — that is, if spinning up a new bespoke test suite for each project takes days or weeks rather than hours.
The Dynamic Programming Analogy
This is software-as-memoization applied to agent development itself. Each test run solves a sub-problem: "What is the best system prompt for task X?" or "Which tool schema produces the most reliable output for scenario Y?" The results are recorded. Over time, you accumulate a library of known-good configurations — a lookup table of optimized sub-solutions.
When you start a new project, you do not begin from scratch. You consult the table. You compose known-good pieces. And when you encounter a new sub-problem, you run the harness again to solve it, then add the result to the table. The specific code is throwaway. The pattern is durable.
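The lookup table can be sketched as a small persistent memo; the JSON-file storage and the free-form sub-problem keys are illustrative choices, not a prescribed format.

```python
import json
from pathlib import Path

class ConfigTable:
    """Memoized sub-solutions: the best-known configuration per sub-problem."""

    def __init__(self, path: str = "configs.json"):
        self.path = Path(path)
        self.table = (json.loads(self.path.read_text())
                      if self.path.exists() else {})

    def lookup(self, subproblem: str):
        """Return the best recorded config for this sub-problem, if any."""
        return self.table.get(subproblem)

    def record(self, subproblem: str, config: dict, score: float):
        """Keep the new config only if it beats the recorded best."""
        best = self.table.get(subproblem)
        if best is None or score > best["score"]:
            self.table[subproblem] = {"config": config, "score": score}
            self.path.write_text(json.dumps(self.table, indent=2))
```

The `record` method enforces the dynamic-programming discipline: each harness run can only improve the stored answer for its sub-problem, so the table monotonically accumulates known-good pieces.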
Practical Recommendations
If you are building agents and have not yet invested in E2E testing, here is a concrete starting point:
- Pick your highest-value agent task. The one where reliability matters most and failure is most visible.
- Write three to five test cases by hand. Cover the happy path, one edge case, and one known failure mode.
- Give the test cases and your agent's system prompt to Claude Code. Ask it to generate ten more test cases in the same style. Review them. Discard the bad ones.
- Write a grader. It does not need to be perfect. Start with structural checks and add semantic checks as you learn what matters.
- Run the harness. Compare two prompt variants against the same tests. Look at the scores. Look at the traces. You will learn something you did not expect.
The entire setup can be built in an afternoon. The insights compound over weeks. The skeleton gets better with each iteration, and the wind-walker walks a little more gracefully each time.