
February 14, 2026

Evals for Agentic AI: The Pillar That Makes Everything Else Work

evals · AI engineering · production AI · agentic AI · LLM-as-judge


Evals and data flywheels get lumped together constantly. I did it myself for a long time. They are related, deeply coupled even, but they are not the same thing. The way I think about it now: evals are the loss function, the flywheel is the training loop. A loss function without a training loop just tells you how bad things are. A training loop without a loss function wanders randomly. Neither is useful alone.

In my previous post I covered the flywheel. This one is about evals specifically: how to build them, how to mature them, and how to wire them to the flywheel so the whole system compounds.

Three Stages

In my experience, eval systems mature in stages. Each stage has different goals, different costs, and a different relationship with the flywheel.

Stage 1: Bootstrapping

This is the R&D-heavy phase. It is manual, slow, and feels unproductive. The temptation to jump past it is strong, but in my experience the teams that shortcut this phase end up rebuilding later.

Start with traces. Get them from production if you have it. If not, produce synthetic data. Synthetic data has its limits (real users tend to be weirder and more adversarial than you might expect), but it gets you moving. At Sixfold we didn't have great production traces early on because our observability was weak. That cost us months of iteration speed.

Have domain experts annotate, open coding style. I have found it works better to not start with a predefined failure taxonomy. Let domain experts look at actual traces and categorize what went wrong in their own words. If you are building an insurance copilot, that means underwriters looking at risk assessments and flagging what the system got wrong. The taxonomy should emerge from the data, not from a whiteboard session. Your mileage may vary, but the whiteboard-first approach has burned me before.

Cluster the annotations. Once you have enough annotated traces (roughly 50 to 100 for a meaningful start, though this depends on domain complexity), group them. In my experience, failures tend to clump into a surprisingly small number of categories. Maybe the system consistently mishandles multi-location policies. Maybe it gets confused when submission documents contradict each other. These clusters become your failure modes.

Build your canonical eval set. Take the annotated, clustered traces and turn them into test cases with verified ground truth. I call this canonical rather than golden for a reason: it is authoritative at the case level (each example has been verified by a domain expert), but the set itself should be living. It will grow. New failure modes will emerge as the system evolves and users find new ways to break things. If your eval set stops growing, that is probably a sign your evals are ossifying.
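One way to represent a canonical eval case is a small record that keeps the verified ground truth, the failure-mode label, and the verifying expert together. This is a minimal sketch; the field names and the example case are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    input_trace: str       # what went into the system
    expected_output: str   # domain-expert-verified ground truth
    failure_mode: str      # cluster label from the open-coding pass
    verified_by: str       # who signed off, for auditability

# The canonical set is simply a growing list of verified cases.
canonical_set = [
    EvalCase(
        case_id="uw-042",
        input_trace="Submission with two locations and conflicting TIV figures",
        expected_output="Flag the TIV conflict and request clarification",
        failure_mode="contradictory-documents",
        verified_by="senior-underwriter",
    ),
]
```

Keeping `verified_by` on every case is what makes the set canonical rather than merely golden: each example is authoritative because a named expert checked it.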

Iterate until your failure taxonomy stabilizes. When new traces mostly fall into existing categories rather than spawning new ones, that is a reasonable signal to move on. If you are still discovering new failure categories every week, it may be worth staying in this stage longer. The tradeoff is real: productizing too early means you formalize an incomplete picture.
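The "taxonomy has stabilized" signal can be made concrete with a quick check: what fraction of newly annotated traces land in an existing category? The threshold you act on is a judgment call; this sketch only computes the ratio:

```python
def taxonomy_stability(new_labels, known_categories):
    """Fraction of newly annotated traces whose failure label already
    exists in the taxonomy. Values near 1.0 suggest the taxonomy has
    stabilized and it may be safe to move on to productization."""
    if not new_labels:
        return 1.0
    known = set(known_categories)
    return sum(1 for label in new_labels if label in known) / len(new_labels)

# Nine of ten new annotations fall into known categories.
score = taxonomy_stability(
    ["multi-location"] * 6 + ["contradictory-docs"] * 3 + ["novel-failure"],
    ["multi-location", "contradictory-docs"],
)
```

Tracking this number week over week gives you a trend line rather than a gut feeling about when to leave stage 1.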

Stage 2: Productization

Now you formalize what you built by hand in stage 1.

Automate the error analysis pipeline. The manual process of pulling traces, routing them to annotators, and clustering results becomes a set of pipelines. This is where the glue code lives, and where Temporal earns its keep. Triage, extraction, routing, all orchestrated with durability and retries.

Formalize your evaluators. Some evals are deterministic: schema validation, boundary checks, format compliance. Keep those. They are fast, reliable, and cheap.
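A deterministic evaluator can be as simple as a schema check. This sketch assumes the system emits JSON with a known set of required fields; the field names are hypothetical:

```python
import json

def eval_output_schema(output_text, required_fields):
    """Deterministic eval: does the output parse as JSON and carry
    every required field? Fast, cheap, and unambiguous."""
    try:
        data = json.loads(output_text)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)

ok = eval_output_schema('{"risk_score": 72, "factors": []}',
                        ["risk_score", "factors"])
bad = eval_output_schema('{"risk_score": 72}',
                         ["risk_score", "factors"])
```

Checks like this should run on every trace; there is no reason to spend LLM-judge tokens on questions a parser can answer.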

For the subjective dimensions (did the risk assessment capture the right factors? is the summary accurate?), you can distill ground-truth-based evaluators into LLM-as-judge. The calibration step here is critical though. Run your LLM judge against the human-annotated ground truth from stage 1. Measure agreement. If the LLM judge disagrees with domain experts more than domain experts disagree with each other, it probably needs more work.
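The calibration check reduces to comparing agreement rates. A more rigorous version would use a chance-corrected measure like Cohen's kappa, but raw agreement on hypothetical verdicts illustrates the idea:

```python
def agreement(labels_a, labels_b):
    """Fraction of cases where two annotators gave the same verdict."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Hypothetical verdicts on the same five eval cases.
expert_1 = ["pass", "fail", "pass", "fail", "pass"]
expert_2 = ["pass", "fail", "fail", "fail", "pass"]
judge    = ["pass", "pass", "fail", "fail", "pass"]

human_baseline = agreement(expert_1, expert_2)  # inter-expert agreement
judge_score    = agreement(judge, expert_1)     # judge vs. expert
judge_needs_work = judge_score < human_baseline
```

The key design point is that inter-expert agreement sets the bar: you cannot reasonably ask the judge to agree with experts more often than experts agree with each other.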

At Sixfold, we relied on LLM-as-judge too early and without calibration. It gave us plausible-looking scores and we assumed it was working. It wasn't. We had false confidence for months. Lesson learned: calibrate, and re-calibrate periodically as your domain evolves and you swap models. A judge calibrated against GPT-4 traces may not hold up when you switch to Opus.

Stage 3: Ongoing

The system is now in production and the eval infrastructure is formalized. This is where evals and the flywheel start to converge.

CI/CD with the canonical eval set. Every code change, every prompt change, every model swap runs against the eval suite.
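The CI gate can be a pass-rate check over the canonical set. This is a sketch under assumptions: the threshold of 0.9 is arbitrary, the toy system under test is `str.upper`, and a real gate would also compare against the previous build's score to catch regressions:

```python
def eval_gate(system, cases, threshold=0.9):
    """Run the canonical set and fail the build if the pass rate drops
    below the threshold."""
    passed = sum(1 for c in cases if system(c["input"]) == c["expected"])
    score = passed / len(cases)
    return {"score": score, "ok": score >= threshold}

# Toy system under test: uppercases its input.
cases = [
    {"input": "abc", "expected": "ABC"},
    {"input": "def", "expected": "DEF"},
    {"input": "ghi", "expected": "XYZ"},  # a deliberately failing case
]
result = eval_gate(str.upper, cases, threshold=0.9)
```

With one failing case out of three, the gate reports a score of roughly 0.67 and blocks the change.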

LLM judges as online monitors. A subset of your calibrated judges run against production traffic in real time. They can catch regressions and quality drift before users report them. Not perfect, but a good early warning system.

Batch scanning for new failure patterns. Periodically scan production traces (or samples) for patterns that your current eval suite does not cover. New failure modes surface here. They become new eval cases, which expand your canonical set, which catches more failures, which generates more signal for the flywheel. This is the compounding loop you want.
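The batch scan can be framed as a set difference: which flagged production traces does no existing failure-mode detector recognize? Those are the candidates for new canonical cases. The trace fields and the detector here are assumptions for illustration:

```python
def find_uncovered(traces, detectors):
    """Traces that were flagged (by users or triage) but that no
    existing failure-mode detector fires on: likely new failure modes
    and candidates for manual review."""
    return [
        t for t in traces
        if t["flagged"] and not any(detector(t) for detector in detectors)
    ]

detects_empty_output = lambda t: t["output"] == ""
traces = [
    {"id": 1, "flagged": True,  "output": ""},         # known failure mode
    {"id": 2, "flagged": True,  "output": "garbled"},  # uncovered: new mode
    {"id": 3, "flagged": False, "output": "fine"},
]
new_candidates = find_uncovered(traces, [detects_empty_output])
```

Everything this scan surfaces feeds back into stage 1's annotation loop, which is how the canonical set keeps growing.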

Sampling strategy matters. Random sampling gives you coverage of common cases but can miss rare failures. It is worth biasing your sampling toward low-confidence outputs, user corrections, and edge cases. This ties directly to the triage pipeline from the flywheel: the same signals that route traces to human annotators can also feed your eval expansion process.
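Biased sampling can be implemented as weighted random choice. The specific weights below are made up for illustration; the point is only that corrected and low-confidence traces get picked more often than random traffic:

```python
import random

def biased_sample(traces, k, rng=None):
    """Sample production traces for review, over-weighting
    low-confidence outputs and user corrections."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility

    def weight(t):
        w = 1.0
        if t["confidence"] < 0.5:
            w += 3.0   # illustrative bonus for low confidence
        if t["user_corrected"]:
            w += 5.0   # illustrative bonus for user corrections
        return w

    weights = [weight(t) for t in traces]
    return rng.choices(traces, weights=weights, k=k)

traces = [
    {"id": "a", "confidence": 0.9, "user_corrected": False},  # weight 1
    {"id": "b", "confidence": 0.3, "user_corrected": True},   # weight 9
]
sample = biased_sample(traces, k=1000)
```

The same weight function can double as the triage score that decides which traces go to human annotators, keeping the two pipelines consistent.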

Single Task vs. Multi-Step Workflows

Everything above assumes a single task: one input, one output, evaluate the result. That's the foundation, but most production agentic systems are multi-step workflows. An insurance copilot doesn't just answer a question. It reads a submission, extracts structured data, cross-references against guidelines, scores the risk, and generates a summary. Each step can fail independently, and failures in early steps cascade.

You need two lenses here: black box and white box.

Black Box: End-to-End Evals

Treat the entire workflow as a function. Input goes in, final output comes out, evaluate the result against ground truth. The three stages above (bootstrapping, productization, ongoing) apply directly. This is your primary quality signal because it measures what the user actually sees.

The tradeoff: when an end-to-end eval fails, you don't know where it failed. Was the data extraction wrong? Did it pick the right tool but parameterize it badly? Did it handle an error in step 3 by silently continuing instead of retrying? Black box evals tell you something is broken. They don't tell you what.

White Box: Evaluating the Internals

This is where you decompose the workflow and evaluate individual pieces. For a step-based workflow, that means evaluating each step's output given its input. For agentic flows with more autonomy, the components are different and worth thinking about separately:

Tool selection. Did the agent pick the right tool for the task? In our experience this was a common point of failure. An agent that calls a web search when it should query a database will produce plausible but wrong results. You can eval this with ground truth traces where the correct tool is known.
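Grading tool selection against a ground-truth trace can be mechanical: compare the agent's first tool invocation with the known-correct tool. The trace shape here is an assumption:

```python
def eval_tool_selection(trace, expected_tool):
    """Grade the agent's first tool choice against a trace where the
    correct tool is known from ground truth."""
    calls = trace.get("tool_calls", [])
    if not calls:
        return {"passed": False, "reason": "no tool invoked"}
    chosen = calls[0]["tool"]
    return {"passed": chosen == expected_tool, "chosen": chosen}

# The agent reached for web search when it should have queried the policy DB.
trace = {"tool_calls": [{"tool": "web_search", "args": {"q": "TIV limits"}}]}
result = eval_tool_selection(trace, expected_tool="policy_db_query")
```

A fuller version would also grade later calls in the trace, since agents sometimes recover after a wrong first pick.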

Parameterization. The agent picked the right tool, but did it call it correctly? Wrong filters, missing fields, malformed queries. At Sixfold, we ran into this with document extraction: the model would call the right API but pass parameters that silently returned incomplete data. Comparing generated parameters against known-good calls for the same input is one way to catch this, though coverage can be tricky to get right.
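The comparison against known-good calls can be expressed as a structured diff of the parameter dictionaries. The parameter names below are hypothetical:

```python
def param_diff(generated, known_good):
    """Diff generated tool-call parameters against a known-good call
    for the same input: missing, wrong-valued, and extra fields."""
    missing = sorted(k for k in known_good if k not in generated)
    wrong = {k: (generated[k], known_good[k])
             for k in known_good
             if k in generated and generated[k] != known_good[k]}
    extra = sorted(k for k in generated if k not in known_good)
    return {"missing": missing, "wrong": wrong, "extra": extra}

diff = param_diff(
    generated={"doc_id": "sub-17", "pages": "1-3"},
    known_good={"doc_id": "sub-17", "pages": "all", "include_tables": True},
)
```

A missing `include_tables`-style flag is exactly the silent-incomplete-data failure: the call succeeds, so nothing errors, and only the diff reveals the gap.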

Error handling. What does the agent do when a tool call fails or returns unexpected results? Does it retry with adjusted parameters? Fall back to an alternative? Or hallucinate an answer and keep going? This is hard to eval with static test cases. Fault injection (deliberately breaking tool calls and evaluating recovery behavior) can help, though it adds complexity to your test infrastructure.
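Fault injection can be as lightweight as a wrapper that fails the first N calls to a tool and then behaves normally, so you can evaluate the agent's recovery path. Both the wrapper and the retrying agent step below are illustrative sketches:

```python
class FlakyTool:
    """Wrap a tool so its first `failures` calls raise, letting you
    evaluate how the agent recovers from transient tool errors."""
    def __init__(self, tool, failures=1):
        self.tool = tool
        self.failures = failures

    def __call__(self, *args, **kwargs):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("injected fault")
        return self.tool(*args, **kwargs)

def agent_step_with_retry(tool, query, max_attempts=3):
    """A well-behaved agent retries on tool failure; a badly behaved
    one swallows the error and hallucinates. That difference is what
    the fault-injection eval grades."""
    for _ in range(max_attempts):
        try:
            return tool(query)
        except RuntimeError:
            continue
    return None

flaky = FlakyTool(lambda q: f"result for {q}", failures=1)
answer = agent_step_with_retry(flaky, "policy 123")
```

The eval then asserts on the recovery behavior itself: did the agent retry, fall back, or fabricate an answer after the injected fault?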

Context management. In multi-step flows, the agent accumulates context across steps. Is it carrying the right information forward? Is it dropping critical details? Is it getting confused by conflicting information from different steps? Checking intermediate state at each step boundary is one approach. This connects to my meta-rule about interfaces: the boundaries between steps carry the most information, and that is where measurement tends to be most valuable.
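Checking intermediate state at a step boundary can start as a simple required-keys assertion on the accumulated context. The required fields here are hypothetical examples from an insurance-style workflow:

```python
REQUIRED_CONTEXT = {"policy_id", "locations", "risk_factors"}

def check_step_boundary(step_name, context):
    """White-box check at a step boundary: is the accumulated context
    still carrying everything downstream steps need?"""
    missing = sorted(REQUIRED_CONTEXT - context.keys())
    return {"step": step_name, "ok": not missing, "missing": missing}

# The risk factors were dropped somewhere before the scoring step.
report = check_step_boundary(
    "risk_scoring",
    {"policy_id": "P-9", "locations": ["NYC", "Boston"]},
)
```

Because these checks sit at the interfaces between steps, a failure report localizes the problem to the step that dropped or mangled the data, which is precisely what black box evals cannot do.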

Wiring Black Box and White Box Together

In practice you run both. Black box evals are your smoke test: is the system working end-to-end? White box evals are your diagnostic tool: when it fails, where exactly did it break?

A typical workflow: an end-to-end eval fails, you look at the trace, and white box evals on the individual steps pinpoint the failure to, say, parameterization at step 1. That becomes a targeted eval case and a targeted fix. Without the white box layer, you end up staring at a failed end-to-end result trying to guess what went wrong.

One practical note: it is probably not worth building white box evals for every component on day one. Start with black box. When you see recurring failures, use the traces to identify which component is responsible, and build white box evals there. Let production failures guide where you invest. Edge cases are fractal, and you want to zoom in at the right resolution.

How Evals and the Flywheel Wire Together

The flywheel surfaces new failure patterns from production. Those patterns become new eval cases in the canonical set. The expanded evals catch more failures. More caught failures generate more signal for the flywheel. That compounding only works when both systems are operational and connected.

In my experience, teams that build evals without a flywheel end up with a static test suite that catches yesterday's bugs. Teams that build a flywheel without evals have no way to tell if their improvements are actually improving anything. Both are common failure modes.

The Bottom Line

Evals are not a one-time quality gate. They are a living system that matures through stages, starting with messy manual annotation and ending with automated monitors wired to a feedback loop. For multi-step and agentic workflows, you need both black box and white box: end-to-end to know if things work, component-level to know where they break. The bootstrapping phase is the hardest and probably the most important. Let the failure taxonomy emerge from real data. Calibrate your judges against human ground truth. And wire the whole thing to the flywheel so your eval coverage compounds alongside your system's performance.

You cannot improve what you cannot measure. But measuring alone is not enough. You need the loop.