January 11, 2026
Data Flywheels for Agentic AI: The Pillar Nobody Builds
At Sixfold, our feedback loop was a spreadsheet. A customer would flag an issue, the issue would land in a support ticket, and maybe it would reach an engineer who could diagnose the root cause. By then the context was gone. We kept making the same mistakes because our system had no memory. Every invocation started from an almost blank slate.
That experience is why I now treat the data flywheel as one of my four pillars for production AI systems. It's the one that gets ignored most, and honestly the one I struggled with the most.
The Core Loop
Every agent interaction generates signal: what worked, what failed, where humans intervened, what patterns recur. A flywheel turns that signal into improvements that compound over time.
The loop has three stages:
Observe. Capture everything: tool calls, reasoning steps, human edits, final outputs. Raw traces are high-volume and noisy, but they are your raw material.
Feedback → Failure pattern discovery. This is the part most people miss. Human corrections are not just fixes, they are labeled training data for understanding _systematic_ failure modes. At Sixfold, if I had been able to see that 40% of our corrections were the same category of mistake, that would have changed our architecture, not just our prompts. You need automated detection of recurring patterns, not just logging.
Pattern → Knowledge extraction and architecture evolution. This is where patterns become actionable. Some turn into reusable knowledge artifacts (domain heuristics, failure signatures, best practices) that get stored as versioned skills. Others trigger structural changes: new tools, modified checkpoint logic, updated eval criteria. The key is distinguishing between inner loop fixes (add a skill, update a prompt) and outer loop changes (rethink the architecture).
Here is a litmus test I tend to use: if your "flywheel" cannot tell you what categories of mistakes the agent makes most frequently and what architectural changes would address them, it's not a flywheel. It's still a log.
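The litmus test can be made concrete: at minimum, a flywheel should be able to produce a frequency table of failure categories from labeled corrections. A toy sketch, with hypothetical category labels:

```python
from collections import Counter

# Failure categories assigned during human review (illustrative labels).
corrections = [
    "missed-flood-zone-check",
    "stale-policy-version",
    "missed-flood-zone-check",
    "tool-timeout",
    "missed-flood-zone-check",
]

counts = Counter(corrections)
top_category, n = counts.most_common(1)[0]
share = n / len(corrections)

print(f"{top_category}: {share:.0%} of corrections")
# → missed-flood-zone-check: 60% of corrections
# When one category dominates like this, that's an architecture signal,
# not a prompt tweak.
```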
Why Flywheels Get Neglected
The value is invisible and delayed. You can demo an agent doing a task. You cannot demo an agent doing a task slightly better because of what it learned three weeks ago.
There is no clean abstraction boundary. A flywheel touches everything: tracing, storage, retrieval, prompt construction, evaluation, human-in-the-loop flow.
The reflection step is genuinely hard. Extracting _what happened_ from traces is straightforward. Extracting _why it failed_ and _what generalizable lesson to store_ is close to an AI-complete problem. Your insurance copilot misclassifies a risk because it treated a coastal property's flood zone designation as standard, and an underwriter corrects it. What do you store? "Property at 123 Main St is in FEMA Zone AE" is too specific, that's memorization. "Be more careful about flood zones" is too vague, that's useless. The right lesson is something like "coastal properties within 1 mile of a waterfront require explicit flood zone verification before risk scoring, even when the submission doesn't flag it." Finding that level of abstraction is domain-dependent, context-dependent, and hard to automate.
Cold start kills momentum. The flywheel is least valuable at the beginning: you pay the full infrastructure cost while getting essentially zero benefit for weeks. Your knowledge store is empty, your failure taxonomy doesn't exist yet, your triage pipeline has no baseline to filter against. You have to bootstrap all of this manually: hand-label the first batch of corrections, write the initial skills from your own domain expertise, seed the taxonomy from gut feel. There is no shortcut here. It is genuinely hard to get started, and the only way through is to accept that the first iteration will be ugly and incomplete.
What Actually Works: A Practical Stack
The teams that get this right don't treat the flywheel as a separate system. They piggyback on the human-in-the-loop flow they already need. Every human correction is already happening; the additional cost is structured capture and extraction.
Here is the baseline stack I use in my consulting work now:
Pydantic AI for agent execution and type-safe structured outputs. Langfuse for trace capture and observability. Argilla for human annotation and dataset curation. Temporal for orchestrating the pipelines: triage, extraction, pattern discovery, anything that needs durability and retries.
These cover the basics. The gaps (trace-to-annotation routing, pattern discovery, knowledge extraction, runtime retrieval) are all custom work. No off-the-shelf solution exists, and I don't think one will for a while, since these pieces are so domain-specific.
For knowledge storage, I keep it simple: versioned markdown files organized as skills. Each skill is a SKILL.md that captures domain heuristics, failure patterns, and best practices for a specific scope of the system. Living runbooks that the agent reads before execution. No vector stores, no JSONB schemas, no semantic retrieval layer. Just files in a repo, version-controlled with git, human-readable and human-editable. Sounds too simple, but frontier models digest well-formatted markdown just fine. Start with files. Add sophistication when your evals tell you to.
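As a sketch of how that file-based store can be wired in (the directory layout and function name are my own, not a standard API): read every SKILL.md under a skills directory and prepend the contents to the system prompt before execution.

```python
from pathlib import Path


def load_skills(skills_dir: str) -> str:
    """Concatenate every versioned SKILL.md into one prompt preamble."""
    parts = []
    for path in sorted(Path(skills_dir).rglob("SKILL.md")):
        # Use the parent directory name as the skill's title.
        parts.append(f"## Skill: {path.parent.name}\n{path.read_text()}")
    return "\n\n".join(parts)


# Usage (assumes a repo layout like skills/flood-zones/SKILL.md):
# system_prompt = BASE_PROMPT + "\n\n" + load_skills("skills/")
```

Because it is just files in git, "updating the knowledge store" is a pull request, which keeps the human-editable property intact.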
For the triage pipeline (deciding which traces go to human review), you filter on signals: low confidence scores, human overrides, tool errors, weird latency spikes, user thumbs-down. You need to make annotators' jobs as easy as possible.
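A triage filter over those signals might look like this sketch; the thresholds and field names are illustrative, not prescriptive:

```python
# Each signal maps a trace dict to a boolean; thresholds are placeholders.
TRIAGE_SIGNALS = {
    "low_confidence": lambda t: t.get("confidence", 1.0) < 0.6,
    "human_override": lambda t: t.get("human_edit") is not None,
    "tool_error": lambda t: any(c.get("error") for c in t.get("tool_calls", [])),
    "slow": lambda t: t.get("latency_s", 0) > 30,
    "thumbs_down": lambda t: t.get("user_rating") == -1,
}


def triage(trace: dict) -> list[str]:
    """Return the signals that fired; route to human review if any did."""
    return [name for name, fired in TRIAGE_SIGNALS.items() if fired(trace)]


trace = {"confidence": 0.42, "tool_calls": [{"error": None}], "latency_s": 4}
print(triage(trace))
# → ['low_confidence']  — send to the annotation queue with context attached
```

Whatever fires here should arrive in the annotation tool with the full trace pre-loaded, so the annotator's job is a judgment call, not archaeology.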
Take home message: this is hard. You are doing R&D, not just engineering. There will be a lot of experiments, human labeling, pipelines that half-work and need rethinking. But the good news is that with models as capable as they are now, you can build those custom pieces fast. The glue code that would have taken weeks in 2023 takes days. The devil is in the details, and there are a lot of details, but the iteration speed is real. Keep building, keep iterating.
The Bottom Line
At Sixfold, not having a flywheel meant we kept solving the same problems. Now in my consulting work, I see the same pattern everywhere: teams with impressive demos that plateau because there is no mechanism for the system to learn from its own failures. This is the hardest of my four pillars to get right, because it is not purely technical; it is organizational. But if you treat it as a natural extension of the human-in-the-loop process you already need, and start with structured capture of corrections rather than the general case, you can get things moving.
Build it in from the start. I learned that the hard way.