
January 8, 2026

Lessons from Building Real-World AI Applications

Tags: AI engineering, lessons learned, evals, production AI, methodology


I've spent the past three years building AI systems in the trenches — first training deep learning models at Pristine, then as the first AI engineer at Sixfold.AI building an insurance copilot, and now running my own consulting practice. Along the way I witnessed the field's explosive growth firsthand: from prompt engineering's prime time in late 2023, to inference-time reasoning, to agents becoming real production systems. Now, in 2026, AI has firmly established itself as more than shiny demos and chatbots: it is technology that can power real businesses across many verticals.

I want to share what I've learned — the hard-won lessons, the meta-rules I've developed, and a methodology I now use to build every AI system.

The Meta-Rules

The field moves so fast that specific practices become obsolete, or even counterproductive, within months. Some of the tasks we painstakingly decomposed into subtasks for GPT-4 in late 2023 can be one-shot by Opus 4.6 today. The bottom line: you need the right mental model, built from first principles, of what LLMs are good at and what they are not. You almost have to operate from meta-rules, and here are mine:

1. Never outsource your understanding

Conceptually, you need a big picture of the things you are working on. Without that you are driving blind. This does not mean you need to understand every detail — it means you need to understand at the level where you can make sound decisions. Delegate the rest, but own the structure.

2. All the information is at the boundary

There are many facets to this simple but profound statement. Let me unpack it because this is the most important meta-rule I have:

Edge cases are fractal. LLMs are remarkably good at common scenarios — the happy cases, the routine ones. It's the edge cases that make or break an AI system. But here is the key: everything in nature is hierarchical, if not outright fractal. Zoom into one edge case and there is another layer of common cases plus edge cases within it. Your job is to find the right scale, the right resolution.

The information is in the relationships. If you know a bit about category theory, you will recognize this right away: it's all about the interconnections, mappings, and transformations — especially the structure-preserving ones. Translated into a practical rule: always try to articulate the structure and invariants first, then let LLMs help you fill the gaps. This is why I obsess over schemas, type safety, and contracts between components. The LLM is the flexible, probabilistic engine; the structure around it is what makes the system reliable.

Interfaces carry the most information. The boundaries between components — agents, sub-agents, tools, whatever you call them — are where the important information lives. Get the interfaces right and the internals become much easier to get right. Get them wrong and no amount of clever prompting will save you.

3. Be prepared to leapfrog, or to be leapfrogged

"Stay humble and keep building" could not be more true in the age of AI. The technology you bet on today may be irrelevant tomorrow. The crude prototype someone hacks together next month may obsolete your carefully architected system. This is not a reason to stop building — it's a reason to build with replaceability in mind.

The Hard-Won Lessons

With those meta-rules as backdrop, here are the concrete, battle-tested lessons from building production AI systems. I'm going to frame these through the methodology I now use for every engagement — a 4-pillar approach I developed after leaving Sixfold, born directly from what went wrong and what went right.

Pillar 1: Infrastructure

The lesson: Harness matters — more than you think, and earlier than you think.

At Sixfold, when a new model dropped, our CEO would ask the obvious question: "Can we swap that model in?" We couldn't give him a quick answer. We didn't have the harness to run quick experiments. Swapping a model meant touching code in multiple places, rerunning manual tests, and hoping nothing broke in ways we wouldn't notice until production.

This is a trap I see everywhere. Teams jump straight to building features on top of LLMs without investing in the scaffolding that makes iteration possible. The infrastructure pillar is about building that scaffolding first:

  • Model abstraction layer. You should be able to swap models with a configuration change, not a code change. This sounds obvious until you realize how many teams have model-specific prompt formatting, output parsing, and error handling scattered throughout their codebase.
  • Prompt management. Prompts are not just strings; they are a core part of your system's logic. Version them, test them, and track which version produced which output. When I was at Sixfold, we treated prompts as afterthoughts embedded in code. That made debugging a nightmare.
  • Observability from day one. At Sixfold, Datadog was our data platform. It captured logs, but extracting the actual prompts we sent to LLMs from it was awkward. Frameworks like Langsmith or Langfuse didn't exist yet, but in hindsight we should have built that observability into our infrastructure from the start. If you cannot see exactly what went into the LLM and what came out, you are debugging in the dark.
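To make the model abstraction idea concrete, here is a minimal sketch of the pattern: application code depends on a single `complete()` interface, and the concrete provider is chosen by configuration. The provider names and the stubbed adapters are illustrative assumptions, not any particular SDK's API.

```python
# A minimal sketch of a model abstraction layer: callers depend on one
# `complete()` interface, and the concrete model is chosen by config, not code.
# The provider names and stubbed adapters below are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelConfig:
    provider: str        # e.g. "openai", "anthropic" — swapped via config
    model: str
    temperature: float = 0.0

# Each adapter normalizes one provider's API to a single (prompt -> text) shape.
def _openai_adapter(cfg: ModelConfig, prompt: str) -> str:
    return f"[{cfg.model} via openai] {prompt}"   # a real API call would go here

def _anthropic_adapter(cfg: ModelConfig, prompt: str) -> str:
    return f"[{cfg.model} via anthropic] {prompt}"

ADAPTERS: dict[str, Callable[[ModelConfig, str], str]] = {
    "openai": _openai_adapter,
    "anthropic": _anthropic_adapter,
}

def complete(cfg: ModelConfig, prompt: str) -> str:
    """The one entry point application code is allowed to call."""
    return ADAPTERS[cfg.provider](cfg, prompt)

# Swapping models is now a config change, not a code change:
cfg = ModelConfig(provider="anthropic", model="claude-sonnet")
print(complete(cfg, "Summarize this policy."))
```

The point is the shape, not the adapters: when a new model drops, answering "can we swap it in?" becomes editing one config value and rerunning the eval suite.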

The meta-rule at work here is "be prepared to leapfrog." If your infrastructure makes experimentation cheap, you can ride every new wave instead of drowning in it.

Pillar 2: Domain Knowledge

The lesson: The LLM is the easy part. Understanding the problem is the hard part.

At Pristine, I tried to model shopping trips and transactions as a language model — trips as paragraphs, transactions as sentences, items as words. It didn't quite work out. But the exercise taught me something crucial: the mapping between your domain and the model's capabilities is where most of the real engineering happens.

At Sixfold, we were building an insurance copilot. The hard problems were never about the LLM itself — they were about web scraping and text extraction from messy PDFs, about understanding insurance domain logic well enough to decompose complex underwriting tasks into steps the model could handle. One of the first technical decisions I made was to adopt the instructor package for structured output. Another was to write our own map-reduce loop instead of throwing entire document lists at GPT-4. These were not LLM decisions — they were domain decisions informed by understanding what the model could and could not do at that time.
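The map-reduce loop mentioned above can be sketched as follows: summarize each document independently (map), then fold batches of partial summaries together (reduce), instead of stuffing the whole document list into one context window. The `llm` callable is a stand-in for a real LLM call; only the looping pattern itself is the point.

```python
# A sketch of the map-reduce pattern: one focused LLM call per document (map),
# then repeated folding of partial results (reduce). `llm` is a stand-in for a
# real model call; batch_size and the prompt wording are assumptions.
from typing import Callable

def map_reduce_summarize(
    documents: list[str],
    llm: Callable[[str], str],
    batch_size: int = 4,
) -> str:
    # Map: each prompt stays small, testable, and within context limits.
    partials = [llm(f"Summarize: {doc}") for doc in documents]
    # Reduce: fold batches of partial summaries until a single one remains.
    while len(partials) > 1:
        batches = [partials[i:i + batch_size] for i in range(0, len(partials), batch_size)]
        partials = [llm("Combine: " + " | ".join(b)) for b in batches]
    return partials[0]

# Usage with a trivial stand-in "model" that just echoes its input:
fake_llm = lambda prompt: prompt.split(": ", 1)[1][:40]
print(map_reduce_summarize(["doc one text", "doc two text"], fake_llm))
```

Note that the reduce step is hierarchical: with many documents, it runs multiple rounds, which keeps every individual prompt bounded regardless of corpus size.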

This is where the meta-rule "all the information is at the boundary" hits hardest. The boundary between your domain and the LLM's capabilities is where the real engineering decisions live:

  • Encode domain invariants explicitly. Don't rely on the LLM to implicitly understand your business rules. Make them explicit in schemas, validation logic, and type constraints. This is structure-preserving transformation in practice.
  • Decompose at the right resolution. Remember, edge cases are fractal. The art is finding the level of decomposition where the LLM can handle each piece reliably. Too coarse and the model fails on complexity. Too fine and you are doing the model's job for it, plus drowning in integration complexity.
  • Invest in data extraction and preparation. The unglamorous work of getting clean, structured data into your system is usually the bottleneck, not the LLM call itself. We spent more engineering time on PDF extraction at Sixfold than on prompt engineering.
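As a concrete illustration of encoding domain invariants at the boundary, here is a minimal sketch using a plain dataclass: untrusted LLM output is parsed into a validated domain object or rejected loudly. The field names and rules are hypothetical insurance-flavored examples, not Sixfold's actual schema.

```python
# A sketch of encoding domain invariants explicitly rather than trusting the
# LLM to respect them implicitly. Field names and rules are hypothetical
# insurance-flavored examples, not a real schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyQuote:
    premium_usd: float
    coverage_limit_usd: float
    deductible_usd: float

    def __post_init__(self) -> None:
        # Invariants checked at the boundary, every time LLM output is parsed.
        if self.premium_usd <= 0:
            raise ValueError("premium must be positive")
        if self.deductible_usd >= self.coverage_limit_usd:
            raise ValueError("deductible must be below the coverage limit")

def parse_llm_output(raw: dict) -> PolicyQuote:
    """Turn untrusted LLM JSON into a validated domain object, or fail loudly."""
    return PolicyQuote(
        premium_usd=float(raw["premium_usd"]),
        coverage_limit_usd=float(raw["coverage_limit_usd"]),
        deductible_usd=float(raw["deductible_usd"]),
    )

quote = parse_llm_output(
    {"premium_usd": 1200, "coverage_limit_usd": 500000, "deductible_usd": 5000}
)
print(quote)
```

In practice a library like instructor, as mentioned above, does this parsing for you; the design choice that matters is that the invariants live in the schema, not in the prompt.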

Pillar 3: Evals

The lesson: Without evals, you are not engineering — you are guessing.

Our test data at Sixfold was thin, and evals were almost nonexistent for a long time. This tied directly to the infrastructure problem: we couldn't iterate fast because we couldn't measure. We also relied too heavily on LLM-as-a-judge, which gave us false confidence. The model would tell us our outputs were great, and then a domain expert would find obvious errors.

Evals are the connective tissue of the whole methodology. Without them, infrastructure is just overhead, domain knowledge stays tacit, and your data flywheel has no signal to optimize against. Here is how I think about evals now:

  • Start with failure modes, not metrics. Don't begin with "let's measure accuracy." Begin with "what are the ways this system fails?" Catalog real failures from production (or realistic simulations), then build evals that detect those specific failure modes. Generic metrics give you generic confidence.
  • Layer your evals. Unit-level evals test individual LLM calls. Integration evals test chains and agent loops. System-level evals test end-to-end behavior. You need all three, and they serve different purposes. Unit evals catch regressions fast. System evals catch emergent failures that unit evals miss.
  • Use LLM-as-a-judge carefully. It has its place — especially for subjective quality dimensions like tone or coherence. But for factual correctness and domain-specific accuracy, you need deterministic checks, domain expert review, or at minimum, carefully calibrated judge prompts with concrete rubrics. Our mistake at Sixfold was using LLM-as-a-judge as the primary eval method rather than one tool among several.
  • Evals are a living system. Every production failure should become a new eval case. Every model swap should trigger the eval suite. This is what turns evals from a one-time quality gate into a continuous improvement engine.
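The "start with failure modes" idea above can be sketched as a tiny harness: each eval case is named after a specific observed failure, and its check is a deterministic function rather than an LLM judge. The failure modes and the stubbed system under test are illustrative assumptions.

```python
# A minimal sketch of failure-mode-driven evals: each case encodes one specific
# observed failure, and checks are deterministic, not LLM-judged. The failure
# modes and the stubbed system under test are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str                      # named after the failure mode it detects
    prompt: str
    check: Callable[[str], bool]   # deterministic pass/fail on the output

def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    return {case.name: case.check(system(case.prompt)) for case in cases}

cases = [
    EvalCase(
        name="hallucinated_policy_number",
        prompt="Summarize a submission with no policy number.",
        check=lambda out: "POL-" not in out,   # must not invent an identifier
    ),
    EvalCase(
        name="missing_currency_unit",
        prompt="State the premium for a $1,200 policy.",
        check=lambda out: "$" in out or "USD" in out,
    ),
]

# Stub system under test; in practice this is your real pipeline end to end.
stub_system = lambda prompt: "The premium is $1,200. No policy number was provided."
print(run_evals(stub_system, cases))
```

The same harness runs at unit, integration, and system level; only the `system` callable and the case catalog change, which is what makes evals cheap enough to run on every model swap.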

Pillar 4: Data Flywheel

The lesson: The feedback loop is your competitive moat — if you actually build it.

At Sixfold, it was a struggle to get production issues back to AI engineers. The feedback loop was manual and spreadsheet-based, and very hard to work with. Real failures would surface through customer complaints, get logged in a support ticket, and maybe — maybe — make it to an engineer who could diagnose the root cause. By then the context was lost.

The data flywheel is what turns a static AI system into one that gets better over time. And it is the hardest pillar to get right because it's not a technical problem — it's an organizational one:

  • Capture production data systematically. Every LLM call, every user interaction, every correction and override — this is your raw material. The observability infrastructure from Pillar 1 feeds directly into this. You cannot build a flywheel without the data to spin it.
  • Close the loop between production and development. This means building pipelines that surface production failures, cluster them by root cause, and route them to the right people. The goal is to go from "a customer complained" to "here's a reproducible test case with full trace" in minutes, not days.
  • Use production data to generate evals. Real user behavior is the best source of eval cases. Synthetic data has its place for bootstrapping, but nothing beats the weird, messy, adversarial inputs that real users produce. This is how Pillar 4 feeds back into Pillar 3.
  • Fine-tune when the data justifies it. Most teams fine-tune too early or for the wrong reasons. The flywheel gives you the data and evals to know when fine-tuning will actually help — when you have a clear, measurable gap that prompt engineering cannot close, and enough high-quality examples to move the needle.
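A minimal sketch of closing that loop: log every LLM interaction as a structured trace, then turn flagged production failures into reproducible eval-case candidates automatically. The trace fields here are an assumption about what such a record might contain, not a real pipeline's schema.

```python
# A sketch of the flywheel's plumbing: append every LLM interaction to a JSONL
# trace log, then convert flagged failures into candidate eval cases with full
# context. The trace fields are assumptions, not a real pipeline's schema.
import json
from pathlib import Path

def log_trace(path: Path, prompt: str, output: str, flagged: bool = False) -> None:
    """Append one LLM interaction to a JSONL trace log."""
    record = {"prompt": prompt, "output": output, "flagged": flagged}
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

def failures_to_eval_cases(path: Path) -> list[dict]:
    """Every flagged trace becomes a candidate eval case with full context."""
    cases = []
    for line in path.read_text().splitlines():
        record = json.loads(line)
        if record["flagged"]:
            cases.append({"input": record["prompt"], "bad_output": record["output"]})
    return cases

log_path = Path("traces.jsonl")
log_trace(log_path, "Extract the insured's name.", "John Smth", flagged=True)
log_trace(log_path, "Extract the premium.", "$1,200", flagged=False)
print(failures_to_eval_cases(log_path))
```

The flagging itself can come from user corrections, overrides, or support tickets; the point is that "a customer complained" arrives at the engineer's desk as a reproducible case with the exact prompt and output attached.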

Bringing It Together

These four pillars are not sequential — they are mutually reinforcing. Infrastructure makes experimentation cheap. Domain knowledge tells you what to build and how to decompose it. Evals tell you whether it's working. The data flywheel makes it better over time and feeds back into all three.

If I had to distill everything down to one sentence, it would be this: build the system around the LLM, not on top of it. The LLM is a powerful but unreliable component. Your job as an AI engineer is to build the structure — the boundaries, the interfaces, the feedback loops — that makes the whole system reliable.

The field will keep moving fast. Models will get better. New paradigms will emerge. But these fundamentals — invest in infrastructure, encode domain knowledge, measure relentlessly, and close the feedback loop — these will outlast any specific framework or model.

Stay humble, and keep building.