One of the patterns I keep circling back to while building with AI is the duality of it: AI is both what I'm building into the product and what I'm building with. Those two roles tend to get discussed separately. There's the "AI-powered features" conversation and the "AI-assisted development" conversation. But in practice they're tangled together in ways that turn out to be interesting.

This is a specific example of that tangle, and what I learned from pulling the threads apart.

The Setup

The app I'm building, Qupi, is a personal health and lifestyle tracker where users log entries in plain text or voice. The core AI capability is normalization: the user types something loose and unstructured like "slept maybe 5 hours, woke up with a headache," and the AI turns that into structured data. A sleep event with a duration. A headache event with an onset time. That normalization layer is what makes the app feel different from a form-based tracker. It's arguably the most important thing in the product.
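To make the contract concrete, here's a sketch of what that normalization might produce. The field names and event shapes are my illustrative assumptions, not Qupi's actual schema:

```python
# Loose text in, structured events out. All field names here are
# illustrative assumptions, not the app's real schema.
raw_entry = "slept maybe 5 hours, woke up with a headache"

normalized = [
    {"event": "Sleep", "data": {"duration_hours": 5, "precision": "approximate"}},
    {"event": "Headache", "data": {"onset": "on waking"}},
]
```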

For a while, this normalization logic lived inside the app's backend, buried in an API handler alongside database calls, auth checks, and response formatting. It worked, but when the AI got something wrong, debugging meant reproducing the full app flow just to see what happened. And improving the AI's behavior meant changing the prompt, deploying the app, testing manually, and hoping. Not exactly rigorous.

The decision I'm writing about here is extracting that AI capability into a standalone, independently testable module and building a proper evaluation harness around it.

Why Extract It

The normalization layer has a remarkably clean contract: text goes in, structured data comes out. No side effects, no database writes, no user sessions. It's the kind of thing that should be independently testable almost by definition, but the convenience of building it inline (right there in the handler where it was needed) meant it never was.

Extracting it means drawing a hard boundary: the normalization module knows nothing about the database, nothing about HTTP, nothing about authentication. It receives text and context, and returns structured results. The app layer handles everything else. Fetching the user's history, calling the module, storing the output. The module doesn't know and doesn't care that it's part of a web app.
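That boundary can be sketched as a signature. The names here are assumptions; the point is what the module does and does not get to see:

```python
from dataclasses import dataclass, field

# Sketch of the extracted module's boundary -- names are assumptions.
@dataclass
class UserContext:
    vocabulary: dict = field(default_factory=dict)  # user's established terms
    history: list = field(default_factory=list)     # recent entries, pre-fetched by the caller

def normalize(text: str, context: UserContext) -> list[dict]:
    """Turn one loose entry into structured events.

    No database handles, no request objects, no auth: the only inputs
    are the text and whatever context the app layer chose to pass in.
    """
    # ... prompt construction and model call would go here ...
    return []
```

The app layer fetches the history, builds the context, and stores the results; the module never learns it's part of a web app.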

This is a familiar pattern in software (extract, isolate, test) but it takes on a different character when the thing you're extracting is an AI capability. You're not just isolating a function; you're creating the conditions to treat AI behavior as something you can systematically observe and improve.

The Evaluation Harness

This is where it gets interesting. Once the normalization module is standalone, you can build tooling around it that would have been awkward or impossible when it was embedded in the app.

The harness I'm building has three modes:

Single-input mode. Feed it one entry from the command line, get back the full structured output. Useful for quick iteration: change the prompt, run an input, see what changed.

Batch mode. Point it at a file of test cases, each one an input string paired with the expected structured output, and run them all. You get a quantitative result: 47 of 50 cases passed, and here are the 3 failures, each with a diff showing what went wrong. This is the regression suite. Every time the AI gets something wrong in real usage, I add a test case with the correct output. The corpus grows over time, and it compounds. Each addition makes future changes safer.

Comparison mode. Run the same set of test cases through two different configurations (different prompts, different models) and diff the results side by side. This turns "is this prompt better?" from a subjective judgment into a measurable one.
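The core of batch mode fits in a few lines. This is a sketch; the JSON case format and function names are my assumptions:

```python
import json

def run_batch(cases_path, normalize_fn):
    """Run every case through normalize_fn; return (passed_count, failures)."""
    with open(cases_path) as f:
        cases = json.load(f)  # [{"input": "...", "expected": {...}}, ...]

    failures = []
    for case in cases:
        actual = normalize_fn(case["input"])
        if actual != case["expected"]:
            failures.append({"input": case["input"],
                             "expected": case["expected"],
                             "actual": actual})
    return len(cases) - len(failures), failures
```

Comparison mode is then just two calls to the same runner with differently configured normalize functions, with the two failure lists diffed against each other.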

Context Fixtures

One of the more useful decisions was building context fixtures into the harness. The AI's behavior depends on what it knows about the user: their established vocabulary, their logging history. A new user's "migraine" entry gets interpreted differently than the same word from someone who has been logging migraines for months and has established their own patterns.

So the test cases run against named context fixtures: new-user provides only the baseline vocabulary, established-user provides a richer set of terms and history. The same input can produce different (and correctly different) outputs depending on which fixture is active. This lets me test not just whether the AI gets the right answer, but whether it adapts appropriately to different user contexts.
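A minimal sketch of what those fixtures might look like. The vocabulary and history contents are illustrative assumptions, not the app's real data:

```python
# Named context fixtures; contents are illustrative assumptions.
BASELINE_VOCAB = {"sleep": "Sleep", "headache": "Headache"}

FIXTURES = {
    "new-user": {
        "vocabulary": dict(BASELINE_VOCAB),
        "history": [],
    },
    "established-user": {
        "vocabulary": {**BASELINE_VOCAB, "migraine": "Migraine"},
        "history": ["Migraine, 6h, after poor sleep", "Migraine, 4h"],
    },
}

# Each test case names the fixture it should run against:
case = {
    "fixture": "established-user",
    "input": "migraine again this morning",
    "expected": {"event": "Migraine"},
}
```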

Match Rules

A small but important detail: test cases specify how to match expected output, not just what to match. For the event type, I use exact matching. "Sleep" must be "Sleep." For extracted data fields, I use subset matching: the expected fields must be present and correct, but the AI is allowed to extract additional fields I didn't specify. This prevents tests from breaking when the AI gets better at extracting data. The tests catch regressions without penalizing improvements.
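The two rules can be expressed as one small comparison function. This is a sketch, and the rule names are mine:

```python
def matches(expected, actual, rule="exact"):
    """Compare expected against actual output under a named match rule."""
    if rule == "exact":
        return expected == actual
    if rule == "subset":
        # Every expected field must be present and equal, but the AI
        # may extract additional fields without failing the test.
        return all(actual.get(key) == value for key, value in expected.items())
    raise ValueError(f"unknown match rule: {rule}")
```

Exact matching guards the event type; subset matching guards the extracted fields, so a model that starts pulling out an extra field doesn't break every existing case.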

The Duality in Practice

Here's where the two uses of AI, the product feature and the development tool, start to reinforce each other.

The evaluation harness was designed using a Claude project set up as a product and technical advisor (the "thinking agent" I wrote about in The Multi-Agent Loop). The actual implementation will be handed to Claude Code. So AI is helping me design the infrastructure that I'll use to rigorously evaluate a different AI capability.

But it goes beyond that. The harness itself becomes a discovery tool, a concrete instance of the "discovery before prioritization" pattern from my earlier observations. Running the test corpus against a cheaper, faster model doesn't just tell me whether I can save money. It tells me where the quality boundary actually is: which kinds of inputs degrade gracefully and which ones fall apart. That's information I couldn't get by theorizing about it. I have to build the tool, run the tests, and see.

The feedback loop looks something like this: use AI to design the evaluation approach, use AI to build it, use the evaluation to improve the AI capability, and use what I learn to inform the next round of design. Each layer makes the others more effective.

What This Approach Gives You

A few things I think are worth naming:

Prompt engineering becomes evidence-based. Without the harness, changing a prompt is an exercise in vibes. You try it, it seems better, you ship it. With the harness, you run the full corpus before and after, and you can see exactly what improved and what regressed. The test cases don't eliminate judgment, but they give judgment something concrete to work with.

Model selection becomes empirical. The same infrastructure that tests prompt changes also tests model changes. Run the corpus against the expensive model and the cheap one, diff the results, and make a real decision about the tradeoff. I don't have to guess whether a smaller model is good enough. I can measure it.

The corpus compounds. Every bug I find in production becomes a test case. Over time, the corpus grows into a detailed map of what the AI handles well and where it struggles. New team members (or future-me, six months from now) can look at the test cases and understand the AI's behavioral boundaries without reading every line of prompt engineering.

The module becomes reusable. This wasn't the motivation for extracting it, but once the normalization capability is a clean, standalone module with a well-defined interface, it's no longer trapped inside one app's API handler. Whether that optionality matters is an open question (for now it's just the app that uses it), but the option exists in a way it didn't before.

The Meta-Observation

The thing that keeps striking me about this process is how naturally the "building with AI" and "building AI into the product" sides feed each other. The same discipline that makes AI tools effective for development (clear contracts, structured evaluation, evidence over intuition) is exactly what you need to build AI capabilities that are reliable and improvable.

I don't think that's a coincidence. AI capabilities, whether they're in your development workflow or your product, share a fundamental property: they're non-deterministic. You can't fully predict their behavior from their inputs. That means you need observation and measurement infrastructure that traditional software (deterministic, with behavior that follows predictably from its inputs) never required in quite the same way.

Treating AI behavior as something you observe, measure, and refine with purpose seems to apply equally whether you're the one using the AI or your users are.