When building outstanding AI products, error analysis is one of the highest-ROI activities. It helps us deeply understand real-world performance, uncover “unknown unknowns” hidden in the data, and iterate and optimize with precision. As the Chief Product Officers of Anthropic and OpenAI have emphasized, systematic evaluation (evals) and error analysis are quickly becoming among the most important new skills for product developers.

Yet many teams still rely on ad-hoc vibe checks: fixing one or two issues based on intuition and then hoping no new bugs were introduced. This approach quickly becomes unsustainable as applications scale, leaving teams guessing about what is actually going wrong.

This guide aims to provide a structured, actionable framework to help product managers and developers escape this reactive state. We explain how to systematically identify, classify, and prioritize errors so you can confidently improve your product and raise the quality of your AI applications to new heights. Let’s dive into the structured process of moving from chaos to clarity.


Evaluation Begins with Rigorous Data Analysis

Before building any automated tests, we must first analyze data to gain deep insight into the real problems occurring in the application. This is the foundation of all evaluation work—and the key to making the entire evaluation system trustworthy. Many teams take a wrong turn here. You’ll be surprised at how much you can learn through this process, and like many others, you may even find the discovery itself fascinating.

Redefining “Evals”


First, we must broaden our understanding of “evaluation.” It’s not just about writing unit tests. Evaluation is a complete process for systematically measuring and improving AI applications, with data analysis at its core.

A common mistake is jumping straight into writing tests to verify pre-set assumptions. This often causes evaluations to drift away from actual product issues. When automated test results diverge from reality, teams quickly lose trust in the whole system and revert to an anti-eval mindset. The correct starting point is analyzing real application logs and discovering issues there.

The Irreplaceable Role of Human Experts


Another misconception: “We live in the AI era—why not let one AI evaluate another?” The answer: AI evaluators lack critical business context and domain knowledge.

For example, a real estate AI assistant might offer users a “virtual tour.” Another AI might judge this as a successful interaction because it delivered helpful information. But it doesn’t know that the company doesn’t actually provide this feature, which makes the response a severe hallucination and a poor user experience. Only domain experts (often product managers) can catch this.

Thus, in the initial stages of error analysis, human experts with domain knowledge are essential. They judge interactions from the product and user perspective, laying the foundation for trustworthy evaluation.

With these core concepts in mind, let’s move to the first operational step.


Step 1: Discover Unknown Issues via Open Coding

Open Coding is the starting point for error analysis. It’s an exploratory, unbiased method whose strategic value lies in uncovering issues you didn’t even know existed—rather than just testing known hypotheses.

Some see this manual step as “too time-consuming, impossible to scale.” That’s a misconception. Think of it as a one-time, high-ROI deep exploration that lays the foundation for all automation. Almost everyone who tries it becomes hooked, because it generates a wealth of unexpected insights.

Execution Model: The “Benevolent Dictator”

To ensure efficiency and consistency, this process should not be done by committee. Committees are slow, expensive, and discourage adoption. Instead, adopt the Benevolent Dictator model: appoint one deeply knowledgeable, trusted person—usually the product manager—to lead.

Coding Guidelines

  1. Sampling: Randomly select at least 100 interactions (traces) from your logs (a minimal sketch of this step follows the examples below).
  2. Review: Examine each one as if you were evaluating the user experience yourself.
  3. Record: Note the first error you see in each sample, then move on. Use informal but descriptive language. This is the “open code.” Don’t classify strictly yet—just write enough detail for someone else (or an AI) to understand later.

Examples:

  • Poor note: “Conversation was bad” (too vague).
  • Good note: “Should have escalated to a human but didn’t.”
  • Good note: “Multiple short user messages confused the dialogue flow.”
  • Good note: “Offered a virtual tour feature that doesn’t exist.”
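If your traces live in a JSONL export of your logs, the sampling and recording steps can be as simple as the sketch below. The file and field names ("traces.jsonl", "id", "messages") are illustrative assumptions; adapt them to however your logging tool exports data.

import csv
import json
import random

# Load traces from a JSONL export of application logs (file and field names are illustrative).
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

# Step 1: randomly select at least 100 interactions for open coding.
sample = random.sample(traces, k=min(100, len(traces)))

# Step 3: write them to a spreadsheet-friendly CSV with an empty "open_code" column
# that the reviewer fills in with one short, descriptive note per trace.
with open("open_coding.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["trace_id", "conversation", "open_code"])
    for trace in sample:
        writer.writerow([trace["id"], json.dumps(trace["messages"]), ""])

The tooling doesn’t matter: any setup that puts a random sample of real conversations in front of a human reviewer, with space for a free-form note, is enough.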

Stop Point: Theoretical Saturation

Continue until you stop seeing new error types—when you start feeling “bored.” That’s theoretical saturation. Typically, 100 samples are enough to surface most common issues.

Once you’ve collected raw notes, the next step is structuring them.


Step 2: Categorization with AI Assistance (Axial Coding)

Now we turn scattered open codes into structured, meaningful error categories—known in qualitative analysis as axial codes or failure modes.

Using LLMs for Clustering

This task suits large language models like Claude or ChatGPT. Provide all open codes and ask the model to cluster them into axial codes.

Prompt Example:

“I have a list of open codes from AI application logs. Please analyze and group them into axial codes representing major failure modes. If any note doesn’t fit, put it into a ‘none of the above’ category.”

Using the qualitative-research terms “open codes” and “axial codes” helps the model understand exactly what you want. Adding a “none of the above” bucket keeps the category set complete instead of forcing every note into a bad fit.
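As a minimal sketch, here is how that prompt could be sent programmatically using the Anthropic Python SDK; the model name and the handful of open codes shown are illustrative, and the same prompt works just as well pasted into a chat interface.

import anthropic

# A few open codes from Step 1 (illustrative; in practice, include all of them).
open_codes = [
    "Should have escalated to a human but didn't.",
    "Multiple short user messages confused the dialogue flow.",
    "Offered a virtual tour feature that doesn't exist.",
]

prompt = (
    "I have a list of open codes from AI application logs. "
    "Please analyze and group them into axial codes representing major failure modes. "
    "If any note doesn't fit, put it into a 'none of the above' category.\n\n"
    + "\n".join(f"- {code}" for code in open_codes)
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative; any capable model works
    max_tokens=2000,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)  # a draft clustering for the Benevolent Dictator to refine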

Human Refinement

LLM output is only a draft. As the Benevolent Dictator, review and refine:

  • Make each category specific and actionable.
  • Example: the LLM might propose “Capability limits,” which is too vague. Refine it into:
    • “Scheduling errors in property tours”
    • “Escalation failures”
    • “Output formatting errors”

Clear categories enable concrete fixes. The next step is to quantify how often each failure mode occurs so you can prioritize.


Step 3: Quantification & Prioritization

The goal here: transform qualitative categories into quantitative insights. This turns chaos into clarity and enables data-driven prioritization.

Pivot Table Counting

You don’t need complex tools—a spreadsheet suffices. Put traces and their axial codes in a table, then use a pivot table to count the frequency of each failure mode.

Example:

Failure Mode                  Occurrences
Dialogue flow issues          17
Poor escalation timing        12
Broken promises / follow-up   9
Output formatting errors      6
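The same count can be produced outside a spreadsheet with a few lines of pandas; the CSV name and column names below are illustrative assumptions about how you stored the axial coding from Step 2.

import pandas as pd

# One row per trace with its assigned axial code; file and column names are illustrative.
df = pd.read_csv("axial_coding.csv")  # columns: trace_id, failure_mode

# The programmatic equivalent of a spreadsheet pivot table: occurrences per failure mode,
# sorted so the most frequent (highest-priority) failure modes come first.
counts = df["failure_mode"].value_counts()
print(counts)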

Action Planning

  • Direct fixes: e.g., formatting errors can be solved with prompt optimization.
  • Evaluator candidates: e.g., “poor escalation timing” is subjective, requiring automated evaluators for ongoing monitoring.

Building Automated Evaluators: “LLM as Judge”

For subjective, complex failure modes, build LLM judges—AI evaluators that scale the human expert’s judgment criteria.

Key Principles

  1. Single Responsibility: One judge, one failure mode.
  2. Binary Output: True/False or Pass/Fail only. Avoid rating scales—they’re vague and hard to act on.
  3. Explicit Rules: List concrete criteria in the prompt.

Example (Escalation Failure Judge):

Task: Decide if the AI failed to escalate to a human when required.

Rules: Escalation should occur if—

  • User explicitly requests human.
  • Conversation loops / AI misunderstands.
  • Sensitive tenant issues (complaints/disputes).
  • Tool calls fail to provide data.
  • User requests same-day property visit.

Output: True = escalation failure occurred, False = no error.
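Below is a minimal sketch of how such a judge could be wired up as a reusable function, again using the Anthropic Python SDK; the model name, the strict one-word output format, and the parsing logic are illustrative choices, not the only way to do it.

import anthropic

JUDGE_PROMPT = """Task: Decide if the AI failed to escalate to a human when required.

Rules: Escalation should occur if:
- The user explicitly requests a human.
- The conversation loops or the AI misunderstands.
- The issue is a sensitive tenant matter (complaints or disputes).
- Tool calls fail to provide data.
- The user requests a same-day property visit.

Answer with exactly one word: True if an escalation failure occurred, False if not.

Conversation:
{conversation}
"""

client = anthropic.Anthropic()

def escalation_failure(conversation: str) -> bool:
    """Single-responsibility, binary-output judge for the 'escalation failure' mode."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=5,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(conversation=conversation)}],
    )
    return response.content[0].text.strip().lower().startswith("true")

Running this function over a labeled set of traces produces the judge verdicts you will validate in the next section.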

Validating Evaluators: “Who Validates the Validator?”


Evaluator reliability is crucial. If results aren’t trusted, decisions based on them are meaningless.

Process

  • Compare judge outputs with human-coded axial codes.
  • Use a confusion matrix to analyze:
    • True Positive (correct detection)
    • False Negative (missed errors)
    • False Positive (wrong alerts)
    • True Negative (correct ignores)

Focus on mismatches. Study false positives/negatives, then refine judge prompts until they closely align with human judgment.
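As a small sketch of this comparison, assuming you have a boolean human label and a boolean judge verdict for the same set of traces (the example lists below are made up), the four confusion-matrix cells and the overall agreement rate can be computed directly:

# One boolean per trace: True means the failure mode is present.
human_labels = [True, False, True, True, False, False, True, False]   # Benevolent Dictator's axial codes
judge_labels = [True, False, False, True, True, False, True, False]   # LLM judge verdicts

pairs = list(zip(human_labels, judge_labels))
tp = sum(h and j for h, j in pairs)            # true positives: correct detections
fn = sum(h and not j for h, j in pairs)        # false negatives: missed errors
fp = sum(j and not h for h, j in pairs)        # false positives: wrong alerts
tn = sum(not h and not j for h, j in pairs)    # true negatives: correct ignores

agreement = (tp + tn) / len(pairs)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}  agreement={agreement:.0%}")

The mismatched traces (the false positives and false negatives) are the ones worth reading in full before the next round of prompt refinement.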

This validation-optimization loop builds trust.


Creating a Continuous Improvement Flywheel

This guide outlines a structured process from chaos to clarity—ending the era of vibe checks and giving teams discipline and confidence. Importantly, it’s not a one-off task, but a self-reinforcing flywheel:

  1. Data analysis (open coding) reveals real-world failure modes.
  2. Insights guide evaluator (LLM judge) design.
  3. Evaluators run in production, generating new, precise data.
  4. New data feeds the next analysis cycle.

Redefining the Value of Evaluation

Evaluation isn’t just checking compliance with a PRD. It’s also a product discovery tool, surfacing new user needs you never imagined during planning.

The true goal is not perfect scores, but actionable insights—systematically and continuously improving quality and user experience.

This is the “secret moat” of top AI teams. Mastering this process builds the strongest competitive barrier for your product.

💡 This article is a summary of the video: https://www.youtube.com/watch?v=BsWxPI9UM4c