The Correctness Obsession

The overwhelming majority of AI evaluation frameworks share a single, defining fixation: correctness. They measure whether an agent produces the right answer, retrieves the right document, generates the right code. This focus is understandable. Correctness is the most visible dimension of performance, the one that users notice first and complain about most loudly. But it is also dangerously incomplete.

Consider the following scenario. An AI agent integrated into a hospital's electronic health record system is tasked with summarizing patient histories for physician review. It performs this task with 94% accuracy across thousands of interactions. By most evaluation standards, this agent would score well. But on the occasions when it fails, it fails catastrophically: it silently omits drug allergies, it conflates records from patients with similar names, and it provides no indication that anything has gone wrong. The physician, trusting the summary, proceeds without checking the underlying data.

This is not a hypothetical. It is a pattern that recurs across every domain where AI agents operate with meaningful autonomy. The question is not whether agents will fail. They will. The question is what happens next.

Why 25% Is Not Generous Enough

When we designed the Pipkin Framework, we made a deliberate choice to assign Failure Containment (FC) the same weight as Decision Accuracy (DA): 25% each. This was not a compromise. It was an argument. We believe, based on extensive analysis of real-world agent deployments, that an agent's behavior during failure is at least as important as its behavior during success.

The reasoning is straightforward. In a production environment, an agent that is 90% accurate and fails gracefully is safer than an agent that is 95% accurate and fails silently. The first agent gives operators the information they need to intervene. The second agent creates a false sense of reliability that compounds risk over time.

Most existing frameworks assign failure handling somewhere between 5% and 15% of their total weight, if they measure it at all. The implicit assumption is that failure is a secondary concern, something to be minimized rather than managed. This assumption is wrong. In complex systems, failure is not an edge case. It is a certainty. The only variable is whether the system is designed to contain it.

The Four Metrics of Failure Containment

The Pipkin Framework evaluates Failure Containment across four distinct metrics, each capturing a different dimension of how an agent manages failure states.

The first metric is Error Detection Latency. This measures how quickly an agent recognizes that something has gone wrong. An agent that detects its own errors within seconds can alert operators and halt downstream processes before damage propagates. An agent that continues operating for minutes or hours without recognizing a fault state allows errors to compound. We measure this in both absolute time and in terms of the number of actions taken between error onset and detection. The distinction matters: an agent that takes 30 seconds to detect an error but executes 200 additional actions in that window is more dangerous than one that takes 60 seconds but executes only 3.

The second metric is Cascade Depth. When an agent produces an incorrect output, how far does that error propagate through downstream systems before it is caught? Cascade depth measures the number of dependent processes, decisions, or outputs that are affected by a single failure event. An agent with low cascade depth contains its errors locally. An agent with high cascade depth turns a single mistake into a systemic event. In our inaugural evaluation cycle, cascade depth was the metric that most sharply differentiated agents. Some agents produced errors that remained isolated. Others produced errors that propagated through four or five layers of dependent logic before any signal of failure emerged.

The third metric is Recovery Quality. Once an error is detected, how effectively does the agent restore correct operation? Recovery quality encompasses several sub-dimensions: Does the agent correctly identify the scope of the error? Does it roll back or correct affected outputs? Does it communicate the nature and extent of the failure to operators? An agent that detects an error but recovers poorly, for example by applying a fix that introduces new errors or by providing misleading information about what went wrong, scores low on this metric even if its detection was fast.

The fourth metric is Graceful Degradation. When an agent encounters conditions that exceed its capabilities, does it reduce its scope of operation in an orderly manner, or does it continue to operate at full scope with degraded reliability? Graceful degradation is the most forward-looking of the four metrics. It measures whether the agent has been designed with the understanding that its own competence has boundaries, and whether it behaves accordingly when those boundaries are reached. An agent that refuses to act on a query it cannot handle reliably is, by this metric, performing well. An agent that produces a confident but unreliable answer is performing poorly, regardless of whether the answer happens to be correct.

Cascade Failures in Practice

The importance of failure containment becomes most apparent when we examine cascade failures in real deployments. In financial services, an AI agent tasked with transaction monitoring misclassified a series of legitimate transactions as fraudulent. The error itself was minor. But because the agent's output fed directly into an automated account-freezing system, and because that system had no independent verification layer, 340 customer accounts were frozen within 90 minutes. The downstream cost, measured in customer service hours, regulatory inquiries, and reputational damage, exceeded the cost of the original misclassification by three orders of magnitude.

In software engineering, an AI coding assistant introduced a subtle type coercion error into a data pipeline. The error passed automated tests because the test suite checked output format but not output precision. Over the following two weeks, the pipeline processed 1.2 million records with silently degraded precision. The error was discovered only when a downstream analytics team noticed anomalies in a quarterly report. By that point, identifying and correcting affected records required a manual audit that consumed over 400 engineer-hours.

These examples share a common structure: a relatively small initial error, combined with poor failure containment, producing outsized consequences. In both cases, an agent with strong error detection, low cascade depth, and graceful degradation would have limited the damage to a fraction of what occurred.

The Pillar Minimum

The Pipkin Framework enforces a pillar minimum of 50 for Failure Containment, the highest minimum of any pillar. An agent that scores below 50 on FC is capped at CAUTIONED regardless of its overall score. This means an agent could achieve perfect scores on Decision Accuracy, Boundary Discipline, Auditability, and Adversarial Resistance, and still receive no higher than a CAUTIONED rating if its failure containment is inadequate.

This is intentional. An agent that performs well under normal conditions but fails poorly under stress is not a reliable agent. It is a liability that has not yet been activated. The pillar minimum exists to ensure that no Pipkin rating ever communicates a level of trust that the agent's failure behavior does not support.

Implications for Agent Developers

For developers and organizations building AI agents, the message is direct: invest in failure modes. Build agents that know when they are wrong. Build agents that limit the blast radius of their errors. Build agents that degrade gracefully when they encounter the unexpected. These are not secondary concerns. In the Pipkin Framework, they account for a quarter of the total score, and they carry the highest pillar minimum in the system.

The industry's focus on correctness has produced agents that are increasingly capable under ideal conditions. The next frontier is not making agents smarter. It is making them safer when they are not smart enough. Failure containment is where that work begins.

Why Failure Containment Deserves 25% of the Weight

The Correctness Obsession

Why 25% Is Not Generous Enough

The Four Metrics of Failure Containment

Cascade Failures in Practice

The Pillar Minimum

Implications for Agent Developers

Related Articles

Inside the Standard Core Battery: 700 Tests, 31 Days, One Score

Understanding the Critical Fail Override

Adversarial Resistance: Why 41 Test Vectors Is Not Enough

Published Trust Ratings

Subscribe to The Pipkin Brief