Methodology

Inside the Standard Core Battery: 700 Tests, 31 Days, One Score

March 17, 2026 · 10 min read

The Architecture of Trust

When we set out to design the Standard Core Battery, the central evaluation instrument of the Pipkin Framework, we confronted a fundamental tension. A trust rating must be comprehensive enough to capture the full range of an agent's behavior, yet standardized enough to allow meaningful comparison across agents. It must be rigorous enough to withstand scrutiny from technical audiences, yet interpretable enough to inform procurement decisions. And it must be resistant to gaming, a non-trivial requirement when the subjects of evaluation are, by definition, systems designed to optimize for measured outcomes.

The result is the Standard Core Battery: over 700 individual test items, administered across a 31-day evaluation period, producing a single composite score on a 0-100 scale. This article describes how it works.

Test Architecture: Four Phases

The Standard Core Battery is organized into four sequential phases, each designed to evaluate different dimensions of agent behavior. The phases are administered in a fixed order because each phase builds on information gathered in the preceding phase.

Phase 1 is the Baseline Assessment. This phase comprises over 200 requests that represent the ordinary operating conditions an agent would encounter in production. These are not trick questions or edge cases. They are the kinds of tasks (straightforward queries, standard operations, routine decision points) that constitute the vast majority of an agent's workload. The purpose of the baseline phase is twofold: to establish the agent's performance under normal conditions, and to create a behavioral profile that informs the design of subsequent test phases.

The baseline phase is deliberately extensive because it serves as the foundation for statistical reliability. A sample of 200 or more baseline interactions provides sufficient data to distinguish genuine performance patterns from random variation. It also reveals consistency, or the lack thereof. An agent that performs well on 180 of 200 baseline tasks and falls only slightly short on the rest is in a fundamentally different position than one that performs well on 190 but fails catastrophically on the remaining 10.
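To make the sample-size argument concrete, here is a minimal sketch of how the uncertainty around an observed pass rate shrinks as the number of baseline items grows. The article does not say which statistical method the framework uses; the Wilson score interval and the function name wilson_interval are assumptions chosen for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence. This is an
    illustrative choice, not a documented part of the framework.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Same observed pass rate (90%), very different certainty.
small_lo, small_hi = wilson_interval(18, 20)
large_lo, large_hi = wilson_interval(180, 200)
print(f"n=20:  {small_lo:.3f} to {small_hi:.3f}")
print(f"n=200: {large_lo:.3f} to {large_hi:.3f}")
```

With the same 90% observed pass rate, the interval at 200 items is several times narrower than at 20 items, which is the sense in which a larger baseline separates genuine patterns from noise.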

Phase 2 introduces Edge Cases. This phase comprises approximately 50 test items that push the agent toward the boundaries of its designed operating envelope. Edge cases include ambiguous inputs, conflicting instructions, incomplete information, requests that are technically within scope but unusual, and scenarios that require the agent to recognize and communicate uncertainty. The edge case phase is where Boundary Discipline and Auditability scores are most heavily informed. An agent that handles routine tasks well but falters when conditions are non-standard reveals important information about the robustness of its underlying design.

Phase 3 is Failure Injection. This phase comprises 20 carefully designed scenarios in which the agent is deliberately subjected to conditions that should trigger failure modes. Inputs are corrupted. Dependencies are removed. Contradictory instructions are provided. The agent is placed in situations where the correct behavior is to recognize the failure condition and respond appropriately, whether that means flagging an error, requesting clarification, reducing its scope of operation, or halting entirely. This phase is the primary driver of Failure Containment scores.

Phase 4 is Adversarial Testing. This phase deploys 41 standardized test vectors designed to evaluate the agent's resistance to deliberate manipulation. The vectors cover three categories: prompt injection (attempts to override the agent's instructions through crafted inputs), data poisoning (attempts to corrupt the agent's outputs by manipulating its information sources), and social engineering (attempts to manipulate the agent through conversational tactics, such as authority impersonation, urgency fabrication, or emotional manipulation). The adversarial phase is the most sensitive component of the Standard Core Battery, and its specific test vectors are not publicly disclosed to prevent pre-optimization.

Why 31 Days

The 31-day evaluation period is one of the most frequently questioned elements of the Pipkin methodology. In an industry accustomed to benchmark results generated in minutes or hours, a month-long evaluation seems extravagant. It is not.

The 31-day window serves three purposes. First, it captures temporal variation. AI agents, particularly those that are regularly updated, do not perform identically from day to day. An evaluation conducted over a single session captures a snapshot. An evaluation conducted over 31 days captures a trajectory. We have observed agents whose performance degraded measurably over the course of an evaluation period due to model updates, infrastructure changes, or load-dependent behavior. A single-session evaluation would miss these patterns entirely.

Second, the 31-day window enables the sequencing strategy described above. The four phases are not administered simultaneously. They are spaced across the evaluation period to prevent the agent from adapting to the evaluation context. Baseline tasks are interspersed with edge cases. Adversarial tests are distributed across multiple sessions rather than concentrated in a single block. This sequencing makes it substantially more difficult for an agent to detect that it is being evaluated and adjust its behavior accordingly.

Third, the extended window allows for repeated measurement. Critical test items are administered multiple times, at different points in the evaluation period, to assess consistency. An agent that produces the correct output on a given task three out of three times is evaluated differently than one that produces the correct output two out of three times, even though both may be credited with a “pass” on a single-administration test.
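The difference between a single-administration pass and a repeated-measurement record can be sketched as follows. The data layout (pairs of item identifier and pass/fail outcome) and the function name consistency_by_item are hypothetical; the article does not describe how repeated results are stored or weighted.

```python
from collections import defaultdict


def consistency_by_item(results):
    """Return {item_id: fraction of administrations passed}.

    `results` is an iterable of (item_id, passed) pairs, one pair per
    administration of an item. This layout is assumed for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # item_id -> [passes, administrations]
    for item_id, passed in results:
        counts[item_id][0] += int(passed)
        counts[item_id][1] += 1
    return {item: passes / total for item, (passes, total) in counts.items()}


# Made-up outcomes: both items pass on their first administration,
# but only one passes every time.
results = [
    ("item-a", True), ("item-a", True), ("item-a", True),
    ("item-b", True), ("item-b", True), ("item-b", False),
]
rates = consistency_by_item(results)
print(rates)
```

A single-administration test would credit both items identically; the per-item rate exposes the three-of-three versus two-of-three distinction described above.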

The Five Rotating Forms

The Standard Core Battery exists in five parallel forms, designated A through E. Each form covers the same evaluation dimensions and targets the same scoring rubrics, but uses different specific test items. At any given time, one form is in active use, and the rotation schedule is not publicly disclosed.

The purpose of rotating forms is anti-gaming. If a single fixed set of test items were used for every evaluation, it would be possible, indeed inevitable, for agent developers to optimize specifically for those items. This is the same problem that has plagued standardized testing in education for decades, and the solution is the same: maintain a pool of equivalent items and rotate among them.

Each form undergoes extensive calibration to ensure that scores are comparable across forms. A score of 74 on Form B must mean the same thing as a score of 74 on Form D. This calibration is achieved through anchor items (a subset of items that appear on all forms), statistical equating procedures, and periodic recalibration using data from completed evaluations.
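The article does not specify which equating procedure is used. As a simplified illustration of the general idea, the sketch below applies linear equating, which maps a score from one form onto another form's scale by matching the mean and standard deviation of the two score distributions. The function name linear_equate and the sample scores are made up, and anchor-item designs in practice typically rely on more elaborate methods than this.

```python
from statistics import mean, pstdev


def linear_equate(score, new_form_scores, ref_form_scores):
    """Map `score` from the new form onto the reference form's scale.

    Linear equating: a score that sits k standard deviations above the
    new form's mean maps to k standard deviations above the reference
    form's mean. Assumes the two groups of agents are comparable.
    """
    sd_new = pstdev(new_form_scores)
    if sd_new == 0:
        raise ValueError("new form scores have zero variance")
    slope = pstdev(ref_form_scores) / sd_new
    return mean(ref_form_scores) + slope * (score - mean(new_form_scores))


# Hypothetical raw scores: the new form runs about 5 points harder.
new_form = [60, 70, 80]
ref_form = [65, 75, 85]
equated = linear_equate(70, new_form, ref_form)
print(equated)
```

In this made-up case a raw 70 on the harder form is placed at 75 on the reference scale, which is the kind of adjustment needed for a given score to carry the same meaning on every form.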

Scoring and Calibration

Raw test results are not directly converted to Pipkin scores. Instead, they pass through a multi-stage scoring and calibration process designed to produce scores that are accurate, comparable, and resistant to measurement artifacts.

Each test item is scored on a rubric specific to the pillar or pillars it informs. Many items contribute to multiple pillars. For example, an edge case that tests whether an agent recognizes and communicates uncertainty contributes to both Boundary Discipline (did the agent stay within its competence?) and Auditability (did the agent make its reasoning transparent?).

Item scores are aggregated within each pillar using a weighted scheme that accounts for item difficulty, discrimination (how well the item distinguishes between high-performing and low-performing agents), and relevance to the pillar's core construct. This weighting is determined empirically and updated as the item pool grows.

Pillar scores are then combined using the framework's published weights: Decision Accuracy at 25%, Failure Containment at 25%, Boundary Discipline at 20%, Auditability at 15%, and Adversarial Resistance at 15%. The result is the composite Pipkin Score.
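The published weights lend themselves to a short worked example. The sketch below assumes that each pillar score is already on a 0-100 scale and that the combination is a plain weighted sum; the article gives the weights but not the exact formula, and the identifier names are hypothetical.

```python
# Published pillar weights from the framework; they sum to 1.0.
PILLAR_WEIGHTS = {
    "decision_accuracy": 0.25,
    "failure_containment": 0.25,
    "boundary_discipline": 0.20,
    "auditability": 0.15,
    "adversarial_resistance": 0.15,
}


def composite_score(pillar_scores):
    """Weighted sum of pillar scores (assumed to be on a 0-100 scale)."""
    missing = PILLAR_WEIGHTS.keys() - pillar_scores.keys()
    if missing:
        raise ValueError(f"missing pillar scores: {sorted(missing)}")
    for name in PILLAR_WEIGHTS:
        if not 0 <= pillar_scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(weight * pillar_scores[name] for name, weight in PILLAR_WEIGHTS.items())


# Made-up pillar scores for illustration.
example = {
    "decision_accuracy": 90,
    "failure_containment": 80,
    "boundary_discipline": 70,
    "auditability": 60,
    "adversarial_resistance": 50,
}
print(round(composite_score(example), 2))
```

For these illustrative inputs the arithmetic is 22.5 + 20 + 14 + 9 + 7.5 = 73, so strong Decision Accuracy and Failure Containment scores offset a weak Adversarial Resistance score only partially.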

Before a score is finalized, it undergoes a quality assurance review. This review checks for scoring anomalies (e.g., an unexpectedly large discrepancy between pillar scores), data integrity issues (e.g., test items that may have been affected by system outages during administration), and consistency with repeated measurements. If anomalies are identified, the relevant test items are re-administered or excluded, and the score is recalculated.
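One of the anomaly checks mentioned above, an unexpectedly large discrepancy between pillar scores, could be sketched as a simple spread test. The threshold of 40 points and the function name flag_pillar_discrepancy are purely hypothetical; the article does not state what counts as "unexpectedly large" or how the review is automated, if at all.

```python
def flag_pillar_discrepancy(pillar_scores, max_spread=40.0):
    """Return (flagged, spread) for a set of pillar scores.

    `spread` is the gap between the highest and lowest pillar score.
    `max_spread` is a hypothetical threshold chosen for illustration.
    """
    if not pillar_scores:
        raise ValueError("pillar_scores must not be empty")
    values = pillar_scores.values()
    spread = max(values) - min(values)
    return spread > max_spread, spread


# Made-up scores: a 45-point gap triggers a manual review flag.
flagged, spread = flag_pillar_discrepancy(
    {"decision_accuracy": 90, "adversarial_resistance": 45}
)
print(flagged, spread)
```

A flag of this kind would only prompt a closer look at the underlying items; whether they are re-administered or excluded remains a reviewer's decision as described above.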

The Five-Day Factual Accuracy Check

Before any rating is published, the evaluated agent's developer is given a five-day window to review the factual basis of the rating. This window is strictly limited in scope. Developers may flag factual errors, such as incorrect attribution of an output to the agent, or test administration errors that may have affected results. They may not challenge scoring methodology, pillar weights, or evaluative judgments.

Critically, the developer is not shown the Pipkin Score during the factual accuracy check. They receive a summary of factual claims made in the rating report and have the opportunity to correct inaccuracies. The score itself is computed after the factual accuracy check is complete and any corrections have been incorporated.

This process is modeled on the practices of financial credit rating agencies, which provide issuers with a similar factual review period before publishing ratings. It ensures accuracy without compromising independence.

What the Score Means

A Pipkin Score is a standardized, independent assessment of an AI agent's trustworthiness across five dimensions. It is not a guarantee of performance. It is not a prediction of future behavior. It is a structured evaluation of observed behavior across a comprehensive test battery administered under controlled conditions.

The score is designed to be useful for a specific purpose: informing trust decisions. When an enterprise procurement team asks whether a given AI agent is suitable for a specific deployment context, the Pipkin Score provides a standardized basis for that judgment. When a regulator asks how an organization verified the reliability of an AI system, the Pipkin rating provides documented evidence. When a developer seeks to understand how their agent compares to peers, the framework provides a common language for that comparison.

Seven hundred tests. Thirty-one days. Five pillars. One score. That is the Standard Core Battery.
