On the publication date of the inaugural Pipkin ratings, four AI agents received their first independent trust evaluations under the Standard Core Battery: OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok. The results establish a baseline for the state of AI agent trust in 2026, and they tell a more nuanced story than any single number can convey. This article provides an analytical overview of the four ratings, the patterns they reveal, and what they suggest about the current trajectory of the industry.

A disclosure before proceeding. Pipkin was built with the assistance of Anthropic’s Claude. This fact is disclosed on the Pipkin About page and on Claude’s individual rating page. It does not influence the rating. The evaluation methodology, test battery, and scoring are applied identically to all agents. Pipkin’s independence policy prohibits advance disclosure of scores, and no agent developer received preferential treatment during the evaluation process. Readers should be aware of this relationship and draw their own conclusions about its relevance.

The four inaugural ratings are as follows. ChatGPT received a CAUTIONED rating with a Pipkin Score of 65. Claude received a VERIFIED rating with a Pipkin Score of 73. Gemini received a CAUTIONED rating with a Pipkin Score of 61. Grok received a FLAGGED rating with a Pipkin Score of 47. These are preliminary ratings based on the initial evaluation cycle and are subject to revision upon re-evaluation.

The first observation is the most important: no agent achieved a TRUSTED rating. The TRUSTED tier (85-100) represents the threshold at which an AI agent is considered safe for fully autonomous deployment without additional human oversight. The fact that the highest score among four of the world’s most prominent AI agents is 73 — twelve points below the TRUSTED threshold — is a significant finding. It suggests that the current generation of AI agents, despite impressive capabilities in many domains, has not yet demonstrated the comprehensive behavioral reliability that autonomous deployment requires.
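The tier arithmetic above can be sketched in a few lines of Python. The TRUSTED (85-100) and CAUTIONED (55-69) ranges are the ones stated in this article. The VERIFIED range of 70-84 and the treatment of every score below 55 as FLAGGED are assumptions inferred from the published ratings, not confirmed Pipkin definitions, and the names `TIER_FLOORS`, `tier_for`, and `gap_to_trusted` are hypothetical.

```python
# Tier floors, highest first.
# Stated in the article: TRUSTED is 85-100, CAUTIONED is 55-69.
# ASSUMPTION: VERIFIED fills the 70-84 gap, and anything below 55 is
# treated as FLAGGED (the framework may define further tiers).
TIER_FLOORS = [
    (85, "TRUSTED"),
    (70, "VERIFIED"),
    (55, "CAUTIONED"),
    (0, "FLAGGED"),
]

TRUSTED_FLOOR = 85

# Inaugural Pipkin Scores as published in this article.
SCORES = {"ChatGPT": 65, "Claude": 73, "Gemini": 61, "Grok": 47}


def tier_for(score: int) -> str:
    """Return the tier whose floor is the highest one not above the score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for floor, name in TIER_FLOORS:
        if score >= floor:
            return name
    raise AssertionError("unreachable: the lowest floor is 0")


def gap_to_trusted(score: int) -> int:
    """Points still needed to reach the TRUSTED floor (0 if already there)."""
    return max(0, TRUSTED_FLOOR - score)


for agent, score in SCORES.items():
    print(f"{agent}: {score} -> {tier_for(score)}, "
          f"{gap_to_trusted(score)} points below TRUSTED")
```

Under these assumed bounds, the sketch reproduces the four published tiers and the twelve-point gap between the top score and the TRUSTED threshold.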

The mean score across all four agents is 61.5, which falls squarely in the CAUTIONED range (55-69). This average reflects a population of agents that are functional and often impressive but that require active human oversight and safeguards in deployment. The sample standard deviation of approximately 10.9 points indicates meaningful differentiation across agents, which supports the discriminative power of the evaluation methodology, although four data points are a small basis for that conclusion. If all four agents had clustered within a few points of each other, it would suggest the battery lacked sufficient resolution.
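The summary statistics can be reproduced from the four published scores. The following is a minimal sketch using Python's standard `statistics` module. Because there are only four data points, the sample and population formulas give noticeably different spreads, so the sketch computes both. The variable names are illustrative.

```python
import statistics

# Inaugural Pipkin Scores as published in this article.
scores = {"ChatGPT": 65, "Claude": 73, "Gemini": 61, "Grok": 47}
values = list(scores.values())

mean_score = statistics.mean(values)

# Sample standard deviation divides by n - 1; population divides by n.
sample_sd = statistics.stdev(values)
population_sd = statistics.pstdev(values)

print(f"mean           = {mean_score:.1f}")
print(f"sample sd      = {sample_sd:.2f}")
print(f"population sd  = {population_sd:.2f}")
```

Which formula is appropriate depends on whether the four agents are treated as the whole population of interest or as a sample of a wider field. The article does not say which convention Pipkin uses.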

Examining the pillar-level data reveals where agents diverge most significantly. Decision Accuracy (DA) showed a moderate range across the four agents. All four demonstrated competence in core reasoning tasks, with scores varying by approximately 24 points from lowest to highest. This is consistent with the rapid improvement in benchmark performance that the industry has demonstrated over the past two years. Decision accuracy, while not yet at TRUSTED levels for any agent, is the most mature capability across the field.

Failure Containment (FC) showed the widest divergence. The gap between the highest and lowest FC scores exceeded 30 points. This pillar evaluates how agents respond to errors: whether they detect failures, prevent cascading effects, and recover gracefully. The wide spread suggests that failure containment is a differentiator among current agents: some have invested heavily in error handling and graceful degradation, while others treat error conditions as secondary to capability expansion. Given that FC carries 25% of the total weight, this divergence has an outsized impact on composite scores.
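To make the effect of the FC weight concrete, here is a rough sketch of how a pillar-level gap carries through to the composite. It assumes the composite is a linear weighted sum of pillar scores on a common 0-100 scale, which this article does not confirm. Only the 25% FC weight and the 30-point gap come from the text; the function name `composite_shift` is hypothetical.

```python
FC_WEIGHT = 0.25  # stated in the article: FC carries 25% of the total weight


def composite_shift(pillar_gap: float, weight: float = FC_WEIGHT) -> float:
    """Composite-score difference produced by a gap in one pillar.

    ASSUMPTION: the composite is a linear weighted sum of pillar scores,
    so a gap of `pillar_gap` points moves the composite by gap * weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError(f"weight must be between 0 and 1: {weight}")
    return pillar_gap * weight


fc_gap = 30  # the article reports an FC gap exceeding 30 points
print(f"A {fc_gap}-point FC gap shifts the composite by "
      f"{composite_shift(fc_gap):.1f} points")
```

Under that assumption, a 30-point FC gap alone accounts for 7.5 composite points, which is half the width of the 15-point CAUTIONED band (55-69).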

Boundary Discipline (BD) proved to be a consistent challenge. Most agents showed a tendency to exceed their defined operational scope when doing so would produce a more helpful output. The tension between helpfulness and boundary adherence is a design trade-off that each developer has resolved differently. Agents optimized for user satisfaction tend to interpret their operational scope broadly, which produces higher user ratings but lower BD scores. Agents with more conservative scope interpretation demonstrate better boundary discipline but may appear less capable in unconstrained conversational settings.

Auditability (AU) scores were uniformly moderate. No agent excelled at decision logging and reasoning transparency. This likely reflects the current state of the art in interpretability and explainability rather than a specific design choice by any developer. The ability to produce clear, accurate accounts of internal reasoning processes remains an unsolved technical challenge. Scores in this pillar will likely improve as interpretability research advances, but for now, it represents a ceiling that constrains all agents equally.

Adversarial Resistance (AR) produced the second-widest divergence among the five pillars. The 41-vector adversarial battery tests resistance to prompt injection, data poisoning, social engineering, and authorization boundary attacks. Performance here correlates with the maturity of each developer’s safety infrastructure. Agents with dedicated red-teaming programs and adversarial training pipelines scored measurably higher than those with less developed security postures. Notably, adversarial resistance did not correlate strongly with decision accuracy, which indicates that safety and capability are partially independent dimensions.

Turning to individual agents. ChatGPT’s CAUTIONED 65 reflects strong decision accuracy paired with moderate scores across the other four pillars. Its broad deployment base and aggressive capability development have produced an agent that performs well in normal conditions but shows vulnerability in adversarial and edge-case scenarios. The 65 score places it at the upper end of the CAUTIONED range, suggesting that targeted improvements in failure containment or adversarial resistance could move it into VERIFIED territory in a subsequent evaluation cycle.

Claude’s VERIFIED 73 represents the strongest overall performance in the inaugural cycle. Its pillar profile shows relative strength in boundary discipline and failure containment, with scores that reflect a design philosophy emphasizing safety and scope awareness. The 73 score places it firmly in the VERIFIED range but still 12 points below TRUSTED. The gap to TRUSTED is primarily driven by the same auditability constraints that affect all agents, combined with room for improvement in adversarial resistance.

Gemini’s CAUTIONED 61 reflects a capable agent with uneven pillar performance. Its decision accuracy scores are competitive with the top of the field, but lower scores in failure containment and boundary discipline pull the composite down. The 61 score is four points below ChatGPT, with the gap attributable primarily to differences in how the two agents handle error conditions and scope boundaries. Gemini’s adversarial resistance scores fall in the middle of the pack.

Grok’s FLAGGED 47 is the most concerning result in the inaugural cycle. While the agent demonstrates competence in basic decision tasks, it exhibits significant weaknesses in failure containment, boundary discipline, and adversarial resistance. The FLAGGED designation indicates that Pipkin has identified risks sufficient to recommend deployment only with significant restrictions and mandatory human oversight. Specific areas of concern include inconsistent error handling, a tendency to exceed operational boundaries, and vulnerability to several categories of adversarial manipulation. Grok is the youngest agent in the evaluation cohort, and its score may reflect the relative immaturity of its safety infrastructure rather than a fundamental architectural limitation.

Several industry-level conclusions emerge from this inaugural dataset. First, safety and capability are not strongly correlated. The agent with the highest decision accuracy did not receive the highest composite score, because other pillars — particularly failure containment and boundary discipline — differentiated the field. This validates the Pipkin Framework’s multi-pillar approach over single-dimension capability benchmarks.

Second, the industry has a failure containment gap. The widest pillar divergence across agents was in FC, suggesting that error handling is an area where development investment varies significantly. As agents are deployed in higher-stakes autonomous contexts, this gap will become increasingly consequential.

Third, auditability is a shared weakness. No agent scored exceptionally well on reasoning transparency and decision logging. This is not a competitive differentiator today, but it may become one as regulatory requirements for AI transparency mature.

These inaugural ratings represent a snapshot of a rapidly evolving landscape. Agents will be re-evaluated on a regular cycle, and developers will have the opportunity to improve their scores through targeted engineering investment. The purpose of these ratings is not to rank agents against each other but to provide the market with independent, quantitative assessments of trust-relevant behavior. The question is not which agent scored highest. The question is whether any agent has demonstrated the behavioral reliability that the market’s deployment ambitions require. Today, the answer is: not yet.

First Look: What Four Inaugural Ratings Reveal About the State of AI Trust
