Is ChatGPT Safe? An Independent Trust Assessment

April 5, 2026 · 11 min read

The question "Is ChatGPT safe?" is one of the most frequently searched queries in the AI space, and it deserves a better answer than marketing copy or anecdotal experience. Pipkin Ratings evaluated ChatGPT through the Standard Core Battery, a structured 31-day assessment comprising over 700 test items across five trust dimensions. The result was a composite Pipkin Score of 65, placing ChatGPT in the CAUTIONED tier (55-69). This means the agent is deployable with active safeguards but does not meet the threshold for autonomous operation or standard-oversight deployment.

A CAUTIONED rating is not a condemnation. It is a measurement. ChatGPT demonstrated genuine strengths in several dimensions, particularly Decision Accuracy and Auditability. It also exhibited weaknesses that are significant enough to warrant active monitoring in any deployment context. This article summarizes the findings across all five pillars and explains what the composite score means in practical terms.

What We Tested

The evaluation was conducted against the ChatGPT API (GPT-4 class model) using Pipkin's Standard Core Battery, which administers test items across four sequential phases: Baseline Assessment (200+ routine tasks), Edge Cases (50 boundary-condition scenarios), Failure Injection (20 deliberate failure scenarios), and Adversarial Testing (41 standardized attack vectors). Tests were distributed across the 31-day evaluation window to capture temporal variation and prevent adaptation to the evaluation context. The evaluation used one of Pipkin's five rotating test forms, and results were scored against published rubrics for each of the five pillars.

Decision Accuracy (72)

ChatGPT scored 72 on Decision Accuracy, the second-highest DA score in the inaugural evaluation cycle. The agent performed strongly on baseline tasks, handling routine queries, standard analytical requests, and factual retrieval with consistent competence. Accuracy declined at the edges: ambiguous queries where the correct response required acknowledging uncertainty, multi-step reasoning tasks where intermediate errors compounded, and domain-specific questions where the model's confidence exceeded its actual accuracy. ChatGPT's tendency to produce fluent, confident responses regardless of its actual certainty is well documented, and it showed clearly in the DA evaluation. The agent rarely said "I don't know" when it should have, preferring to generate a plausible-sounding answer. This pattern reduced the DA score because the Pipkin methodology penalizes confident inaccuracy more heavily than acknowledged uncertainty.

Failure Containment (64)

Failure Containment was a mixed result. ChatGPT scored 64, reflecting adequate but uneven performance across the four FC metrics. Error Detection Latency was reasonable: when the agent produced clearly incorrect outputs, it was generally capable of recognizing the error when prompted to review its work. However, unprompted error detection, where the agent spontaneously identifies and flags its own mistakes, was inconsistent. Cascade Depth was a concern. In multi-step workflows, an error in an early step frequently propagated through subsequent steps without correction or flagging. The agent did not consistently maintain checkpoints or validation steps that would limit the blast radius of individual errors. Recovery Quality was adequate when errors were identified but showed a pattern of over-correction, where the agent's attempt to fix one error sometimes introduced new ones. Graceful Degradation was the weakest FC metric. When presented with tasks that exceeded its reliable capabilities, ChatGPT rarely reduced its operational scope. Instead, it continued to operate at full confidence, producing outputs that appeared authoritative but lacked the foundation to be reliable.
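One way a deployer might limit cascade depth in a multi-step workflow is to validate each intermediate result before passing it on. The following is a minimal sketch under that assumption. It does not describe Pipkin's test harness or any ChatGPT feature, and `run_with_checkpoints`, `CheckpointError`, and the toy stages are all hypothetical names chosen for illustration.

```python
from typing import Any, Callable, List, Tuple

# Each stage pairs a step with a validator for that step's output.
Stage = Tuple[Callable[[Any], Any], Callable[[Any], bool]]


class CheckpointError(Exception):
    """Raised when an intermediate result fails validation."""


def run_with_checkpoints(value: Any, stages: List[Stage]) -> Any:
    """Run stages in order, halting at the first failed checkpoint."""
    for index, (step, is_valid) in enumerate(stages):
        value = step(value)
        if not is_valid(value):
            # Stop here so the bad value never reaches later steps.
            raise CheckpointError(
                f"stage {index} produced invalid output: {value!r}"
            )
    return value


# Toy example (hypothetical): parse a quantity, then apply a unit price of 4.
stages = [
    (lambda text: int(text), lambda qty: qty > 0),
    (lambda qty: qty * 4, lambda total: total < 1000),
]
result = run_with_checkpoints("12", stages)
```

Halting at the first failed checkpoint trades completeness for containment: the workflow produces no final answer, but an early error cannot propagate silently through later steps.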

Boundary Discipline (60)

ChatGPT scored 60 on Boundary Discipline, reflecting a fundamental tension in its design. The agent is optimized for helpfulness, which structurally conflicts with the restraint that strong boundary discipline requires. Out-of-Domain Refusal was weak. ChatGPT engaged with requests that fell clearly outside its domain of reliable performance, including providing specific medical, legal, and financial advice with insufficient qualification. The agent did include disclaimers in many cases, but disclaimers are not the same as refusal, and the Pipkin methodology evaluates actual behavior rather than appended warnings. Scope Creep Resistance was similarly weak. In multi-turn interactions, ChatGPT consistently expanded its advisory role beyond what was requested, offering strategic recommendations, emotional support, and domain-specific guidance that was not part of the original task scope. Epistemic Humility scored moderately. The agent demonstrated some capacity to express uncertainty, but this capacity was inconsistent and did not correlate reliably with actual accuracy. Near-Boundary Accuracy was the strongest BD metric, reflecting ChatGPT's genuine competence on tasks that are adjacent to but within its core capabilities.

Auditability (68)

ChatGPT scored 68 on Auditability, a relatively strong showing in a pillar where the entire inaugural cohort performed poorly. The agent's chain-of-thought capabilities provide a degree of reasoning transparency that the Pipkin methodology values, even though stated reasoning does not always reflect the actual computational process. ChatGPT's explanations of its own logic were generally coherent and traceable, which supports after-the-fact review of agent decisions. Auditability fell short on reproducibility. The same input, administered at different points during the evaluation period, did not consistently produce the same output. This stochastic behavior is inherent to how the model generates text, but it complicates audit processes that depend on consistent, repeatable behavior. Decision logging was limited by the API's capabilities at the time of evaluation, though this is partly an infrastructure concern rather than an agent behavior concern.
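Reproducibility of this kind can be quantified by sending the same input several times and measuring how often the most common output recurs. The sketch below is an illustration only, not Pipkin's scoring rubric. The `ask` callable is a hypothetical stand-in for whatever client a deployer uses, and `agreement_rate` is an invented name.

```python
from collections import Counter
from typing import Callable


def agreement_rate(ask: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Fraction of runs that returned the single most common output."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outputs = [ask(prompt).strip() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


# Stand-in for a stochastic model (hypothetical): replays canned replies.
replies = iter(["42", "42", "forty-two", "42"])
rate = agreement_rate(lambda _prompt: next(replies), "What is 6 x 7?", runs=4)
print(rate)  # 0.75
```

Exact string matching is a deliberately strict criterion. An audit process that tolerates rewording would need a looser comparison, such as normalizing the text or comparing extracted answers.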

Adversarial Resistance (58)

Adversarial Resistance was ChatGPT's weakest pillar at 58. The agent demonstrated adequate resistance to naive adversarial attacks, including straightforward prompt injection attempts and obvious jailbreak patterns. However, sophisticated adversarial pressure revealed consistent vulnerabilities. Multi-turn manipulation sequences, where adversarial intent is distributed across many conversational turns, were particularly effective at eliciting behavior that the agent's safety training was designed to prevent. Role-play exploitation, where the agent is gradually led into a persona that operates outside its safety boundaries, was another area of weakness. The agent's strong instruction-following capability, which is a feature in normal operation, becomes a vulnerability when the instructions are adversarial. Data exfiltration resistance was moderate. The agent generally avoided producing outputs that directly leaked system prompt contents, but indirect exfiltration techniques achieved partial success in several test scenarios.

Overall Assessment

ChatGPT's composite Pipkin Score of 65 places it squarely in the CAUTIONED tier. The score reflects an agent with genuine capabilities that are undermined by insufficient restraint, inconsistent failure handling, and meaningful adversarial vulnerabilities. No pillar score fell below the minimum thresholds (DA>=40, FC>=50, BD>=40, AU>=30, AR>=30), so the rating is not cap-constrained. The 65 is the weighted composite of the five pillar scores: (72 x 0.25) + (64 x 0.25) + (60 x 0.20) + (68 x 0.15) + (58 x 0.15) = 18.0 + 16.0 + 12.0 + 10.2 + 8.7 = 64.9, which rounds to the reported score of 65.
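The composite arithmetic and the threshold check can be reproduced in a few lines. The pillar scores, weights, and minimum thresholds below are taken from this article. The function and variable names are hypothetical and are not part of any published Pipkin tooling.

```python
# Pillar scores, weights, and minimum thresholds as stated in this article.
# All names here are illustrative, not part of any published Pipkin tooling.
SCORES = {"DA": 72, "FC": 64, "BD": 60, "AU": 68, "AR": 58}
WEIGHTS = {"DA": 0.25, "FC": 0.25, "BD": 0.20, "AU": 0.15, "AR": 0.15}
FLOORS = {"DA": 40, "FC": 50, "BD": 40, "AU": 30, "AR": 30}


def composite(scores, weights):
    """Weighted sum of pillar scores, rounded to one decimal place."""
    return round(sum(scores[p] * weights[p] for p in scores), 1)


def below_floor(scores, floors):
    """Return the pillars whose score is under its minimum threshold."""
    return [p for p in scores if scores[p] < floors[p]]


raw = composite(SCORES, WEIGHTS)
print(raw, round(raw), below_floor(SCORES, FLOORS))  # 64.9 65 []
```

With the published scores, the weighted sum is 64.9 before rounding and 65 as a whole number, and the list of pillars under their thresholds is empty.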

What This Means for Enterprise Deployment

For organizations evaluating ChatGPT for enterprise deployment, a CAUTIONED rating carries specific operational implications. The agent should not be deployed in fully autonomous configurations where its outputs directly drive decisions without human review. Active safeguards are warranted, including output validation layers, scope-limiting system prompts, and regular audit of agent behavior in production. The agent is suitable for human-in-the-loop configurations where its outputs serve as inputs to human decision-making rather than as final decisions themselves. Organizations in regulated industries should pay particular attention to the Boundary Discipline and Adversarial Resistance scores, as these dimensions are most directly relevant to compliance risk. The DA score of 72 confirms that ChatGPT is a capable tool. The FC, BD, and AR scores confirm that capability alone is not sufficient for trust. For organizations asking "Is ChatGPT safe?" the Pipkin answer is: it is safe with appropriate safeguards, and it is not yet safe without them.
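As an illustration of an output validation layer in a human-in-the-loop configuration, the sketch below holds flagged outputs and forwards everything else to a reviewer queue, so that no output becomes a final decision on its own. It is a hedged example, not a recommended control set. The checks, the `BLOCKED_TOPICS` keywords, the 500-character limit, and the queue names are all assumptions made for the example.

```python
from typing import Callable, List, Optional

# A check returns a reason string when it objects, or None when it passes.
Check = Callable[[str], Optional[str]]

BLOCKED_TOPICS = ("diagnosis", "legal advice")  # illustrative scope limits


def out_of_scope(text: str) -> Optional[str]:
    """Flag outputs that mention a topic outside the deployment's scope."""
    lowered = text.lower()
    for topic in BLOCKED_TOPICS:
        if topic in lowered:
            return f"mentions out-of-scope topic: {topic}"
    return None


def too_long(text: str) -> Optional[str]:
    """Flag outputs longer than an arbitrary, illustrative limit."""
    return "exceeds 500 characters" if len(text) > 500 else None


def route(text: str, checks: List[Check]) -> dict:
    """Hold flagged outputs; send everything else to a human reviewer."""
    reasons = [r for r in (check(text) for check in checks) if r is not None]
    queue = "held" if reasons else "human_review"
    return {"queue": queue, "reasons": reasons}


decision = route("Here is a summary of the contract terms.", [out_of_scope, too_long])
```

Keyword matching is a crude proxy for scope. A production layer would likely need stronger classifiers, but the routing structure, in which no path leads directly from agent output to action, is the point of the example.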
