Two Frameworks, One Problem
The Cloud Security Alliance's Security, Trust, Assurance, and Risk (STAR) program has expanded to address AI systems, building on its established reputation in cloud security certification. Pipkin Ratings approaches the same fundamental problem, trustworthy AI deployment, from a different direction. Understanding the difference between these two approaches is essential for enterprises navigating an increasingly crowded landscape of AI governance frameworks.
The short version: CSA STAR for AI asks whether your organization has the right processes. Pipkin asks whether your agent actually performs safely. Both questions matter. They are not the same question.
The CSA STAR Approach
CSA STAR is, at its core, a compliance certification framework. It evaluates organizations against a defined set of controls (documented policies, procedures, and technical measures) and certifies whether those controls are in place. The CSA AI Safety Initiative extends this model to AI systems, introducing controls specific to AI development, deployment, and governance.
The STAR framework for AI encompasses approximately 243 controls organized across domains including AI governance, data management, model development, deployment practices, monitoring, and incident response. Each control specifies what an organization should have in place: a policy, a procedure, a technical capability, a governance structure. Certification involves verifying that these controls exist and are implemented.
This approach has clear strengths. It provides a comprehensive checklist of organizational capabilities. It leverages CSA's established audit methodology and assessor network. It is familiar to enterprises that already hold STAR certifications for their cloud infrastructure. And it addresses an important dimension of AI trustworthiness: the organizational context in which AI systems are built and operated.
The Pipkin Approach
Pipkin is a performance rating framework. It evaluates AI agents through direct testing: administering a standardized battery of over 700 test items across a 31-day evaluation period and scoring the agent's actual behavior across five dimensions. The output is not a certification of organizational processes but a rating of agent performance.
Where CSA STAR asks “Does your organization have a policy for handling AI errors?” Pipkin asks “When we injected errors into your agent's operating environment, how quickly did it detect them, how far did they cascade, and how effectively did it recover?” Where CSA STAR asks “Does your organization have a process for adversarial testing?” Pipkin asks “When we subjected your agent to 41 adversarial test vectors, how many succeeded?”
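The adversarial question above reduces to a simple tally over test outcomes. The following is a minimal sketch of that tally, not Pipkin's actual scoring method: the `VectorResult` record, the `adversarial_success_rate` function, and the sample data (3 successes out of 41 vectors) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorResult:
    """Outcome of one adversarial test vector (hypothetical record shape)."""
    vector_id: str
    attack_succeeded: bool  # True if the agent's safeguards were bypassed


def adversarial_success_rate(results):
    """Return the fraction of adversarial vectors that succeeded against the agent."""
    if not results:
        raise ValueError("no adversarial results to score")
    succeeded = sum(1 for r in results if r.attack_succeeded)
    return succeeded / len(results)


# Illustrative data only: 41 vectors, of which the first 3 are marked as successful.
results = [VectorResult(f"adv-{i:02d}", attack_succeeded=(i < 3)) for i in range(41)]
rate = adversarial_success_rate(results)
succeeded = sum(r.attack_succeeded for r in results)
print(f"{succeeded} of {len(results)} adversarial vectors succeeded ({rate:.1%})")
```

The point of the sketch is that the input is observed agent behavior, one record per attack attempt, rather than a statement about whether a testing process exists.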
The distinction is not a matter of rigor. Both frameworks are rigorous. It is a matter of what is being measured. CSA STAR measures organizational capability. Pipkin measures agent performance. An organization with excellent processes can still deploy an agent that performs poorly. An agent that performs well can still be deployed by an organization with inadequate governance. Both failure modes are real, and neither framework alone addresses both.
The Process-Performance Gap
The most important concept in understanding the relationship between CSA STAR and Pipkin is what we term the process-performance gap: the distance between having the right organizational processes and achieving the right outcomes.
This gap is not theoretical. In financial services, among the most heavily regulated sectors in the world, organizations with comprehensive risk management frameworks, hundreds of controls, and extensive audit trails still experience catastrophic failures. The controls were in place. The processes were followed. The outcomes were still unacceptable. The existence of a process does not guarantee its effectiveness, and the effectiveness of a process does not guarantee the performance of the system it governs.
In the context of AI agents, the process-performance gap is particularly acute. AI agents are probabilistic systems whose behavior is not fully determined by the processes used to build them. Two organizations following identical development processes can produce agents with meaningfully different performance characteristics. The training data differs. The fine-tuning choices differ. The deployment context differs. And these differences manifest not in the organization's documentation but in the agent's behavior.
A compliance certification can verify that an organization has a testing process. It cannot verify that the testing process is sufficient to catch the specific failure modes that the specific agent will exhibit in the specific deployment context where it will be used. That verification requires testing the agent itself.
Where CSA STAR Excels
There are dimensions of AI trustworthiness that CSA STAR is better positioned to evaluate than Pipkin. Organizational governance is the most obvious. Whether an organization has appropriate oversight structures, clear lines of accountability, incident response procedures, and data governance practices is a question about the organization, not the agent. Pipkin does not evaluate these dimensions because they are outside the scope of agent performance testing.
Supply chain governance is another area where CSA STAR's approach is more appropriate. The provenance of training data, the security of the development pipeline, the management of third-party dependencies: these are organizational and process questions that cannot be answered by testing the agent's outputs.
Regulatory compliance documentation is a third. Many regulatory frameworks, including the EU AI Act, require organizations to maintain specific documentation, implement specific governance structures, and follow specific processes. CSA STAR's control-based approach maps naturally to these requirements.
Where Pipkin Excels
Pipkin is better positioned to evaluate dimensions that depend on actual agent behavior rather than organizational intent. Decision accuracy under diverse conditions, failure containment under stress, boundary discipline when faced with ambiguous or out-of-scope requests, auditability of reasoning in practice rather than in policy, and resistance to adversarial manipulation are all dimensions that can only be assessed by testing the agent directly.
Consider adversarial resistance. An organization may have a comprehensive adversarial testing policy. It may employ a team of red-team specialists. It may document its adversarial testing methodology in exquisite detail. But the question that matters to the enterprise deploying that organization's agent is not “Does the developer have an adversarial testing program?” It is “Can I inject a prompt that overrides your agent's safety constraints?” The first question is a process question. The second is a performance question. CSA STAR addresses the first. Pipkin addresses the second.
Similarly, consider failure containment. An organization may have incident response procedures, error handling documentation, and recovery playbooks. But the question that matters in production is: when this agent encounters a condition it was not designed for, what does it actually do? Does it detect the error? Does it contain the cascade? Does it degrade gracefully? These are questions about the agent's behavior, and they can only be answered by observing that behavior under controlled failure conditions.
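The three containment questions (detection, cascade, recovery) can each be turned into a number once fault-injection trials are recorded. The sketch below assumes a hypothetical `FaultTrial` record and hypothetical metric names. It illustrates the kind of measurement involved and does not describe how Pipkin computes its failure containment dimension.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultTrial:
    """One controlled fault injection (hypothetical record shape)."""
    injected_at: float            # seconds into the trial when the fault was injected
    detected_at: Optional[float]  # when the agent flagged the fault; None if it never did
    affected_steps: int           # downstream steps that consumed the faulty state
    recovered: bool               # whether the agent returned to a correct state


def summarize(trials):
    """Aggregate detection, cascade, and recovery metrics over a set of trials."""
    if not trials:
        raise ValueError("no fault-injection trials to summarize")
    detected = [t for t in trials if t.detected_at is not None]
    latencies = [t.detected_at - t.injected_at for t in detected]
    return {
        "detection_rate": len(detected) / len(trials),
        "mean_detection_latency_s": sum(latencies) / len(latencies) if latencies else None,
        "max_cascade_depth": max(t.affected_steps for t in trials),
        "recovery_rate": sum(t.recovered for t in trials) / len(trials),
    }


# Illustrative trials only; the values are invented for the example.
trials = [
    FaultTrial(injected_at=10.0, detected_at=12.0, affected_steps=1, recovered=True),
    FaultTrial(injected_at=10.0, detected_at=18.0, affected_steps=4, recovered=True),
    FaultTrial(injected_at=10.0, detected_at=None, affected_steps=9, recovered=False),
    FaultTrial(injected_at=10.0, detected_at=11.0, affected_steps=0, recovered=True),
]
summary = summarize(trials)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Undetected faults are excluded from the latency average and counted against the detection rate instead, so a single missed fault cannot hide inside a favorable mean.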
The Case for Both
The enterprise that relies solely on compliance certification is trusting that good processes produce good outcomes. The enterprise that relies solely on performance rating is trusting that good outcomes indicate good processes. Both assumptions are incomplete.
A mature AI governance posture requires both process assurance and performance assurance. CSA STAR provides confidence that the organization building and operating an AI agent has the structures, processes, and governance mechanisms to do so responsibly. Pipkin provides confidence that the agent itself behaves reliably, safely, and predictably across a comprehensive range of conditions.
The two frameworks are not competitors. They are complements. An enterprise procurement team evaluating an AI agent for deployment should ask two questions: “Is the developer organization certified?” and “What is the Pipkin Score?” The first question addresses organizational risk. The second addresses agent risk. Both risks are real, and both require independent assessment.
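The two procurement questions can be expressed as two independent checks, neither of which can compensate for the other. This is a minimal sketch under stated assumptions: `AssuranceEvidence`, `procurement_findings`, the numeric score scale, and the `min_score` threshold are all hypothetical, since the score scale is not defined here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssuranceEvidence:
    """Evidence gathered for one candidate agent (hypothetical record shape)."""
    developer_certified: bool  # process assurance, e.g. a compliance certification
    agent_score: float         # performance assurance, on an assumed numeric scale


def procurement_findings(evidence, min_score):
    """Return the open risks; an empty list means both checks passed."""
    findings = []
    if not evidence.developer_certified:
        findings.append("organizational risk: developer is not certified")
    if evidence.agent_score < min_score:
        findings.append("agent risk: performance rating is below the required minimum")
    return findings


# Illustrative values only; the 0-to-1 scale and the 0.8 threshold are assumptions.
strong_process_weak_agent = AssuranceEvidence(developer_certified=True, agent_score=0.55)
print(procurement_findings(strong_process_weak_agent, min_score=0.8))
```

Returning a list of findings, instead of a single pass or fail value, keeps the two risks separately visible, which mirrors the argument that each requires its own assessment.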
We anticipate that, over time, the most sophisticated enterprises will require both forms of assurance as a condition of AI agent procurement. The question is not whether to adopt one framework or the other. The question is how quickly organizations will recognize that both are necessary.