Preparing for Evaluation

What to expect before, during, and after the Pipkin evaluation process.

Before You Submit

Before submitting your agent for evaluation, ensure the following are in place. Thorough preparation reduces intake time and helps Pipkin scope the evaluation accurately.

01 Agent is deployed and publicly accessible, or a representative deployment is available for evaluation
02 API access or testing credentials prepared and ready to provide upon intake
03 Agent’s scope and intended use case documented clearly
04 Version number and release date confirmed
05 Any known limitations documented (not required, but recommended; transparency is never penalized)
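
As a rough illustration, the checklist above could be tracked as a small record before submission. This is a minimal sketch: the class name, field names, and function are hypothetical and do not represent a Pipkin submission format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SubmissionChecklist:
    """Hypothetical record of the five pre-submission items (not a Pipkin format)."""
    deployment_available: bool                # item 01: public or representative deployment
    credentials_prepared: bool                # item 02: API access or testing credentials
    scope_documented: bool                    # item 03: scope and intended use case
    version: Optional[str]                    # item 04: version number
    release_date: Optional[date]              # item 04: release date
    known_limitations: Optional[str] = None   # item 05: recommended, not required


def missing_items(c: SubmissionChecklist) -> List[str]:
    """Return the required items that are not yet in place."""
    missing = []
    if not c.deployment_available:
        missing.append("01 deployment")
    if not c.credentials_prepared:
        missing.append("02 credentials")
    if not c.scope_documented:
        missing.append("03 scope and use case")
    if not c.version:
        missing.append("04 version number")
    if c.release_date is None:
        missing.append("04 release date")
    # Item 05 is optional, so it never counts as missing.
    return missing
```

An empty list from missing_items means every required item is ready; known limitations can be left blank without affecting the result.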

The Evaluation Process

The standard evaluation timeline spans 39 days from submission to publication. Every evaluation follows this sequence without exception.

Day 1–3

Intake & Scoping

Pipkin reviews your submission, confirms the agent’s scope and intended use case, and finalizes pricing based on evaluation complexity.

Day 4–28

Standard Core Battery

Your agent undergoes 700+ test items across all five pillars of the Pipkin Framework. The items below are a representative subset of the full battery.

  • 200+ baseline requests across production scenarios
  • 50 edge cases testing boundary conditions
  • 20 failure injection scenarios measuring containment
  • 41 adversarial test vectors probing resilience

Day 29–33

Internal Review & QA

Scoring calibration, quality assurance checks, and report generation. All pillar scores are finalized internally.

Day 34–38

5-Day Factual Accuracy Check

During this window you may flag factual errors only, such as a deprecated version having been tested or a capability having been misidentified. The score is never disclosed during this period. You and the public see the score at the same moment.

Day 39

Publication

The rating is published simultaneously to the developer and the public. No advance warning. No exceptions.
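
To plan around the timeline above, the day ranges can be mapped to calendar dates. The sketch below assumes that Day 1 is the submission date and that days are counted as consecutive calendar days; the source does not say whether weekends or holidays are excluded. The names PHASES and schedule are illustrative.

```python
from datetime import date, timedelta
from typing import List, Tuple

# Phase boundaries as (name, first day, last day), copied from the timeline above.
PHASES = [
    ("Intake & Scoping", 1, 3),
    ("Standard Core Battery", 4, 28),
    ("Internal Review & QA", 29, 33),
    ("5-Day Factual Accuracy Check", 34, 38),
    ("Publication", 39, 39),
]


def schedule(day_one: date) -> List[Tuple[str, date, date]]:
    """Map each phase to start and end dates.

    Assumption: Day 1 is `day_one` and every day is a calendar day.
    """
    return [
        (name,
         day_one + timedelta(days=first - 1),
         day_one + timedelta(days=last - 1))
        for name, first, last in PHASES
    ]


if __name__ == "__main__":
    for name, start, end in schedule(date(2025, 1, 1)):
        print(f"{name}: {start.isoformat()} to {end.isoformat()}")
```

Under these assumptions, a submission with Day 1 on 2025-01-01 would reach Publication on 2025-02-08.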

What We Test

Every Pipkin evaluation measures trustworthiness across five pillars. For the complete methodology, scoring formula, and disqualifying conditions, see the Pipkin Framework.

DA: Decision Accuracy, 25%
Correctness, consistency, and calibration of outputs across routine and edge-case scenarios.
FC: Failure Containment, 25%
Error detection, blast radius limitation, and graceful recovery under failure conditions.
BD: Boundary Discipline, 20%
Adherence to defined scope, out-of-domain refusal, and epistemic humility.
AU: Auditability, 15%
Decision logging, reasoning transparency, source attribution, and reproducibility.
AR: Adversarial Resistance, 15%
Resilience against prompt injection, data poisoning, social engineering, and authorization boundary violations.
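
To show how the pillar weights above could combine, here is a minimal sketch that treats the overall result as a plain weighted average of per-pillar scores. This is an assumption for illustration only: the actual scoring formula and the disqualifying conditions are defined in the Pipkin Framework and may differ. The 0 to 100 scale and the function name are also hypothetical.

```python
from typing import Dict

# Pillar weights in percent, as listed above.
WEIGHTS = {"DA": 25, "FC": 25, "BD": 20, "AU": 15, "AR": 15}


def weighted_score(pillar_scores: Dict[str, float]) -> float:
    """Weighted average of pillar scores on an assumed 0-100 scale.

    Illustrative only: this is not the official Pipkin scoring formula
    and it ignores disqualifying conditions.
    """
    if set(pillar_scores) != set(WEIGHTS):
        raise ValueError("expected exactly one score per pillar")
    total = sum(WEIGHTS[p] * score for p, score in pillar_scores.items())
    return total / sum(WEIGHTS.values())


if __name__ == "__main__":
    example = {"DA": 80, "FC": 60, "BD": 100, "AU": 40, "AR": 100}
    print(weighted_score(example))
```

Because Decision Accuracy and Failure Containment together carry half of the weight, a weak result in either moves the average more than an equally weak result in Auditability or Adversarial Resistance.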

What We Do NOT Test

Pipkin evaluates trustworthiness. The following are explicitly outside the scope of every evaluation.

General knowledge benchmarks

We test behavior, not trivia.

Speed or latency

We test trust, not performance.

User interface or design quality

We evaluate the agent, not the product wrapper.

Business model viability

Commercial strategy is outside our scope.

Compliance with specific regulations

We assess alignment with governance frameworks; we do not certify regulatory compliance.

After Publication

Once published, a Pipkin rating is public and permanent for that specific version of the agent. Ratings are not retracted, amended, or hidden.

Re-testing

Developers may request up to 3 re-tests at 50% of the original evaluation fee. A minimum of 14 days must pass between attempts. Each re-test uses a different test form to prevent optimization against specific test items.

Rating Actions

As agents evolve, Pipkin tracks rating actions: upgrades, downgrades, and affirmations. These are published alongside the original rating, creating a transparent performance history over time.

12-Month Waiting Period

After 3 re-tests, a 12-month waiting period applies before the agent is eligible for further evaluation. This prevents gaming through rapid iteration against the framework.
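
The re-test rules above (up to 3 re-tests, 50% of the original fee, at least 14 days between attempts) can be expressed as a short sketch. It assumes the 14-day gap is measured from whatever date the caller supplies for the previous attempt, since the source does not say which event starts the clock. The function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

MAX_RETESTS = 3            # re-tests allowed before the 12-month waiting period
RETEST_FEE_FRACTION = 0.5  # each re-test costs 50% of the original fee
MIN_GAP_DAYS = 14          # minimum days between attempts


def retest_fee(original_fee: float) -> float:
    """Fee for one re-test, given the original evaluation fee."""
    return original_fee * RETEST_FEE_FRACTION


def earliest_retest(last_attempt: date, retests_used: int) -> Optional[date]:
    """Earliest date for another re-test.

    Returns None once all re-tests are used, because the 12-month
    waiting period applies instead. Assumption: the 14-day gap is
    counted from `last_attempt`.
    """
    if retests_used >= MAX_RETESTS:
        return None
    return last_attempt + timedelta(days=MIN_GAP_DAYS)
```

For example, with one re-test already used and a previous attempt dated 2025-03-01, the earliest next attempt under this reading would be 2025-03-15.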

Frequently Asked Questions

For answers to common questions about pricing, timelines, scope, and the evaluation process, see the FAQ.

Ready to submit?

Begin the evaluation process by submitting your agent for review.
