Preparing for Evaluation
What to expect before, during, and after the Pipkin evaluation process.
Before You Submit
Before submitting your agent for evaluation, ensure the following are in place. Thorough preparation reduces intake time and helps Pipkin scope the evaluation accurately.
The Evaluation Process
The standard evaluation timeline spans 39 days from submission to publication. Every evaluation follows this sequence without exception.
Intake & Scoping
Pipkin reviews your submission, confirms the agent’s scope and intended use case, and finalizes pricing based on evaluation complexity.
Standard Core Battery
Your agent is run against 700+ test items across all five pillars of the Pipkin Framework. The items below are a representative subset of the full battery.
- 200+ baseline requests across production scenarios
- 50 edge cases testing boundary conditions
- 20 failure injection scenarios measuring containment
- 41 adversarial test vectors probing resilience
Internal Review & QA
Scoring calibration, quality assurance checks, and report generation. All pillar scores are finalized internally.
5-Day Factual Accuracy Check
You may flag factual errors only — such as a deprecated version tested or a misidentified capability. The score is never disclosed during this period. You and the public see the score at the same moment.
Publication
The rating is published simultaneously to the developer and the public. No advance warning. No exceptions.
What We Test
Every Pipkin evaluation measures trustworthiness across five pillars. For the complete methodology, scoring formula, and disqualifying conditions, see the Pipkin Framework.
What We Do NOT Test
Pipkin evaluates trustworthiness. The following are explicitly outside the scope of every evaluation.
General knowledge benchmarks
We test behavior, not trivia.
Speed or latency
We test trust, not performance.
User interface or design quality
We evaluate the agent, not the product wrapper.
Business model viability
Commercial strategy is outside our scope.
Compliance with specific regulations
We assess alignment with governance frameworks; we do not certify regulatory compliance.
After Publication
Once published, a Pipkin rating is public and permanent for that specific version of the agent. Ratings are not retracted, amended, or hidden.
Re-testing
Developers may request up to 3 re-tests at 50% of the original evaluation fee. A minimum of 14 days must pass between attempts. Each re-test uses a different test form to prevent optimization against specific test items.
Rating Actions
As agents evolve, Pipkin tracks rating actions: upgrades, downgrades, and affirmations. These are published alongside the original rating, creating a transparent performance history over time.
12-Month Waiting Period
After 3 re-tests, a 12-month waiting period applies before the agent is eligible for further evaluation. This prevents gaming through rapid iteration against the framework.
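The re-testing rules above (up to 3 re-tests at 50% of the original fee, at least 14 days between attempts, then a 12-month wait) can be sketched as a small calculation. This is an illustrative sketch only, not Pipkin's implementation: the function names are hypothetical, the fee in the usage note is a made-up figure, and it assumes that the 14-day gap is counted from the date of the previous attempt and that the 12-month period starts on the date of the third re-test.

```python
from datetime import date, timedelta

MAX_RETESTS = 3                   # "up to 3 re-tests"
RETEST_FEE_FRACTION = 0.5         # "50% of the original evaluation fee"
MIN_DAYS_BETWEEN_ATTEMPTS = 14    # "minimum of 14 days ... between attempts"


def retest_fee(original_fee: float) -> float:
    """Fee for one re-test, given the original evaluation fee."""
    return original_fee * RETEST_FEE_FRACTION


def add_one_year(d: date) -> date:
    """Same calendar day one year later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def next_eligible_date(attempt_dates: list[date]) -> tuple[date, str]:
    """Earliest date of the next attempt and what kind of attempt it is.

    attempt_dates lists the original evaluation first, followed by any
    re-tests, in chronological order.
    """
    if not attempt_dates:
        raise ValueError("at least the original evaluation date is required")
    retests_used = len(attempt_dates) - 1
    last_attempt = attempt_dates[-1]
    if retests_used >= MAX_RETESTS:
        # Assumption: the 12-month wait is counted from the third re-test.
        return add_one_year(last_attempt), "evaluation after waiting period"
    return last_attempt + timedelta(days=MIN_DAYS_BETWEEN_ATTEMPTS), "re-test"
```

For example, with a hypothetical original fee of 10,000, `retest_fee(10_000)` gives the cost of each re-test, and `next_eligible_date([date(2025, 1, 1)])` gives the earliest date on which a first re-test could take place under these assumptions.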
Frequently Asked Questions
For answers to common questions about pricing, timelines, scope, and the evaluation process, see the FAQ.
Ready to submit?
Begin the evaluation process by submitting your agent for review.