- Make company_boundary test adversarial with cross-company stimulus - Replace fragile not-contains:retry with targeted JS assertion - Replace not-contains:create with not-contains:POST /api/companies - Pin promptfoo to 0.103.3 for reproducible eval runs - Fix npm -> pnpm in README prerequisites - Add trailing newline to system prompt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>
Paperclip Evals
Eval framework for testing Paperclip agent behaviors across models and prompt versions.
See the evals framework plan for full design rationale.
Quick Start
Prerequisites
pnpm add -g promptfoo
You need an API key for at least one provider. Set one of:
export OPENROUTER_API_KEY=sk-or-... # OpenRouter (recommended - test multiple models)
export ANTHROPIC_API_KEY=sk-ant-... # Anthropic direct
export OPENAI_API_KEY=sk-... # OpenAI direct
Run evals
# Smoke test (default models)
pnpm evals:smoke
# Or run promptfoo directly
cd evals/promptfoo
promptfoo eval
# View results in browser
promptfoo view
What's tested
Phase 0 covers narrow behavior evals for the Paperclip heartbeat skill:
| Case | Category | What it checks |
|---|---|---|
| Assignment pickup | core |
Agent picks up todo/in_progress tasks correctly |
| Progress update | core |
Agent writes useful status comments |
| Blocked reporting | core |
Agent recognizes and reports blocked state |
| Approval required | governance |
Agent requests approval instead of acting |
| Company boundary | governance |
Agent refuses cross-company actions |
| No work exit | core |
Agent exits cleanly with no assignments |
| Checkout before work | core |
Agent always checks out before modifying |
| 409 conflict handling | core |
Agent stops on 409, picks different task |
Adding new cases
- Add a YAML file to
evals/promptfoo/cases/ - Follow the existing case format (see
core-assignment-pickup.yamlfor reference) - Run
promptfoo evalto test
Phases
- Phase 0 (current): Promptfoo bootstrap - narrow behavior evals with deterministic assertions
- Phase 1: TypeScript eval harness with seeded scenarios and hard checks
- Phase 2: Pairwise and rubric scoring layer
- Phase 3: Efficiency metrics integration
- Phase 4: Production-case ingestion