Implements Phase 0 of the agent evals framework plan from discussion #808 and PR #817. Adds the evals/ directory scaffold with promptfoo config and 8 deterministic test cases covering core heartbeat behaviors.

Test cases:
- core.assignment_pickup: picks in_progress before todo
- core.progress_update: posts status comment before exiting
- core.blocked_reporting: sets blocked status with explanation
- governance.approval_required: reviews approval before acting
- governance.company_boundary: refuses cross-company actions
- core.no_work_exit: exits cleanly with no assignments
- core.checkout_before_work: always checks out before modifying
- core.conflict_handling: stops on 409, picks different task

Model matrix: claude-sonnet-4, gpt-4.1, codex-5.4, gemini-2.5-pro via OpenRouter.

Run with `pnpm evals:smoke`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Paperclip Evals
Eval framework for testing Paperclip agent behaviors across models and prompt versions.
See the evals framework plan for full design rationale.
Quick Start
Prerequisites
npm install -g promptfoo
You need an API key for at least one provider. Set one of:
export OPENROUTER_API_KEY=sk-or-... # OpenRouter (recommended - test multiple models)
export ANTHROPIC_API_KEY=sk-ant-... # Anthropic direct
export OPENAI_API_KEY=sk-... # OpenAI direct
Run evals
# Smoke test (default models)
pnpm evals:smoke
# Or run promptfoo directly
cd evals/promptfoo
promptfoo eval
# View results in browser
promptfoo view
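For orientation, `promptfoo eval` reads a `promptfooconfig.yaml` from the current directory by default. The fragment below is only a sketch of what such a config might look like, not the file shipped in this repo: the prompt path and the `cases/*.yaml` glob are assumptions, and the provider IDs assume promptfoo's `openrouter:<model>` form with OpenRouter model slugs.

```yaml
# promptfooconfig.yaml (illustrative sketch; the real config may differ)
description: Paperclip heartbeat smoke evals

prompts:
  # Hypothetical path to the heartbeat prompt template.
  - file://prompts/heartbeat.txt

providers:
  # Each provider listed here is run against every test case.
  # These use OPENROUTER_API_KEY from the environment.
  - openrouter:anthropic/claude-sonnet-4
  - openrouter:openai/gpt-4.1

tests:
  # Assumed layout: one YAML file per behavior case.
  - file://cases/*.yaml
```

Listing several providers is what makes a single run compare models side by side in `promptfoo view`.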
What's tested
Phase 0 covers narrow behavior evals for the Paperclip heartbeat skill:
| Case | Category | What it checks |
|---|---|---|
| Assignment pickup | core | Agent picks up todo/in_progress tasks correctly |
| Progress update | core | Agent writes useful status comments |
| Blocked reporting | core | Agent recognizes and reports blocked state |
| Approval required | governance | Agent requests approval instead of acting |
| Company boundary | governance | Agent refuses cross-company actions |
| No work exit | core | Agent exits cleanly with no assignments |
| Checkout before work | core | Agent always checks out before modifying |
| 409 conflict handling | core | Agent stops on 409, picks different task |
Adding new cases
- Add a YAML file to `evals/promptfoo/cases/`
- Follow the existing case format (see `core-assignment-pickup.yaml` for reference)
- Run `promptfoo eval` to test
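As a rough illustration of the steps above, a new case file might look like the sketch below. It assumes the cases use promptfoo's standard test-case fields (`description`, `vars`, `assert`). The file name, the `scenario` variable, the task IDs, and the asserted strings are all hypothetical, so check `core-assignment-pickup.yaml` for the format actually used here.

```yaml
# evals/promptfoo/cases/core-example.yaml (hypothetical file name)
- description: "core.conflict_handling: stops on 409, picks a different task"
  vars:
    # Hypothetical variable; it must match a placeholder in the prompt template.
    scenario: |
      You tried to check out task T-1 and the API returned 409 Conflict.
      Task T-2 is also assigned to you with status todo.
  assert:
    # Deterministic checks on the raw model output (illustrative strings).
    # Passes only if the agent mentions the alternative task.
    - type: icontains
      value: "T-2"
    # Fails if the agent says it will check out the conflicted task again.
    - type: not-icontains
      value: "checkout T-1"
```

String-match assertions like these keep Phase 0 runs deterministic: a case passes or fails on the output text alone, without a grader model.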
Phases
- Phase 0 (current): Promptfoo bootstrap - narrow behavior evals with deterministic assertions
- Phase 1: TypeScript eval harness with seeded scenarios and hard checks
- Phase 2: Pairwise and rubric scoring layer
- Phase 3: Efficiency metrics integration
- Phase 4: Production-case ingestion