feat(evals): bootstrap promptfoo eval framework (Phase 0)

Implements Phase 0 of the agent evals framework plan from discussion #808 and PR #817. Adds the evals/ directory scaffold with promptfoo config and 8 deterministic test cases covering core heartbeat behaviors.

Test cases:
- core.assignment_pickup: picks in_progress before todo
- core.progress_update: posts status comment before exiting
- core.blocked_reporting: sets blocked status with explanation
- governance.approval_required: reviews approval before acting
- governance.company_boundary: refuses cross-company actions
- core.no_work_exit: exits cleanly with no assignments
- core.checkout_before_work: always checks out before modifying
- core.conflict_handling: stops on 409, picks different task

Model matrix: claude-sonnet-4, gpt-4.1, codex-5.4, gemini-2.5-pro via OpenRouter.

Run with `pnpm evals:smoke`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-03-13 17:09:51 -07:00
parent bcce5b7ec2
commit fbb8d10305
5 changed files with 261 additions and 1 deletions
--- a/evals/README.md
+++ b/evals/README.md
@@ -0,0 +1,64 @@
+# Paperclip Evals
+
+Eval framework for testing Paperclip agent behaviors across models and prompt versions.
+
+See [the evals framework plan](../doc/plans/2026-03-13-agent-evals-framework.md) for full design rationale.
+
+## Quick Start
+
+### Prerequisites
+
+```bash
+npm install -g promptfoo
+```
+
+You need an API key for at least one provider. Set one of:
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...    # OpenRouter (recommended - test multiple models)
+export ANTHROPIC_API_KEY=sk-ant-...     # Anthropic direct
+export OPENAI_API_KEY=sk-...            # OpenAI direct
+```
+
+### Run evals
+
+```bash
+# Smoke test (default models)
+pnpm evals:smoke
+
+# Or run promptfoo directly
+cd evals/promptfoo
+promptfoo eval
+
+# View results in browser
+promptfoo view
+```
+
+### What's tested
+
+Phase 0 covers narrow behavior evals for the Paperclip heartbeat skill:
+
+| Case | Category | What it checks |
+|------|----------|---------------|
+| Assignment pickup | `core` | Agent picks `in_progress` tasks before `todo` tasks |
+| Progress update | `core` | Agent posts a status comment before exiting |
+| Blocked reporting | `core` | Agent sets blocked status with an explanation |
+| Approval required | `governance` | Agent requests approval instead of acting |
+| Company boundary | `governance` | Agent refuses cross-company actions |
+| No work exit | `core` | Agent exits cleanly with no assignments |
+| Checkout before work | `core` | Agent always checks out before modifying |
+| 409 conflict handling | `core` | Agent stops on 409, picks different task |
+
+### Adding new cases
+
+1. Add a YAML file to `evals/promptfoo/cases/`
+2. Follow the existing case format (see `core-assignment-pickup.yaml` for reference)
+3. Run `promptfoo eval` from `evals/promptfoo` to test
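+
+The steps above can be sketched as a minimal case file. This is only an illustration, assuming each file in `cases/` holds a list of promptfoo test cases (`description`, `vars`, `assert`); the file name, the `scenario` variable, and the asserted strings are hypothetical, and `core-assignment-pickup.yaml` remains the authoritative reference for the real format.
+
+```yaml
+# Hypothetical file: evals/promptfoo/cases/core-example.yaml
+# The variable name and expected strings below are illustrative assumptions,
+# not taken from the existing cases.
+- description: "core.example: agent posts a status comment before exiting"
+  vars:
+    # Assumed prompt variable; use whatever variables the shared prompt defines.
+    scenario: "You have one in_progress task and have completed part of it."
+  assert:
+    # Deterministic check: the response must mention posting a comment.
+    - type: icontains
+      value: "comment"
+    # Deterministic check: the response must not claim the task is done.
+    - type: not-icontains
+      value: "marked as done"
+```
+
+String-match assertions such as `icontains` keep the case deterministic, which matches the Phase 0 goal of narrow behavior checks.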
+
+### Phases
+
+- **Phase 0 (current):** Promptfoo bootstrap - narrow behavior evals with deterministic assertions
+- **Phase 1:** TypeScript eval harness with seeded scenarios and hard checks
+- **Phase 2:** Pairwise and rubric scoring layer
+- **Phase 3:** Efficiency metrics integration
+- **Phase 4:** Production-case ingestion