feat(evals): bootstrap promptfoo eval framework (Phase 0)

Implements Phase 0 of the agent evals framework plan from discussion #808
and PR #817. Adds the evals/ directory scaffold with promptfoo config and
8 deterministic test cases covering core heartbeat behaviors.

Test cases:
- core.assignment_pickup: picks in_progress before todo
- core.progress_update: posts status comment before exiting
- core.blocked_reporting: sets blocked status with explanation
- governance.approval_required: reviews approval before acting
- governance.company_boundary: refuses cross-company actions
- core.no_work_exit: exits cleanly with no assignments
- core.checkout_before_work: always checks out before modifying
- core.conflict_handling: stops on 409, picks different task

Model matrix: claude-sonnet-4, gpt-4.1, codex-5.4, gemini-2.5-pro via
OpenRouter. Run with `pnpm evals:smoke`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
Matt Van Horn
2026-03-13 17:09:51 -07:00
parent bcce5b7ec2
commit fbb8d10305
5 changed files with 261 additions and 1 deletions

View File

@@ -31,7 +31,8 @@
"smoke:openclaw-docker-ui": "./scripts/smoke/openclaw-docker-ui.sh",
"smoke:openclaw-sse-standalone": "./scripts/smoke/openclaw-sse-standalone.sh",
"test:e2e": "npx playwright test --config tests/e2e/playwright.config.ts",
"test:e2e:headed": "npx playwright test --config tests/e2e/playwright.config.ts --headed"
"test:e2e:headed": "npx playwright test --config tests/e2e/playwright.config.ts --headed",
"evals:smoke": "cd evals/promptfoo && npx promptfoo@latest eval"
},
"devDependencies": {
"@changesets/cli": "^2.30.0",