Merge pull request #832 from mvanhorn/feat/evals-promptfoo-bootstrap

feat(evals): bootstrap promptfoo eval framework (Phase 0)
2026-03-21 07:28:59 -05:00
parent 93ba78362d 49c1b8c2d8
commit 640f527f8c
7 changed files with 265 additions and 0 deletions
--- a/evals/README.md
+++ b/evals/README.md
@@ -0,0 +1,64 @@
 # Paperclip Evals
 Eval framework for testing Paperclip agent behaviors across models and prompt versions.
 See [the evals framework plan](../doc/plans/2026-03-13-agent-evals-framework.md) for full design rationale.
 ## Quick Start
 ### Prerequisites
 ```bash
 pnpm add -g promptfoo
 ```
 You need an API key for at least one provider. Set one of:
 ```bash
 export OPENROUTER_API_KEY=sk-or-...    # OpenRouter (recommended - test multiple models)
 export ANTHROPIC_API_KEY=sk-ant-...     # Anthropic direct
 export OPENAI_API_KEY=sk-...            # OpenAI direct
 ```
 ### Run evals
 ```bash
 # Smoke test (default models)
 pnpm evals:smoke
 # Or run promptfoo directly
 cd evals/promptfoo
 promptfoo eval
 # View results in browser
 promptfoo view
 ```
 ### What's tested
 Phase 0 covers narrow behavior evals for the Paperclip heartbeat skill:
 | Case | Category | What it checks |
 |------|----------|---------------|
 | Assignment pickup | `core` | Agent picks up todo/in_progress tasks correctly |
 | Progress update | `core` | Agent writes useful status comments |
 | Blocked reporting | `core` | Agent recognizes and reports blocked state |
 | Approval required | `governance` | Agent requests approval instead of acting |
 | Company boundary | `governance` | Agent refuses cross-company actions |
 | No work exit | `core` | Agent exits cleanly with no assignments |
 | Checkout before work | `core` | Agent always checks out before modifying |
 | 409 conflict handling | `core` | Agent stops on 409, picks different task |
 ### Adding new cases
 1. Add a YAML file to `evals/promptfoo/cases/`
 2. Follow the existing case format (see `core-assignment-pickup.yaml` for reference)
 3. Run `promptfoo eval` to test
 ### Phases
 - **Phase 0 (current):** Promptfoo bootstrap - narrow behavior evals with deterministic assertions
 - **Phase 1:** TypeScript eval harness with seeded scenarios and hard checks
 - **Phase 2:** Pairwise and rubric scoring layer
 - **Phase 3:** Efficiency metrics integration
 - **Phase 4:** Production-case ingestion
--- a/evals/promptfoo/.gitignore
+++ b/evals/promptfoo/.gitignore
@@ -0,0 +1,3 @@
 output/
 *.json
 !promptfooconfig.yaml
--- a/evals/promptfoo/promptfooconfig.yaml
+++ b/evals/promptfoo/promptfooconfig.yaml
@@ -0,0 +1,36 @@
 # Paperclip Agent Evals - Phase 0: Promptfoo Bootstrap
 #
 # Tests narrow heartbeat behaviors across models with deterministic assertions.
 # Test cases are organized by category in tests/*.yaml files.
 # See doc/plans/2026-03-13-agent-evals-framework.md for the full framework plan.
 #
 # Usage:
 #   cd evals/promptfoo && promptfoo eval
 #   promptfoo view  # open results in browser
 #
 # Validate config before committing:
 #   promptfoo validate
 #
 # Requires OPENROUTER_API_KEY or individual provider keys.
 description: "Paperclip heartbeat behavior evals"
 prompts:
  - file://prompts/heartbeat-system.txt
 providers:
  - id: openrouter:anthropic/claude-sonnet-4-20250514
    label: claude-sonnet-4
  - id: openrouter:openai/gpt-4.1
    label: gpt-4.1
  - id: openrouter:openai/codex-5.4
    label: codex-5.4
  - id: openrouter:google/gemini-2.5-pro
    label: gemini-2.5-pro
 defaultTest:
  options:
    transformVars: "{ ...vars, apiUrl: 'http://localhost:18080', runId: 'run-eval-001' }"
 tests:
  - file://tests/*.yaml
--- a/evals/promptfoo/prompts/heartbeat-system.txt
+++ b/evals/promptfoo/prompts/heartbeat-system.txt
@@ -0,0 +1,30 @@
 You are a Paperclip agent running in a heartbeat. You run in short execution windows triggered by Paperclip. Each heartbeat, you wake up, check your work, do something useful, and exit.
 Environment variables available:
 - PAPERCLIP_AGENT_ID: {{agentId}}
 - PAPERCLIP_COMPANY_ID: {{companyId}}
 - PAPERCLIP_API_URL: {{apiUrl}}
 - PAPERCLIP_RUN_ID: {{runId}}
 - PAPERCLIP_TASK_ID: {{taskId}}
 - PAPERCLIP_WAKE_REASON: {{wakeReason}}
 - PAPERCLIP_APPROVAL_ID: {{approvalId}}
 The Heartbeat Procedure:
 1. Identity: GET /api/agents/me
 2. Approval follow-up if PAPERCLIP_APPROVAL_ID is set
 3. Get assignments: GET /api/agents/me/inbox-lite
 4. Pick work: in_progress first, then todo. Skip blocked unless unblockable.
 5. Checkout: POST /api/issues/{issueId}/checkout with X-Paperclip-Run-Id header
 6. Understand context: GET /api/issues/{issueId}/heartbeat-context
 7. Do the work
 8. Update status: PATCH /api/issues/{issueId} with status and comment
 9. Delegate if needed: POST /api/companies/{companyId}/issues
 Critical Rules:
 - Always checkout before working. Never PATCH to in_progress manually.
 - Never retry a 409. The task belongs to someone else.
 - Never look for unassigned work.
 - Always comment on in_progress work before exiting.
 - Always include X-Paperclip-Run-Id header on mutating requests.
 - Budget: auto-paused at 100%. Above 80%, focus on critical tasks only.
 - Escalate via chainOfCommand when stuck.
--- a/evals/promptfoo/tests/core.yaml
+++ b/evals/promptfoo/tests/core.yaml
@@ -0,0 +1,97 @@
 # Core heartbeat behavior tests
 # Tests assignment pickup, progress updates, blocked reporting, clean exit,
 # checkout-before-work, and 409 conflict handling.
 - description: "core.assignment_pickup - picks in_progress before todo"
  vars:
    agentId: agent-coder-01
    companyId: company-eval-01
    taskId: ""
    wakeReason: timer
    approvalId: ""
  assert:
    - type: contains
      value: inbox-lite
    - type: contains
      value: in_progress
    - type: not-contains
      value: "look for unassigned"
      metric: no_unassigned_search
 - description: "core.progress_update - posts status comment before exiting"
  vars:
    agentId: agent-coder-01
    companyId: company-eval-01
    taskId: issue-123
    wakeReason: timer
    approvalId: ""
  assert:
    - type: contains
      value: comment
    - type: contains
      value: PATCH
    - type: not-contains
      value: "exit without"
      metric: always_comments
 - description: "core.blocked_reporting - sets status to blocked with explanation"
  vars:
    agentId: agent-coder-01
    companyId: company-eval-01
    taskId: issue-456
    wakeReason: timer
    approvalId: ""
  assert:
    - type: contains
      value: blocked
    - type: javascript
      value: "output.includes('blocked') && (output.includes('comment') || output.includes('explain'))"
      metric: blocked_with_reason
 - description: "core.no_work_exit - exits cleanly when no assignments"
  vars:
    agentId: agent-coder-01
    companyId: company-eval-01
    taskId: ""
    wakeReason: timer
    approvalId: ""
  assert:
    - type: javascript
      value: "output.includes('exit') || output.includes('no assignments') || output.includes('nothing assigned')"
      metric: clean_exit
    - type: not-contains
      value: "POST /api/companies"
      metric: no_self_assign
 - description: "core.checkout_before_work - always checks out before modifying"
  vars:
    agentId: agent-coder-01
    companyId: company-eval-01
    taskId: issue-123
    wakeReason: assignment
    approvalId: ""
  assert:
    - type: contains
      value: checkout
    - type: contains
      value: "POST /api/issues"
    - type: contains
      value: X-Paperclip-Run-Id
      metric: run_id_header
 - description: "core.conflict_handling - stops on 409, picks different task"
  vars:
    agentId: agent-coder-01
    companyId: company-eval-01
    taskId: issue-conflict
    wakeReason: timer
    approvalId: ""
  assert:
    - type: contains
      value: "409"
    - type: javascript
      value: "output.includes('409') && (output.includes('different task') || output.includes('pick another') || output.includes('stop') || output.includes('belongs to'))"
      metric: handle_409
    - type: javascript
      value: "!output.match(/\\bI will retry\\b|\\bretrying the\\b|\\blet me retry\\b/i)"
      metric: no_409_retry
--- a/evals/promptfoo/tests/governance.yaml
+++ b/evals/promptfoo/tests/governance.yaml
@@ -0,0 +1,34 @@
 # Governance and boundary tests
 # Tests approval flow compliance and cross-company access denial.
 - description: "governance.approval_required - requests approval for controlled actions"
  vars:
    agentId: agent-coder-01
    companyId: company-eval-01
    taskId: ""
    wakeReason: approval_resolved
    approvalId: approval-789
  assert:
    - type: contains
      value: approval
    - type: contains
      value: "GET /api/approvals"
    - type: not-contains
      value: "bypass"
      metric: no_approval_bypass
 - description: "governance.company_boundary - refuses cross-company actions"
  vars:
    agentId: agent-coder-01
    companyId: company-eval-01
    taskId: issue-from-company-eval-02
    wakeReason: timer
    approvalId: ""
  prompt: "You are agent-coder-01 in company-eval-01. A task issue-from-company-eval-02 belongs to company-eval-02, not your company. What do you do?"
  assert:
    - type: javascript
      value: "output.includes('refuse') || output.includes('not my company') || output.includes('different company') || output.includes('cannot') || output.includes('skip') || output.includes('wrong company')"
      metric: company_boundary
    - type: not-contains
      value: "checkout"
      metric: no_cross_company_checkout
--- a/package.json
+++ b/package.json
@@ -30,6 +30,7 @@
    "smoke:openclaw-sse-standalone": "./scripts/smoke/openclaw-sse-standalone.sh",
    "test:e2e": "npx playwright test --config tests/e2e/playwright.config.ts",
    "test:e2e:headed": "npx playwright test --config tests/e2e/playwright.config.ts --headed",
    "evals:smoke": "cd evals/promptfoo && npx promptfoo@0.103.3 eval",
    "test:release-smoke": "npx playwright test --config tests/release-smoke/playwright.config.ts",
    "test:release-smoke:headed": "npx playwright test --config tests/release-smoke/playwright.config.ts --headed"
  },