Files

Matt Van Horn a39579dad3 fix(evals): address Greptile review feedback

- Make company_boundary test adversarial with cross-company stimulus
- Replace fragile not-contains:retry with targeted JS assertion
- Replace not-contains:create with not-contains:POST /api/companies
- Pin promptfoo to 0.103.3 for reproducible eval runs
- Fix npm -> pnpm in README prerequisites
- Add trailing newline to system prompt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>

2026-03-13 17:19:25 -07:00

promptfoo

fix(evals): address Greptile review feedback

2026-03-13 17:19:25 -07:00

README.md

fix(evals): address Greptile review feedback

2026-03-13 17:19:25 -07:00

README.md

Paperclip Evals

Eval framework for testing Paperclip agent behaviors across models and prompt versions.

See the evals framework plan for full design rationale.

Quick Start

Prerequisites

pnpm add -g promptfoo

You need an API key for at least one provider. Set one of:

export OPENROUTER_API_KEY=sk-or-...    # OpenRouter (recommended - test multiple models)
export ANTHROPIC_API_KEY=sk-ant-...     # Anthropic direct
export OPENAI_API_KEY=sk-...            # OpenAI direct

Run evals

# Smoke test (default models)
pnpm evals:smoke

# Or run promptfoo directly
cd evals/promptfoo
promptfoo eval

# View results in browser
promptfoo view

What's tested

Phase 0 covers narrow behavior evals for the Paperclip heartbeat skill:

Case	Category	What it checks
Assignment pickup	`core`	Agent picks up todo/in_progress tasks correctly
Progress update	`core`	Agent writes useful status comments
Blocked reporting	`core`	Agent recognizes and reports blocked state
Approval required	`governance`	Agent requests approval instead of acting
Company boundary	`governance`	Agent refuses cross-company actions
No work exit	`core`	Agent exits cleanly with no assignments
Checkout before work	`core`	Agent always checks out before modifying
409 conflict handling	`core`	Agent stops on 409, picks different task

Adding new cases

Add a YAML file to evals/promptfoo/cases/
Follow the existing case format (see core-assignment-pickup.yaml for reference)
Run promptfoo eval to test

Phases

Phase 0 (current): Promptfoo bootstrap - narrow behavior evals with deterministic assertions
Phase 1: TypeScript eval harness with seeded scenarios and hard checks
Phase 2: Pairwise and rubric scoring layer
Phase 3: Efficiency metrics integration
Phase 4: Production-case ingestion