aevgarik/paperclip

Commit Graph

Author

SHA1

Message

Date

Matt Van Horn

cc40e1f8e9

refactor(evals): split test cases into tests/*.yaml files

Move inline test cases from promptfooconfig.yaml into separate files
organized by category (core.yaml, governance.yaml). Main config now
uses file://tests/*.yaml glob pattern per promptfoo best practices.

This makes it easier to add new test categories without bloating the
main config, and lets contributors add cases by dropping new YAML
files into tests/.

2026-03-15 12:15:51 -07:00

Matt Van Horn

a39579dad3

fix(evals): address Greptile review feedback

- Make company_boundary test adversarial with cross-company stimulus
- Replace fragile not-contains:retry with targeted JS assertion
- Replace not-contains:create with not-contains:POST /api/companies
- Pin promptfoo to 0.103.3 for reproducible eval runs
- Fix npm -> pnpm in README prerequisites
- Add trailing newline to system prompt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>

2026-03-13 17:19:25 -07:00
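The commit above does not show the assertion it introduced. As a rough illustration of why a bare `not-contains:retry` check is fragile, here is a minimal sketch of a more targeted check. The function name `passesNoRetryCheck` and the list of negation words are assumptions for illustration only, not code from the repository.

```javascript
// Hypothetical helper, not taken from the repository.
// Returns true when the agent output does not announce a retry.
function passesNoRetryCheck(output) {
  const text = output.toLowerCase();
  // Drop negated mentions such as "will not retry" or "without retrying".
  // A plain substring check for "retry" would wrongly fail on these.
  const withoutNegations = text.replace(
    /\b(?:not|never|without|don't|won't)\s+retry(?:ing)?\b/g,
    ""
  );
  // Any remaining "retry" or "retrying" counts as an intent to retry.
  return !/\bretry(?:ing)?\b/.test(withoutNegations);
}
```

With this sketch, an output such as "I will not retry the checkout" passes, while "Got 409, retrying the checkout" fails. A substring check would reject both.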

Matt Van Horn

fbb8d10305

feat(evals): bootstrap promptfoo eval framework (Phase 0)

Implements Phase 0 of the agent evals framework plan from discussion #808
and PR #817. Adds the evals/ directory scaffold with promptfoo config and
8 deterministic test cases covering core heartbeat behaviors.

Test cases:
- core.assignment_pickup: picks in_progress before todo
- core.progress_update: posts status comment before exiting
- core.blocked_reporting: sets blocked status with explanation
- governance.approval_required: reviews approval before acting
- governance.company_boundary: refuses cross-company actions
- core.no_work_exit: exits cleanly with no assignments
- core.checkout_before_work: always checks out before modifying
- core.conflict_handling: stops on 409, picks different task

Model matrix: claude-sonnet-4, gpt-4.1, codex-5.4, gemini-2.5-pro via
OpenRouter. Run with `pnpm evals:smoke`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>

2026-03-13 17:09:51 -07:00
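Several of the test cases listed above describe a task-selection rule: prefer `in_progress` over `todo`, move to a different task after a 409, and exit when nothing is assigned. The sketch below restates that expected behavior as code. The task shape, the function name `pickNextTask`, and the `conflictedIds` parameter are all hypothetical and are not taken from the repository or from the eval configuration.

```javascript
// Hypothetical task shape: { id: string, status: "in_progress" | "todo" | "blocked" }.
// conflictedIds holds tasks whose checkout already returned 409 in this run.
function pickNextTask(tasks, conflictedIds = new Set()) {
  // Never go back to a task that produced a conflict.
  const available = tasks.filter((t) => !conflictedIds.has(t.id));
  // Prefer work already in progress, then fall back to todo.
  // null means there is nothing to do and the agent should exit cleanly.
  return (
    available.find((t) => t.status === "in_progress") ??
    available.find((t) => t.status === "todo") ??
    null
  );
}
```

Given one `todo` task and one `in_progress` task, the sketch returns the `in_progress` one. If that task's id is in `conflictedIds`, it returns the `todo` task instead. With an empty list it returns `null`.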

3 Commits