Move inline test cases from promptfooconfig.yaml into separate files
organized by category (core.yaml, governance.yaml). Main config now
uses file://tests/*.yaml glob pattern per promptfoo best practices.
This makes it easier to add new test categories without bloating the
main config, and lets contributors add cases by dropping new YAML
files into tests/.
- Make company_boundary test adversarial with cross-company stimulus
- Replace fragile not-contains:retry with targeted JS assertion
- Replace not-contains:create with not-contains:POST /api/companies
- Pin promptfoo to 0.103.3 for reproducible eval runs
- Fix npm -> pnpm in README prerequisites
- Add trailing newline to system prompt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Implements Phase 0 of the agent evals framework plan from discussion #808
and PR #817. Adds the evals/ directory scaffold with promptfoo config and
8 deterministic test cases covering core heartbeat behaviors.
Test cases:
- core.assignment_pickup: picks in_progress before todo
- core.progress_update: posts status comment before exiting
- core.blocked_reporting: sets blocked status with explanation
- governance.approval_required: reviews approval before acting
- governance.company_boundary: refuses cross-company actions
- core.no_work_exit: exits cleanly with no assignments
- core.checkout_before_work: always checks out before modifying
- core.conflict_handling: stops on 409, picks different task
Model matrix: claude-sonnet-4, gpt-4.1, codex-5.4, gemini-2.5-pro via
OpenRouter. Run with `pnpm evals:smoke`.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
When creating a new issue with a pre-filled assignee or project (e.g. from
a project page), Tab from the title field now skips over fields that already
have values, going directly to the next empty field or description.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
When creating a new issue with a pre-filled assignee or project (e.g. from
a project page), Tab from the title field now skips over fields that already
have values, going directly to the next empty field or description.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Move plans from doc/plan/ into doc/plans/ and add YYYY-MM-DD date
prefixes to all undated plan files based on document headers or
earliest git commit dates.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sidebar Documentation links were pointing to an internal /docs route.
Updated both mobile and desktop sidebar instances to link to
https://docs.paperclip.ing/ in a new tab instead.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add `worktree:cleanup <name>` command that safely removes a worktree,
its branch, and instance data. Idempotent and safe by default — refuses
to delete branches with unique unmerged commits or worktrees with
uncommitted changes unless --force is passed.
- Support PAPERCLIP_WORKTREES_DIR env var as default for --home across
all worktree commands.
- Support PAPERCLIP_WORKTREE_START_POINT env var as default for
--start-point on worktree:make.
- Auto-prefix worktree names with "paperclip-" if not already present,
so `worktree:make subissues` creates ~/paperclip-subissues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix malformed try/catch/finally blocks in heartbeat executeRun
- Declare activeRunExecutions Set to track in-flight runs
- Add resumeQueuedRuns function and export from heartbeat service
- Add initdbFlags to EmbeddedPostgresCtor type
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- reapOrphanedRuns() now only scans running runs; queued runs are
legitimately absent from runningProcesses (waiting on concurrency
limits or issue locks) so including them caused false process_lost
failures (closes#90)
- Add module-level activeRunExecutions set so non-child-process adapters
(http, openclaw) are protected from the reaper during execution
- Add resumeQueuedRuns() to restart persisted queued runs after a server
restart, called at startup and each periodic tick
- Add outer catch in executeRun() so setup failures (ensureRuntimeState,
resolveWorkspaceForRun, etc.) are recorded as failed runs instead of
leaving them stuck in running state
- Guard resumeQueuedRuns() against paused/terminated/pending_approval agents
- Increase opencode models discovery timeout from 20s to 45s