Merge pull request #817 from paperclipai/docs/agent-evals-framework-plan

docs: add agent evals framework plan
2026-03-13 15:17:40 -05:00
parent aaadbdc144 db81a06386
commit 8eacc9c697
1 changed files with 775 additions and 0 deletions
--- a/doc/plans/2026-03-13-agent-evals-framework.md
+++ b/doc/plans/2026-03-13-agent-evals-framework.md
@@ -0,0 +1,775 @@
+# Agent Evals Framework Plan
+
+Date: 2026-03-13
+
+## Context
+
+We need evals for the thing Paperclip actually ships:
+
+- agent behavior produced by adapter config
+- prompt templates and bootstrap prompts
+- skill sets and skill instructions
+- model choice
+- runtime policy choices that affect outcomes and cost
+
+We do **not** primarily need a fine-tuning pipeline.
+We need a regression framework that can answer:
+
+- if we change prompts or skills, do agents still do the right thing?
+- if we switch models, what got better, worse, or more expensive?
+- if we optimize tokens, did we preserve task outcomes?
+- can we grow the suite over time from real Paperclip usage?
+
+This plan is based on:
+
+- `doc/GOAL.md`
+- `doc/PRODUCT.md`
+- `doc/SPEC-implementation.md`
+- `docs/agents-runtime.md`
+- `doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`
+- Discussion #449: <https://github.com/paperclipai/paperclip/discussions/449>
+- OpenAI eval best practices: <https://developers.openai.com/api/docs/guides/evaluation-best-practices>
+- Promptfoo docs: <https://www.promptfoo.dev/docs/configuration/test-cases/> and <https://www.promptfoo.dev/docs/providers/custom-api/>
+- LangSmith complex agent eval docs: <https://docs.langchain.com/langsmith/evaluate-complex-agent>
+- Braintrust dataset/scorer docs: <https://www.braintrust.dev/docs/annotate/datasets> and <https://www.braintrust.dev/docs/evaluate/write-scorers>
+
+## Recommendation
+
+Paperclip should take a **two-stage approach**:
+
+1. **Start with Promptfoo now** for narrow, prompt-and-skill behavior evals across models.
+2. **Grow toward a first-party, repo-local eval harness in TypeScript** for full Paperclip scenario evals.
+
+So the recommendation is no longer “skip Promptfoo.” It is:
+
+- use Promptfoo as the fastest bootstrap layer
+- keep eval cases and fixtures in this repo
+- avoid making Promptfoo config the deepest long-term abstraction
+
+More specifically:
+
+1. The canonical eval definitions should live in this repo under a top-level `evals/` directory.
+2. `v0` should use Promptfoo to run focused test cases across models and providers.
+3. The longer-term harness should run **real Paperclip scenarios** against seeded companies/issues/agents, not just raw prompt completions.
+4. The scoring model should combine:
+   - deterministic checks
+   - structured rubric scoring
+   - pairwise candidate-vs-baseline judging
+   - efficiency metrics from normalized usage/cost telemetry
+5. The framework should compare **bundles**, not just models.
+
+A bundle is:
+
+- adapter type
+- model id
+- prompt template(s)
+- bootstrap prompt template
+- skill allowlist / skill content version
+- relevant runtime flags
+
+That is the right unit because that is what actually changes behavior in Paperclip.
+
+## Why This Is The Right Shape
+
+### 1. We need to evaluate system behavior, not only prompt output
+
+Prompt-only tools are useful, but Paperclip’s real failure modes are often:
+
+- wrong issue chosen
+- wrong API call sequence
+- bad delegation
+- failure to respect approval boundaries
+- stale session behavior
+- over-reading context
+- claiming completion without producing artifacts or comments
+
+Those are control-plane behaviors. They require scenario setup, execution, and trace inspection.
+
+### 2. The repo is already TypeScript-first
+
+The existing monorepo already uses:
+
+- `pnpm`
+- `tsx`
+- `vitest`
+- TypeScript across server, UI, shared contracts, and adapters
+
+A TypeScript-first harness will fit the repo and CI better than introducing a Python-first test subsystem as the default path.
+
+Python can stay optional later for specialty scorers or research experiments.
+
+### 3. We need provider/model comparison without vendor lock-in
+
+OpenAI’s guidance is directionally right:
+
+- eval early and often
+- use task-specific evals
+- log everything
+- prefer pairwise/comparison-style judging over open-ended scoring
+
+But OpenAI’s Evals API is not the right primary control plane for Paperclip, because our target is explicitly multi-model and multi-provider.
+
+### 4. Hosted eval products are useful, and Promptfoo is the right bootstrap tool
+
+The current tradeoff:
+
+- Promptfoo is very good for local, repo-based prompt/provider matrices and CI integration.
+- LangSmith is strong on trajectory-style agent evals.
+- Braintrust has a clean dataset + scorer + experiment model and strong TypeScript support.
+
+The community suggestion is directionally right:
+
+- Promptfoo lets us start small
+- it supports simple assertions like contains / not-contains / regex / custom JS
+- it can run the same cases across multiple models
+- it supports OpenRouter
+- it can move into CI later
+
+That makes it the best `v0` tool for “did this prompt/skill/model change obviously regress?”
+
+But Paperclip should still avoid making a hosted platform or a third-party config format the core abstraction before we have our own stable eval model.
+
+The right move is:
+
+- start with Promptfoo for quick wins
+- keep the data portable and repo-owned
+- build a thin first-party harness around Paperclip concepts as the system grows
+- optionally export to or integrate with other tools later if useful
+
+## What We Should Evaluate
+
+We should split evals into four layers.
+
+### Layer 1: Deterministic contract evals
+
+These should require no judge model.
+
+Examples:
+
+- agent comments on the assigned issue
+- no mutation outside the agent’s company
+- approval-required actions do not bypass approval flow
+- task transitions are legal
+- output contains required structured fields
+- artifact links exist when the task required an artifact
+- no full-thread refetch on delta-only cases once the API supports it
+
+These are cheap, reliable, and should be the first line of defense.
+
+### Layer 2: Single-step behavior evals
+
+These test narrow behaviors in isolation.
+
+Examples:
+
+- chooses the correct issue from inbox
+- writes a reasonable first status comment
+- decides to ask for approval instead of acting directly
+- delegates to the correct report
+- recognizes blocked state and reports it clearly
+
+These are the closest thing to prompt evals, but still framed in Paperclip terms.
+
+### Layer 3: End-to-end scenario evals
+
+These run a full heartbeat or short sequence of heartbeats against a seeded scenario.
+
+Examples:
+
+- new assignment pickup
+- long-thread continuation
+- mention-triggered clarification
+- approval-gated hire request
+- manager escalation
+- workspace coding task that must leave a meaningful issue update
+
+These should evaluate both final state and trace quality.
+
+### Layer 4: Efficiency and regression evals
+
+These are not “did the answer look good?” evals. They are “did we preserve quality while improving cost/latency?” evals.
+
+Examples:
+
+- normalized input tokens per successful heartbeat
+- normalized tokens per completed issue
+- session reuse rate
+- full-thread reload rate
+- wall-clock duration
+- cost per successful scenario
+
+This layer is especially important for token optimization work.
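
A minimal sketch of how a few of these metrics could be derived from per-heartbeat records. The `HeartbeatRecord` shape and its field names are assumptions, not an existing Paperclip type. The sketch also assumes that tokens spent on failed heartbeats are charged against the successful ones, which is one possible definition among several.

```ts
// Hypothetical per-heartbeat record; field names are illustrative only.
type HeartbeatRecord = {
  succeeded: boolean;
  inputTokens: number; // assumed to be normalized across providers already
  reusedSession: boolean;
  fullThreadReload: boolean;
};

type EfficiencySummary = {
  inputTokensPerSuccessfulHeartbeat: number | null;
  sessionReuseRate: number | null;
  fullThreadReloadRate: number | null;
};

// Aggregates heartbeat records into efficiency metrics.
// Returns null for a metric when its denominator would be zero.
function summarizeEfficiency(records: HeartbeatRecord[]): EfficiencySummary {
  if (records.length === 0) {
    return {
      inputTokensPerSuccessfulHeartbeat: null,
      sessionReuseRate: null,
      fullThreadReloadRate: null,
    };
  }
  const successCount = records.filter((r) => r.succeeded).length;
  // Assumption: all tokens count, including those spent on failed heartbeats.
  const totalInputTokens = records.reduce((sum, r) => sum + r.inputTokens, 0);
  const reusedCount = records.filter((r) => r.reusedSession).length;
  const reloadCount = records.filter((r) => r.fullThreadReload).length;
  return {
    inputTokensPerSuccessfulHeartbeat:
      successCount > 0 ? totalInputTokens / successCount : null,
    sessionReuseRate: reusedCount / records.length,
    fullThreadReloadRate: reloadCount / records.length,
  };
}
```

Keeping the three numbers separate, rather than folding them into one score, matches the way this layer is meant to be read next to the quality results.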
+
+## Core Design
+
+### 1. Canonical object: `EvalCase`
+
+Each eval case should define:
+
+- scenario setup
+- target bundle(s)
+- execution mode
+- expected invariants
+- scoring rubric
+- tags/metadata
+
+Suggested shape:
+
+```ts
+type EvalCase = {
+  id: string;
+  description: string;
+  tags: string[];
+  setup: {
+    fixture: string;
+    agentId: string;
+    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
+  };
+  inputs?: Record<string, unknown>;
+  checks: {
+    hard: HardCheck[];
+    rubric?: RubricCheck[];
+    pairwise?: PairwiseCheck[];
+  };
+  metrics: MetricSpec[];
+};
+```
+
+The important part is that the case is about a Paperclip scenario, not a standalone prompt string.
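
As an illustration, one hand-authored case in this shape might look like the sketch below. `HardCheck`, `RubricCheck`, `PairwiseCheck`, and `MetricSpec` are not defined in this plan, so the minimal definitions here are assumptions, as are the fixture path, the agent id, and the check ids.

```ts
// Hypothetical minimal definitions; the plan leaves these four types open.
type HardCheck = { id: string; description: string };
type RubricCheck = { id: string; question: string; maxScore: 1 | 2 };
type PairwiseCheck = { id: string; criteria: string };
type MetricSpec = { id: string; unit: "tokens" | "ms" | "usd" };

// Same shape as the EvalCase suggested above.
type EvalCase = {
  id: string;
  description: string;
  tags: string[];
  setup: {
    fixture: string;
    agentId: string;
    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
  };
  inputs?: Record<string, unknown>;
  checks: {
    hard: HardCheck[];
    rubric?: RubricCheck[];
    pairwise?: PairwiseCheck[];
  };
  metrics: MetricSpec[];
};

// Illustrative case; the fixture path, agent id, and check ids are made up.
const assignmentPickupCase: EvalCase = {
  id: "core.assignment_pickup.basic",
  description: "Agent picks up a newly assigned issue and posts a first status comment.",
  tags: ["core", "smoke"],
  setup: {
    fixture: "fixtures/companies/example-company",
    agentId: "example-engineer-agent",
    trigger: "assignment",
  },
  checks: {
    hard: [
      { id: "comments_on_assigned_issue", description: "A comment exists on the assigned issue." },
      { id: "no_cross_company_mutation", description: "No mutation outside the agent's company." },
    ],
    rubric: [
      {
        id: "useful_progress_communication",
        question: "Does the comment state what will happen next?",
        maxScore: 1,
      },
    ],
  },
  metrics: [{ id: "normalized_input_tokens", unit: "tokens" }],
};
```

Nothing in the case names a model or a prompt; those belong to the bundle under test.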
+
+### 2. Canonical object: `EvalBundle`
+
+Suggested shape:
+
+```ts
+type EvalBundle = {
+  id: string;
+  adapter: string;
+  model: string;
+  promptTemplate: string;
+  bootstrapPromptTemplate?: string;
+  skills: string[];
+  flags?: Record<string, string | number | boolean>;
+};
+```
+
+Every comparison run should say which bundle was tested.
+
+This avoids the common mistake of saying “model X is better” when the real change was model + prompt + skills + runtime behavior.
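
One way to keep comparisons honest is to report exactly which bundle fields differ between baseline and candidate. The sketch below assumes the `EvalBundle` shape suggested above; `diffBundles` is a hypothetical helper, and the adapter, model, template, and skill names are placeholders.

```ts
type EvalBundle = {
  id: string;
  adapter: string;
  model: string;
  promptTemplate: string;
  bootstrapPromptTemplate?: string;
  skills: string[];
  flags?: Record<string, string | number | boolean>;
};

// Lists the bundle fields (other than `id`) whose values differ.
// Comparison is by JSON serialization, so array order and key order matter.
function diffBundles(baseline: EvalBundle, candidate: EvalBundle): string[] {
  const fields: (keyof EvalBundle)[] = [
    "adapter",
    "model",
    "promptTemplate",
    "bootstrapPromptTemplate",
    "skills",
    "flags",
  ];
  return fields.filter(
    (field) => JSON.stringify(baseline[field]) !== JSON.stringify(candidate[field]),
  );
}

// Hypothetical bundles; every value except the two ids is a placeholder.
const baselineBundle: EvalBundle = {
  id: "baseline/codex-default",
  adapter: "example-adapter",
  model: "example-model-a",
  promptTemplate: "prompts/default.md",
  skills: ["issue-triage", "status-update", "delegation"],
};

const candidateBundle: EvalBundle = {
  ...baselineBundle,
  id: "experiments/codex-lean-skillset",
  skills: ["issue-triage", "status-update"],
};

// Only the skill list changed, so this evaluates to ["skills"].
const changedFields = diffBundles(baselineBundle, candidateBundle);
```

A compare report that prints `changedFields` next to the results makes it harder to attribute a skill change to the model.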
+
+### 3. Canonical output: `EvalTrace`
+
+We should capture a normalized trace for scoring:
+
+- run ids
+- prompts actually sent
+- session reuse metadata
+- issue mutations
+- comments created
+- approvals requested
+- artifacts created
+- token/cost telemetry
+- timing
+- raw outputs
+
+The scorer layer should never need to scrape ad hoc logs.
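
A possible starting shape for the trace, written as a sketch. Every field name below is an assumption derived from the list above rather than an existing Paperclip type, and `hasCompletionEvidence` is a hypothetical scorer helper included only to show a scorer reading structured fields instead of logs.

```ts
// Hypothetical normalized trace; one field group per item in the list above.
type EvalTrace = {
  runIds: string[];
  promptsSent: string[];
  session: { sessionId: string; reused: boolean };
  issueMutations: { issueId: string; field: string; from: string | null; to: string | null }[];
  commentsCreated: { issueId: string; body: string }[];
  approvalsRequested: { approvalId: string; action: string }[];
  artifactsCreated: { issueId: string; uri: string }[];
  usage: { inputTokens: number; outputTokens: number; costUsd: number };
  timing: { startedAt: string; durationMs: number };
  rawOutputs: string[];
};

// Example scorer input: did the run leave any evidence on the given issue?
// It reads only structured trace fields, never raw log text.
function hasCompletionEvidence(trace: EvalTrace, issueId: string): boolean {
  const commented = trace.commentsCreated.some((c) => c.issueId === issueId);
  const producedArtifact = trace.artifactsCreated.some((a) => a.issueId === issueId);
  return commented || producedArtifact;
}
```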
+
+## Scoring Framework
+
+### 1. Hard checks first
+
+Every eval should start with pass/fail checks that can invalidate the run immediately.
+
+Examples:
+
+- touched wrong company
+- skipped required approval
+- no issue update produced
+- returned malformed structured output
+- marked task done without required artifact
+
+If a hard check fails, the scenario fails regardless of style or judge score.
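
The examples above can be sketched as plain predicate functions over facts extracted from a trace. `TraceFacts`, the check ids, and `runHardChecks` are hypothetical names for illustration; no judge model is involved.

```ts
// Hypothetical facts a runner could extract from a normalized trace.
type TraceFacts = {
  companyIdsTouched: string[];
  approvalRequired: boolean;
  approvalsRequested: number;
  issueUpdates: number;
  markedDone: boolean;
  artifactRequired: boolean;
  artifactsCreated: number;
};

// A hard check returns a failure reason, or null when the check passes.
type HardCheck = {
  id: string;
  run: (facts: TraceFacts, agentCompanyId: string) => string | null;
};

const hardChecks: HardCheck[] = [
  {
    id: "company_boundary",
    run: (facts, agentCompanyId) =>
      facts.companyIdsTouched.every((id) => id === agentCompanyId)
        ? null
        : "touched wrong company",
  },
  {
    id: "approval_required",
    run: (facts) =>
      facts.approvalRequired && facts.approvalsRequested === 0
        ? "skipped required approval"
        : null,
  },
  {
    id: "issue_update",
    run: (facts) => (facts.issueUpdates > 0 ? null : "no issue update produced"),
  },
  {
    id: "done_needs_artifact",
    run: (facts) =>
      facts.markedDone && facts.artifactRequired && facts.artifactsCreated === 0
        ? "marked task done without required artifact"
        : null,
  },
];

type HardCheckResult = { passed: boolean; failures: { id: string; reason: string }[] };

// Runs every check and collects failures; a single failure fails the scenario.
function runHardChecks(
  checks: HardCheck[],
  facts: TraceFacts,
  agentCompanyId: string,
): HardCheckResult {
  const failures: { id: string; reason: string }[] = [];
  for (const check of checks) {
    const reason = check.run(facts, agentCompanyId);
    if (reason !== null) failures.push({ id: check.id, reason });
  }
  return { passed: failures.length === 0, failures };
}
```

Collecting all failures, instead of stopping at the first, keeps the report useful when one change breaks several invariants at once.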
+
+### 2. Rubric scoring second
+
+Rubric scoring should use narrow criteria, not vague “how good was this?” prompts.
+
+Good rubric dimensions:
+
+- task understanding
+- governance compliance
+- useful progress communication
+- correct delegation
+- evidence of completion
+- concision / unnecessary verbosity
+
+Each rubric dimension should be scored as a small 0-1 or 0-2 decision, not on a mushy 1-10 scale.
+
+### 3. Pairwise judging for candidate vs baseline
+
+OpenAI’s eval guidance is right that LLMs are better at discriminating between options than at open-ended generation.
+
+So for non-deterministic quality checks, the default pattern should be:
+
+- run baseline bundle on the case
+- run candidate bundle on the same case
+- ask a judge model which is better on explicit criteria
+- allow `baseline`, `candidate`, or `tie`
+
+This is better than asking a judge for an absolute quality score with no anchor.
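
A sketch of the pairwise step. It adds one technique that this plan does not specify: each pair is judged in both presentation orders, and a win only counts when the two orders agree, since LLM judges can favor whichever answer is shown first. The judge is injected as a plain function, so the sketch is provider-neutral; `PositionalJudge`, `judgeBothOrders`, and `tallyVerdicts` are hypothetical names.

```ts
type Verdict = "baseline" | "candidate" | "tie";

// A judge sees two outputs in a fixed order and picks one.
// In a real harness this would wrap a judge-model call; here it is injected.
type PositionalJudge = (first: string, second: string) => "first" | "second" | "tie";

// Judges the pair in both orders and only awards a win when both agree.
// Disagreement between orders is treated as a tie.
function judgeBothOrders(
  judge: PositionalJudge,
  baselineOutput: string,
  candidateOutput: string,
): Verdict {
  const forward = judge(baselineOutput, candidateOutput); // baseline shown first
  const reverse = judge(candidateOutput, baselineOutput); // candidate shown first
  const forwardVerdict: Verdict =
    forward === "first" ? "baseline" : forward === "second" ? "candidate" : "tie";
  const reverseVerdict: Verdict =
    reverse === "first" ? "candidate" : reverse === "second" ? "baseline" : "tie";
  return forwardVerdict === reverseVerdict ? forwardVerdict : "tie";
}

// Counts wins, losses, and ties across a set of cases.
function tallyVerdicts(verdicts: Verdict[]): Record<Verdict, number> {
  const counts: Record<Verdict, number> = { baseline: 0, candidate: 0, tie: 0 };
  for (const verdict of verdicts) counts[verdict] += 1;
  return counts;
}
```

Running each pair twice doubles judge cost, so this may fit the compare and nightly modes better than the PR smoke suite.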
+
+### 4. Efficiency scoring is separate
+
+Do not bury efficiency inside a single blended quality score.
+
+Record it separately:
+
+- quality score
+- cost score
+- latency score
+
+Then compute a summary decision such as:
+
+- candidate is acceptable only if quality is non-inferior and efficiency is improved
+
+That is much easier to reason about than one magic number.
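
A minimal sketch of such a summary decision, keeping the three axes separate until the final step. The `BundleScores` shape, the `qualityMargin` parameter, and the reading of "efficiency is improved" as "no worse on cost or latency, and strictly better on at least one" are all assumptions.

```ts
// Hypothetical per-bundle summary; quality is assumed to be in the range 0..1.
type BundleScores = {
  quality: number;
  costUsd: number;
  latencyMs: number;
};

// Candidate is acceptable only if quality is non-inferior (within a margin)
// and efficiency improved without getting worse on either efficiency axis.
function isAcceptable(
  baseline: BundleScores,
  candidate: BundleScores,
  qualityMargin: number,
): boolean {
  const qualityNonInferior = candidate.quality >= baseline.quality - qualityMargin;
  const noWorse =
    candidate.costUsd <= baseline.costUsd && candidate.latencyMs <= baseline.latencyMs;
  const strictlyBetter =
    candidate.costUsd < baseline.costUsd || candidate.latencyMs < baseline.latencyMs;
  return qualityNonInferior && noWorse && strictlyBetter;
}
```

Because the inputs stay separate, a report can still show why a candidate was rejected: quality dropped, or efficiency did not move.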
+
+## Suggested Decision Rule
+
+For PR gating:
+
+1. No hard-check regressions.
+2. No significant regression on required scenario pass rate.
+3. No significant regression on key rubric dimensions.
+4. If the change is token-optimization-oriented, require efficiency improvement on target scenarios.
+
+For deeper comparison reports, show:
+
+- pass rate
+- pairwise wins/losses/ties
+- median normalized tokens
+- median wall-clock time
+- cost deltas
+
+## Dataset Strategy
+
+We should explicitly build the dataset from three sources.
+
+### 1. Hand-authored seed cases
+
+Start here.
+
+These should cover core product invariants:
+
+- assignment pickup
+- status update
+- blocked reporting
+- delegation
+- approval request
+- cross-company access denial
+- issue comment follow-up
+
+These are small, clear, and stable.
+
+### 2. Production-derived cases
+
+Per OpenAI’s guidance, we should log everything and mine real usage for eval cases.
+
+Paperclip should grow eval coverage by promoting real runs into cases when we see:
+
+- regressions
+- interesting failures
+- edge cases
+- high-value success patterns worth preserving
+
+The initial version can be manual:
+
+- take a real run
+- redact/normalize it
+- convert it into an `EvalCase`
+
+Later we can automate trace-to-case generation.
+
+### 3. Adversarial and guardrail cases
+
+These should intentionally probe failure modes:
+
+- approval bypass attempts
+- wrong-company references
+- stale context traps
+- irrelevant long threads
+- misleading instructions in comments
+- verbosity traps
+
+This is where Promptfoo-style red-team ideas can become useful later, but it is not the first slice.
+
+## Repo Layout
+
+Recommended initial layout:
+
+```text
+evals/
+  README.md
+  promptfoo/
+    promptfooconfig.yaml
+    prompts/
+    cases/
+  cases/
+    core/
+    approvals/
+    delegation/
+    efficiency/
+  fixtures/
+    companies/
+    issues/
+  bundles/
+    baseline/
+    experiments/
+  runners/
+    scenario-runner.ts
+    compare-runner.ts
+  scorers/
+    hard/
+    rubric/
+    pairwise/
+  judges/
+    rubric-judge.ts
+    pairwise-judge.ts
+  lib/
+    types.ts
+    traces.ts
+    metrics.ts
+  reports/
+    .gitignore
+```
+
+Why top-level `evals/`:
+
+- it makes evals feel first-class
+- it avoids hiding them inside `server/`, since they span adapters and runtime behavior
+- it leaves room for both TS and optional Python helpers later
+- it gives us a clean place for Promptfoo `v0` config plus the later first-party runner
+
+## Execution Model
+
+The harness should support three modes.
+
+### Mode A: Cheap local smoke
+
+Purpose:
+
+- run on PRs
+- keep cost low
+- catch obvious regressions
+
+Characteristics:
+
+- 5 to 20 cases
+- 1 or 2 bundles
+- mostly hard checks and narrow rubrics
+
+### Mode B: Candidate vs baseline compare
+
+Purpose:
+
+- evaluate a prompt/skill/model change before merge
+
+Characteristics:
+
+- paired runs
+- pairwise judging enabled
+- quality + efficiency diff report
+
+### Mode C: Nightly broader matrix
+
+Purpose:
+
+- compare multiple models and bundles
+- grow historical benchmark data
+
+Characteristics:
+
+- larger case set
+- multiple models
+- more expensive rubric/pairwise judging
+
+## CI and Developer Workflow
+
+Suggested commands:
+
+```sh
+pnpm evals:smoke
+pnpm evals:compare --baseline baseline/codex-default --candidate experiments/codex-lean-skillset
+pnpm evals:nightly
+```
+
+PR behavior:
+
+- run `evals:smoke` on prompt/skill/adapter/runtime changes
+- optionally trigger `evals:compare` for labeled PRs or manual runs
+
+Nightly behavior:
+
+- run larger matrix
+- save report artifact
+- surface trend lines on pass rate, pairwise wins, and efficiency
+
+## Framework Comparison
+
+### Promptfoo
+
+Best use for Paperclip:
+
+- prompt-level micro-evals
+- provider/model comparison
+- quick local CI integration
+- custom JS assertions and custom providers
+- bootstrap-layer evals for one skill or one agent workflow
+
+What changed in this recommendation:
+
+- Promptfoo is now the recommended **starting point**
+- especially for “one skill, a handful of cases, compare across models”
+
+Why it still should not be the only long-term system:
+
+- its primary abstraction is still prompt/provider/test-case oriented
+- Paperclip needs scenario setup, control-plane state inspection, and multi-step traces as first-class concepts
+
+Recommendation:
+
+- use Promptfoo first
+- store Promptfoo config and cases in-repo under `evals/promptfoo/`
+- use custom JS/TS assertions and, if needed later, a custom provider that calls Paperclip scenario runners
+- do not make Promptfoo YAML the only canonical Paperclip eval format once we outgrow prompt-level evals
+
+### LangSmith
+
+What it gets right:
+
+- final response evals
+- trajectory evals
+- single-step evals
+
+Why not the primary system today:
+
+- stronger fit for teams already centered on LangChain/LangGraph
+- introduces hosted/external workflow gravity before our own eval model is stable
+
+Recommendation:
+
+- copy the trajectory/final/single-step taxonomy
+- do not adopt the platform as the default requirement
+
+### Braintrust
+
+What it gets right:
+
+- TypeScript support
+- clean dataset/task/scorer model
+- production logging to datasets
+- experiment comparison over time
+
+Why not the primary system today:
+
+- still externalizes the canonical dataset and review workflow
+- we are not yet at the maturity where hosted experiment management should define the shape of the system
+
+Recommendation:
+
+- borrow its dataset/scorer/experiment mental model
+- revisit once we want hosted review and experiment history at scale
+
+### OpenAI Evals / Evals API
+
+What it gets right:
+
+- strong eval principles
+- emphasis on task-specific evals
+- continuous evaluation mindset
+
+Why not the primary system:
+
+- Paperclip must compare across models/providers
+- we do not want our primary eval runner coupled to one model vendor
+
+Recommendation:
+
+- use the guidance
+- do not use it as the core Paperclip eval runtime
+
+## First Implementation Slice
+
+The first version should be intentionally small.
+
+### Phase 0: Promptfoo bootstrap
+
+Build:
+
+- `evals/promptfoo/promptfooconfig.yaml`
+- 5 to 10 focused cases for one skill or one agent workflow
+- model matrix using the providers we care about most
+- mostly deterministic assertions:
+  - contains
+  - not-contains
+  - regex
+  - custom JS assertions
+
+Target scope:
+
+- one skill, or one narrow workflow such as assignment pickup / first status update
+- compare a small set of bundles across several models
+
+Success criteria:
+
+- we can run one command and compare outputs across models
+- prompt/skill regressions become visible quickly
+- the team gets signal before building heavier infrastructure
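
One of the custom assertions for this phase could be sketched as below, for an approval-gated workflow. As we understand Promptfoo's docs, a JavaScript assertion may return an object with `pass`, `score`, and `reason`; the wiring into `promptfooconfig.yaml` is left out here. The function name and both regular expressions are assumptions, and real cases would need phrasing tuned to the actual prompts.

```ts
// Result shape assumed to match what Promptfoo accepts from a JS assertion.
type AssertionResult = { pass: boolean; score: number; reason: string };

// Passes when the output asks for approval and does not claim the gated
// action was already carried out. Both patterns are illustrative only.
function assertAsksForApproval(output: string): AssertionResult {
  const asksForApproval =
    /\b(request(ing|ed)?|ask(ing|ed)?\s+for)\s+approval\b/i.test(output);
  const claimsActionDone = /\b(hired|offer sent)\b/i.test(output);

  if (claimsActionDone) {
    return { pass: false, score: 0, reason: "Output claims the gated action was completed." };
  }
  if (!asksForApproval) {
    return { pass: false, score: 0, reason: "Output does not ask for approval." };
  }
  return { pass: true, score: 1, reason: "Output asks for approval before acting." };
}
```

Keyword checks like this are brittle, which is part of why they suit `v0` only: they catch obvious regressions cheaply and leave finer judgments to later phases.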
+
+### Phase 1: Skeleton and core cases
+
+Build:
+
+- `evals/` scaffold
+- `EvalCase`, `EvalBundle`, `EvalTrace` types
+- scenario runner for seeded local cases
+- 10 hand-authored core cases
+- hard checks only
+
+Target cases:
+
+- assigned issue pickup
+- write progress comment
+- ask for approval when required
+- respect company boundary
+- report blocked state
+- avoid marking done without artifact/comment evidence
+
+Success criteria:
+
+- a developer can run a local smoke suite
+- prompt/skill changes can fail the suite deterministically
+- Promptfoo `v0` cases either migrate into or coexist with this layer cleanly
+
+### Phase 2: Pairwise and rubric layer
+
+Build:
+
+- rubric scorer interface
+- pairwise judge runner
+- candidate vs baseline compare command
+- markdown/html report output
+
+Success criteria:
+
+- model/prompt bundle changes produce a readable diff report
+- we can tell “better”, “worse”, or “same” on curated scenarios
+
+### Phase 3: Efficiency integration
+
+Build:
+
+- normalized token/cost metrics in eval traces
+- cost and latency comparisons
+- efficiency gates for token optimization work
+
+Dependency:
+
+- this should align with the telemetry normalization work in `doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`
+
+Success criteria:
+
+- quality and efficiency can be judged together
+- token-reduction work no longer relies on anecdotal improvements
+
+### Phase 4: Production-case ingestion
+
+Build:
+
+- tooling to promote real runs into new eval cases
+- metadata tagging
+- failure corpus growth process
+
+Success criteria:
+
+- the eval suite grows from real product behavior instead of staying synthetic
+
+## Initial Case Categories
+
+We should start with these categories:
+
+1. `core.assignment_pickup`
+2. `core.progress_update`
+3. `core.blocked_reporting`
+4. `governance.approval_required`
+5. `governance.company_boundary`
+6. `delegation.correct_report`
+7. `threads.long_context_followup`
+8. `efficiency.no_unnecessary_reloads`
+
+That is enough to start catching the classes of regressions we actually care about.
+
+## Important Guardrails
+
+### 1. Do not rely on judge models alone
+
+Every important scenario needs deterministic checks first.
+
+### 2. Do not gate PRs on a single noisy score
+
+Use pass/fail invariants plus a small number of stable rubric or pairwise checks.
+
+### 3. Do not confuse benchmark score with product quality
+
+The suite must keep growing from real runs, otherwise it will become a toy benchmark.
+
+### 4. Do not evaluate only final output
+
+Trajectory matters for agents:
+
+- did they call the right Paperclip APIs?
+- did they ask for approval?
+- did they communicate progress?
+- did they choose the right issue?
+
+### 5. Do not make the framework vendor-shaped
+
+Our eval model should survive changes in:
+
+- judge provider
+- candidate provider
+- adapter implementation
+- hosted tooling choices
+
+## Open Questions
+
+1. Should the first scenario runner invoke the real server over HTTP, or call services directly in-process?
+   My recommendation: start in-process for speed, then add HTTP-mode coverage once the model stabilizes.
+
+2. Should we support Python scorers in v1?
+   My recommendation: no. Keep v1 all-TypeScript.
+
+3. Should we commit baseline outputs?
+   My recommendation: commit case definitions and bundle definitions, but keep run artifacts out of git.
+
+4. Should we add hosted experiment tracking immediately?
+   My recommendation: no. Revisit after the local harness proves useful.
+
+## Final Recommendation
+
+Start with Promptfoo for immediate, narrow model-and-prompt comparisons, then grow into a first-party `evals/` framework in TypeScript that evaluates **Paperclip scenarios and bundles**, not just prompts.
+
+Use this structure:
+
+- Promptfoo for `v0` bootstrap
+- deterministic hard checks as the foundation
+- rubric and pairwise judging for non-deterministic quality
+- normalized efficiency metrics as a separate axis
+- repo-local datasets that grow from real runs
+
+Use external tools selectively:
+
+- Promptfoo as the initial path for narrow prompt/provider tests
+- Braintrust or LangSmith later if we want hosted experiment management
+
+But keep the canonical eval model inside the Paperclip repo and aligned to Paperclip’s actual control-plane behaviors.