paperclip/doc/plans/2026-03-13-agent-evals-framework.md

# Agent Evals Framework Plan

Date: 2026-03-13

## Context

We need evals for the thing Paperclip actually ships:

- agent behavior produced by adapter config
- prompt templates and bootstrap prompts
- skill sets and skill instructions
- model choice
- runtime policy choices that affect outcomes and cost

We do **not** primarily need a fine-tuning pipeline.
We need a regression framework that can answer:

- if we change prompts or skills, do agents still do the right thing?
- if we switch models, what got better, worse, or more expensive?
- if we optimize tokens, did we preserve task outcomes?
- can we grow the suite over time from real Paperclip usage?

This plan is based on:

- `doc/GOAL.md`
- `doc/PRODUCT.md`
- `doc/SPEC-implementation.md`
- `docs/agents-runtime.md`
- `doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`
- Discussion #449: <https://github.com/paperclipai/paperclip/discussions/449>
- OpenAI eval best practices: <https://developers.openai.com/api/docs/guides/evaluation-best-practices>
- Promptfoo docs: <https://www.promptfoo.dev/docs/configuration/test-cases/> and <https://www.promptfoo.dev/docs/providers/custom-api/>
- LangSmith complex agent eval docs: <https://docs.langchain.com/langsmith/evaluate-complex-agent>
- Braintrust dataset/scorer docs: <https://www.braintrust.dev/docs/annotate/datasets> and <https://www.braintrust.dev/docs/evaluate/write-scorers>

## Recommendation

Paperclip should take a **two-stage approach**:

1. **Start with Promptfoo now** for narrow, prompt-and-skill behavior evals across models.
2. **Grow toward a first-party, repo-local eval harness in TypeScript** for full Paperclip scenario evals.

So the recommendation is no longer “skip Promptfoo.” It is:

- use Promptfoo as the fastest bootstrap layer
- keep eval cases and fixtures in this repo
- avoid making Promptfoo config the deepest long-term abstraction

More specifically:

1. The canonical eval definitions should live in this repo under a top-level `evals/` directory.
2. `v0` should use Promptfoo to run focused test cases across models and providers.
3. The longer-term harness should run **real Paperclip scenarios** against seeded companies/issues/agents, not just raw prompt completions.
4. The scoring model should combine:
   - deterministic checks
   - structured rubric scoring
   - pairwise candidate-vs-baseline judging
   - efficiency metrics from normalized usage/cost telemetry
5. The framework should compare **bundles**, not just models.

A bundle is:

- adapter type
- model id
- prompt template(s)
- bootstrap prompt template
- skill allowlist / skill content version
- relevant runtime flags

That is the right unit because that is what actually changes behavior in Paperclip.

## Why This Is The Right Shape

### 1. We need to evaluate system behavior, not only prompt output

Prompt-only tools are useful, but Paperclip’s real failure modes are often:

- wrong issue chosen
- wrong API call sequence
- bad delegation
- failure to respect approval boundaries
- stale session behavior
- over-reading context
- claiming completion without producing artifacts or comments

Those are control-plane behaviors. They require scenario setup, execution, and trace inspection.

### 2. The repo is already TypeScript-first

The existing monorepo already uses:

- `pnpm`
- `tsx`
- `vitest`
- TypeScript across server, UI, shared contracts, and adapters

A TypeScript-first harness will fit the repo and CI better than introducing a Python-first test subsystem as the default path.

Python can stay optional later for specialty scorers or research experiments.

### 3. We need provider/model comparison without vendor lock-in

OpenAI’s guidance is directionally right:

- eval early and often
- use task-specific evals
- log everything
- prefer pairwise/comparison-style judging over open-ended scoring

But OpenAI’s Evals API should not be Paperclip’s primary eval control plane, because our target is explicitly multi-model and multi-provider.

### 4. Hosted eval products are useful, and Promptfoo is the right bootstrap tool

The current tradeoff:

- Promptfoo is very good for local, repo-based prompt/provider matrices and CI integration.
- LangSmith is strong on trajectory-style agent evals.
- Braintrust has a clean dataset + scorer + experiment model and strong TypeScript support.

The community suggestion is directionally right:

- Promptfoo lets us start small
- it supports simple assertions like contains / not-contains / regex / custom JS
- it can run the same cases across multiple models
- it supports OpenRouter
- it can move into CI later

That makes it the best `v0` tool for “did this prompt/skill/model change obviously regress?”

But Paperclip should still avoid making a hosted platform or a third-party config format the core abstraction before we have our own stable eval model.

The right move is:

- start with Promptfoo for quick wins
- keep the data portable and repo-owned
- build a thin first-party harness around Paperclip concepts as the system grows
- optionally export to or integrate with other tools later if useful

## What We Should Evaluate

We should split evals into four layers.

### Layer 1: Deterministic contract evals

These should require no judge model.

Examples:

- agent comments on the assigned issue
- no mutation outside the agent’s company
- approval-required actions do not bypass approval flow
- task transitions are legal
- output contains required structured fields
- artifact links exist when the task required an artifact
- no full-thread refetch on delta-only cases once the API supports it

These are cheap, reliable, and should be the first line of defense.
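
A minimal sketch of what a judge-free contract check could look like, covering the company-boundary and legal-transition invariants above. The `Mutation` shape, the status names, and the transition table are hypothetical placeholders for illustration, not Paperclip’s actual schema.

```ts
// Hypothetical shapes for illustration only; not Paperclip's real schema.
type IssueStatus = "todo" | "in_progress" | "blocked" | "done";

type Mutation = {
  companyId: string;
  issueId: string;
  from: IssueStatus;
  to: IssueStatus;
};

type CheckResult = { pass: boolean; reasons: string[] };

// Assumed transition table; the real rules would come from the product spec.
const LEGAL_TRANSITIONS: Record<IssueStatus, IssueStatus[]> = {
  todo: ["in_progress", "blocked"],
  in_progress: ["blocked", "done"],
  blocked: ["in_progress"],
  done: [],
};

// Returns every violated invariant instead of stopping at the first one,
// so a failing report lists all problems in a single run.
function checkContract(agentCompanyId: string, mutations: Mutation[]): CheckResult {
  const reasons: string[] = [];
  for (const m of mutations) {
    if (m.companyId !== agentCompanyId) {
      reasons.push(`issue ${m.issueId}: mutation outside company ${agentCompanyId}`);
    }
    if (!LEGAL_TRANSITIONS[m.from].includes(m.to)) {
      reasons.push(`issue ${m.issueId}: illegal transition ${m.from} -> ${m.to}`);
    }
  }
  return { pass: reasons.length === 0, reasons };
}
```

Because the check is a pure function over recorded mutations, it can run on every PR without any model spend.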

### Layer 2: Single-step behavior evals

These test narrow behaviors in isolation.

Examples:

- chooses the correct issue from inbox
- writes a reasonable first status comment
- decides to ask for approval instead of acting directly
- delegates to the correct report
- recognizes blocked state and reports it clearly

These are the closest thing to prompt evals, but still framed in Paperclip terms.

### Layer 3: End-to-end scenario evals

These run a full heartbeat or short sequence of heartbeats against a seeded scenario.

Examples:

- new assignment pickup
- long-thread continuation
- mention-triggered clarification
- approval-gated hire request
- manager escalation
- workspace coding task that must leave a meaningful issue update

These should evaluate both final state and trace quality.

### Layer 4: Efficiency and regression evals

These are not “did the answer look good?” evals. They are “did we preserve quality while improving cost/latency?” evals.

Examples:

- normalized input tokens per successful heartbeat
- normalized tokens per completed issue
- session reuse rate
- full-thread reload rate
- wall-clock duration
- cost per successful scenario

This layer is especially important for token optimization work.
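
As a rough sketch, several of these metrics can be derived from a list of per-heartbeat records. The `HeartbeatRun` fields are assumptions, and “tokens per successful heartbeat” is interpreted here as the median over successful heartbeats only; other definitions are possible.

```ts
// Hypothetical per-heartbeat record; field names are assumptions.
type HeartbeatRun = {
  succeeded: boolean;
  inputTokens: number; // assumed to be already normalized across providers
  durationMs: number;
  reusedSession: boolean;
};

// Median of a list of numbers; NaN for an empty list.
function median(values: number[]): number {
  if (values.length === 0) return Number.NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function summarizeEfficiency(runs: HeartbeatRun[]) {
  const successful = runs.filter((r) => r.succeeded);
  const total = runs.length;
  return {
    successRate: total === 0 ? 0 : successful.length / total,
    // Token cost is only meaningful relative to heartbeats that succeeded.
    medianInputTokensPerSuccess: median(successful.map((r) => r.inputTokens)),
    medianDurationMs: median(runs.map((r) => r.durationMs)),
    sessionReuseRate: total === 0 ? 0 : runs.filter((r) => r.reusedSession).length / total,
  };
}
```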

## Core Design

### 1. Canonical object: `EvalCase`

Each eval case should define:

- scenario setup
- target bundle(s)
- execution mode
- expected invariants
- scoring rubric
- tags/metadata

Suggested shape:

```ts
type EvalCase = {
  id: string;
  description: string;
  tags: string[];
  setup: {
    fixture: string;
    agentId: string;
    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
  };
  inputs?: Record<string, unknown>;
  checks: {
    hard: HardCheck[];
    rubric?: RubricCheck[];
    pairwise?: PairwiseCheck[];
  };
  metrics: MetricSpec[];
};
```

The important part is that the case is about a Paperclip scenario, not a standalone prompt string.
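
To make the shape concrete, here is a sketch of one filled-in case. The suggested shape leaves `HardCheck`, `RubricCheck`, `PairwiseCheck`, and `MetricSpec` open, so the definitions below are placeholder assumptions, as are the fixture path, agent id, and check names.

```ts
// Placeholder definitions (assumptions) for the types left open above.
type HardCheck = { kind: string; params?: Record<string, unknown> };
type RubricCheck = { dimension: string; maxScore: 1 | 2 };
type PairwiseCheck = { criteria: string };
type MetricSpec = { name: string; unit: "tokens" | "ms" | "usd" | "ratio" };

type EvalCase = {
  id: string;
  description: string;
  tags: string[];
  setup: {
    fixture: string;
    agentId: string;
    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
  };
  inputs?: Record<string, unknown>;
  checks: {
    hard: HardCheck[];
    rubric?: RubricCheck[];
    pairwise?: PairwiseCheck[];
  };
  metrics: MetricSpec[];
};

// Hypothetical case: every identifier below is illustrative.
const assignmentPickup: EvalCase = {
  id: "core.assignment_pickup.basic",
  description: "Agent picks up a newly assigned issue and posts a first status comment.",
  tags: ["core", "smoke"],
  setup: {
    fixture: "companies/small-team",
    agentId: "agent-engineer-1",
    trigger: "assignment",
  },
  checks: {
    hard: [
      { kind: "comment_on_assigned_issue" },
      { kind: "no_cross_company_mutation" },
    ],
    rubric: [{ dimension: "useful progress communication", maxScore: 2 }],
  },
  metrics: [
    { name: "input_tokens", unit: "tokens" },
    { name: "wall_clock", unit: "ms" },
  ],
};
```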

### 2. Canonical object: `EvalBundle`

Suggested shape:

```ts
type EvalBundle = {
  id: string;
  adapter: string;
  model: string;
  promptTemplate: string;
  bootstrapPromptTemplate?: string;
  skills: string[];
  flags?: Record<string, string | number | boolean>;
};
```

Every comparison run should say which bundle was tested.

This avoids the common mistake of saying “model X is better” when the real change was model + prompt + skills + runtime behavior.
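
One way to keep comparison reports honest is to list exactly which bundle fields differ between baseline and candidate. The helper below is a sketch under two assumptions: skills are treated as an unordered set, and flag key order is irrelevant. The adapter and model names in any usage are hypothetical.

```ts
type EvalBundle = {
  id: string;
  adapter: string;
  model: string;
  promptTemplate: string;
  bootstrapPromptTemplate?: string;
  skills: string[];
  flags?: Record<string, string | number | boolean>;
};

// Fields that affect behavior; `id` is a label and is deliberately excluded.
const BUNDLE_FIELDS = [
  "adapter",
  "model",
  "promptTemplate",
  "bootstrapPromptTemplate",
  "skills",
  "flags",
] as const;

// Order-insensitive serialization so ["a","b"] equals ["b","a"]
// and flag key order does not register as a change.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return JSON.stringify([...value].sort());
  if (typeof value === "object" && value !== null) {
    const entries = Object.entries(value).sort(([a], [b]) => a.localeCompare(b));
    return JSON.stringify(entries);
  }
  return String(JSON.stringify(value));
}

// Returns the names of the fields whose values differ between two bundles.
function changedFields(baseline: EvalBundle, candidate: EvalBundle): string[] {
  return BUNDLE_FIELDS.filter((f) => canonical(baseline[f]) !== canonical(candidate[f]));
}
```

A report that prints `changedFields(...)` next to the results makes it obvious when a “model comparison” also changed the prompt or skill set.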

### 3. Canonical output: `EvalTrace`

We should capture a normalized trace for scoring:

- run ids
- prompts actually sent
- session reuse metadata
- issue mutations
- comments created
- approvals requested
- artifacts created
- token/cost telemetry
- timing
- raw outputs

The scorer layer should never need to scrape ad hoc logs.
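
A possible shape for the trace, mirroring the list above, plus one scorer that reads only normalized fields. All field names here are assumptions; the real type would follow whatever the runtime telemetry actually exposes.

```ts
// Sketch of a normalized trace; every field name is an assumption.
type EvalTrace = {
  runIds: string[];
  promptsSent: string[];
  session: { reused: boolean; sessionId?: string };
  issueMutations: { issueId: string; field: string; from: unknown; to: unknown }[];
  commentsCreated: { issueId: string; body: string }[];
  approvalsRequested: { kind: string }[];
  artifactsCreated: { issueId: string; uri: string }[];
  usage: { inputTokens: number; outputTokens: number; costUsd: number };
  timing: { startedAt: string; durationMs: number };
  rawOutputs: string[];
};

// Example scorer: did the run leave any visible update on the issue?
// It inspects structured fields only and never parses raw logs.
function producedIssueUpdate(trace: EvalTrace, issueId: string): boolean {
  return (
    trace.commentsCreated.some((c) => c.issueId === issueId) ||
    trace.issueMutations.some((m) => m.issueId === issueId)
  );
}
```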

## Scoring Framework

### 1. Hard checks first

Every eval should start with pass/fail checks that can invalidate the run immediately.

Examples:

- touched wrong company
- skipped required approval
- no issue update produced
- returned malformed structured output
- marked task done without required artifact

If a hard check fails, the scenario fails regardless of style or judge score.
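
The gating order can be sketched as follows. The judge is passed in as a function so that it is only invoked when every hard check passes; it is synchronous here to keep the sketch short, whereas a real judge call would be asynchronous. The type and check names are hypothetical.

```ts
type HardCheckResult = { name: string; pass: boolean };

type ScenarioVerdict =
  | { status: "failed"; failedChecks: string[] }
  | { status: "scored"; judgeScore: number };

// Hard checks gate everything: a single failure fails the scenario
// and the (potentially expensive) judge is never called.
function gateOnHardChecks(
  results: HardCheckResult[],
  runJudge: () => number,
): ScenarioVerdict {
  const failedChecks = results.filter((r) => !r.pass).map((r) => r.name);
  if (failedChecks.length > 0) {
    return { status: "failed", failedChecks };
  }
  return { status: "scored", judgeScore: runJudge() };
}
```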

### 2. Rubric scoring second

Rubric scoring should use narrow criteria, not vague “how good was this?” prompts.

Good rubric dimensions:

- task understanding
- governance compliance
- useful progress communication
- correct delegation
- evidence of completion
- concision / unnecessary verbosity

Each rubric dimension should be a small 0-1 or 0-2 decision, not a mushy 1-10 scale.
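
One way to enforce the small scale is to validate whatever the judge returns before it enters a report. This is a sketch with hypothetical type names; it rejects anything that is not an integer within the dimension’s range.

```ts
type RubricDimension = { dimension: string; max: 1 | 2 };
type RubricScore = { dimension: string; score: number; max: 1 | 2 };

// Accepts a number or a numeric string from the judge and rejects
// everything outside the integer range 0..max for that dimension.
function parseRubricScore(dim: RubricDimension, raw: unknown): RubricScore {
  const value =
    typeof raw === "string" && raw.trim() !== "" ? Number(raw.trim()) : raw;
  if (
    typeof value !== "number" ||
    !Number.isInteger(value) ||
    value < 0 ||
    value > dim.max
  ) {
    throw new Error(
      `invalid score for "${dim.dimension}": expected integer 0..${dim.max}`,
    );
  }
  return { dimension: dim.dimension, score: value, max: dim.max };
}
```

Rejecting out-of-range answers outright, rather than clamping them, keeps judge drift visible.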

### 3. Pairwise judging for candidate vs baseline

OpenAI’s eval guidance is right that LLMs are better at discrimination than open-ended generation.

So for non-deterministic quality checks, the default pattern should be:

- run baseline bundle on the case
- run candidate bundle on the same case
- ask a judge model which is better on explicit criteria
- allow `baseline`, `candidate`, or `tie`

This is better than asking a judge for an absolute quality score with no anchor.
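
A sketch of the bookkeeping around the judge call. It assumes the two outputs are shown to the judge as anonymous slots `A` and `B` and that the slot order may be swapped between runs to reduce position bias; that swap is a common practice we are adding here as an assumption, not something this plan requires. The judge call itself is out of scope for the sketch.

```ts
type Verdict = "baseline" | "candidate" | "tie";
type JudgeAnswer = "A" | "B" | "tie";

// Maps the judge's slot-based answer back to baseline/candidate,
// given which slot the candidate output was shown in.
function toVerdict(answer: JudgeAnswer, candidateShownAs: "A" | "B"): Verdict {
  if (answer === "tie") return "tie";
  return answer === candidateShownAs ? "candidate" : "baseline";
}

// Counts wins, losses, and ties across all judged cases.
function tally(verdicts: Verdict[]): Record<Verdict, number> {
  const counts: Record<Verdict, number> = { baseline: 0, candidate: 0, tie: 0 };
  for (const v of verdicts) counts[v] += 1;
  return counts;
}
```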

### 4. Efficiency scoring is separate

Do not bury efficiency inside a single blended quality score.

Record it separately:

- quality score
- cost score
- latency score

Then compute a summary decision such as:

- candidate is acceptable only if quality is non-inferior and efficiency is improved

That is much easier to reason about than one magic number.
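
A minimal sketch of such a summary decision, keeping the axes separate until the final step. The non-inferiority margin and the definition of “improved” (neither cost nor latency worse, at least one strictly better) are assumptions chosen for illustration.

```ts
type BundleSummary = { quality: number; costUsd: number; latencyMs: number };

type Decision = {
  acceptable: boolean;
  qualityNonInferior: boolean;
  efficiencyImproved: boolean;
};

// qualityMargin is an assumed tolerance: how far quality may drop
// before the candidate counts as inferior.
function decide(
  baseline: BundleSummary,
  candidate: BundleSummary,
  qualityMargin = 0.02,
): Decision {
  const qualityNonInferior = candidate.quality >= baseline.quality - qualityMargin;
  const notWorse =
    candidate.costUsd <= baseline.costUsd && candidate.latencyMs <= baseline.latencyMs;
  const strictlyBetter =
    candidate.costUsd < baseline.costUsd || candidate.latencyMs < baseline.latencyMs;
  const efficiencyImproved = notWorse && strictlyBetter;
  // Both axes are reported so a rejection explains itself.
  return {
    acceptable: qualityNonInferior && efficiencyImproved,
    qualityNonInferior,
    efficiencyImproved,
  };
}
```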

## Suggested Decision Rule

For PR gating:

1. No hard-check regressions.
2. No significant regression on required scenario pass rate.
3. No significant regression on key rubric dimensions.
4. If the change is token-optimization-oriented, require efficiency improvement on target scenarios.

For deeper comparison reports, show:

- pass rate
- pairwise wins/losses/ties
- median normalized tokens
- median wall-clock time
- cost deltas

## Dataset Strategy

We should explicitly build the dataset from three sources.

### 1. Hand-authored seed cases

Start here.

These should cover core product invariants:

- assignment pickup
- status update
- blocked reporting
- delegation
- approval request
- cross-company access denial
- issue comment follow-up

These are small, clear, and stable.

### 2. Production-derived cases

Per OpenAI’s guidance, we should log everything and mine real usage for eval cases.

Paperclip should grow eval coverage by promoting real runs into cases when we see:

- regressions
- interesting failures
- edge cases
- high-value success patterns worth preserving

The initial version can be manual:

- take a real run
- redact/normalize it
- convert it into an `EvalCase`

Later we can automate trace-to-case generation.
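
For the manual redact/normalize step, a small helper along these lines could be a starting point. The patterns below (emails, bearer tokens, UUIDs) are assumptions about what might appear in a run and are not an exhaustive redaction policy; real traces would need review before being committed.

```ts
// Pattern/replacement pairs; illustrative, not a complete redaction policy.
const REDACTIONS: [RegExp, string][] = [
  [/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "<email>"],
  [/\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g, "Bearer <token>"],
  [/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, "<uuid>"],
];

// Applies every redaction in order and returns the cleaned text.
// Note: all UUIDs collapse to the same placeholder, so a case that
// depends on distinguishing ids would need a stable per-id mapping.
function redact(text: string): string {
  return REDACTIONS.reduce(
    (acc, [pattern, replacement]) => acc.replace(pattern, replacement),
    text,
  );
}
```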

### 3. Adversarial and guardrail cases

These should intentionally probe failure modes:

- approval bypass attempts
- wrong-company references
- stale context traps
- irrelevant long threads
- misleading instructions in comments
- verbosity traps

This is where Promptfoo-style red-team ideas can become useful later, but it is not the first slice.

## Repo Layout

Recommended initial layout:

```text
evals/
  README.md
  promptfoo/
    promptfooconfig.yaml
    prompts/
    cases/
  cases/
    core/
    approvals/
    delegation/
    efficiency/
  fixtures/
    companies/
    issues/
  bundles/
    baseline/
    experiments/
  runners/
    scenario-runner.ts
    compare-runner.ts
  scorers/
    hard/
    rubric/
    pairwise/
  judges/
    rubric-judge.ts
    pairwise-judge.ts
  lib/
    types.ts
    traces.ts
    metrics.ts
  reports/
    .gitignore
```

Why top-level `evals/`:

- it makes evals feel first-class
- it avoids hiding them inside `server/`, since they span adapters and runtime behavior
- it leaves room for both TS and optional Python helpers later
- it gives us a clean place for Promptfoo `v0` config plus the later first-party runner

## Execution Model

The harness should support three modes.

### Mode A: Cheap local smoke

Purpose:

- run on PRs
- keep cost low
- catch obvious regressions

Characteristics:

- 5 to 20 cases
- 1 or 2 bundles
- mostly hard checks and narrow rubrics

### Mode B: Candidate vs baseline compare

Purpose:

- evaluate a prompt/skill/model change before merge

Characteristics:

- paired runs
- pairwise judging enabled
- quality + efficiency diff report

### Mode C: Nightly broader matrix

Purpose:

- compare multiple models and bundles
- grow historical benchmark data

Characteristics:

- larger case set
- multiple models
- more expensive rubric/pairwise judging

## CI and Developer Workflow

Suggested commands:

```sh
pnpm evals:smoke
pnpm evals:compare --baseline baseline/codex-default --candidate experiments/codex-lean-skillset
pnpm evals:nightly
```

PR behavior:

- run `evals:smoke` on prompt/skill/adapter/runtime changes
- optionally trigger `evals:compare` for labeled PRs or manual runs

Nightly behavior:

- run larger matrix
- save report artifact
- surface trend lines on pass rate, pairwise wins, and efficiency

## Framework Comparison

### Promptfoo

Best use for Paperclip:

- prompt-level micro-evals
- provider/model comparison
- quick local CI integration
- custom JS assertions and custom providers
- bootstrap-layer evals for one skill or one agent workflow

What changed in this recommendation:

- Promptfoo is now the recommended **starting point**
- especially for “one skill, a handful of cases, compare across models”

Why it still should not be the only long-term system:

- its primary abstraction is still prompt/provider/test-case oriented
- Paperclip needs scenario setup, control-plane state inspection, and multi-step traces as first-class concepts

Recommendation:

- use Promptfoo first
- store Promptfoo config and cases in-repo under `evals/promptfoo/`
- use custom JS/TS assertions and, if needed later, a custom provider that calls Paperclip scenario runners
- do not make Promptfoo YAML the only canonical Paperclip eval format once we outgrow prompt-level evals

### LangSmith

What it gets right:

- final response evals
- trajectory evals
- single-step evals

Why not the primary system today:

- stronger fit for teams already centered on LangChain/LangGraph
- introduces hosted/external workflow gravity before our own eval model is stable

Recommendation:

- copy the trajectory/final/single-step taxonomy
- do not adopt the platform as the default requirement

### Braintrust

What it gets right:

- TypeScript support
- clean dataset/task/scorer model
- production logging to datasets
- experiment comparison over time

Why not the primary system today:

- still externalizes the canonical dataset and review workflow
- we are not yet mature enough for hosted experiment management to define the shape of the system

Recommendation:

- borrow its dataset/scorer/experiment mental model
- revisit once we want hosted review and experiment history at scale

### OpenAI Evals / Evals API

What it gets right:

- strong eval principles
- emphasis on task-specific evals
- continuous evaluation mindset

Why not the primary system:

- Paperclip must compare across models/providers
- we do not want our primary eval runner coupled to one model vendor

Recommendation:

- use the guidance
- do not use it as the core Paperclip eval runtime

## First Implementation Slice

The first version should be intentionally small.

### Phase 0: Promptfoo bootstrap

Build:

- `evals/promptfoo/promptfooconfig.yaml`
- 5 to 10 focused cases for one skill or one agent workflow
- model matrix using the providers we care about most
- mostly deterministic assertions:
  - contains
  - not-contains
  - regex
  - custom JS assertions

Target scope:

- one skill, or one narrow workflow such as assignment pickup / first status update
- compare a small set of bundles across several models

Success criteria:

- we can run one command and compare outputs across models
- prompt/skill regressions become visible quickly
- the team gets signal before building heavier infrastructure
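
As an illustration of the custom-assertion style, the function below checks a first status comment. It follows the general pattern of Promptfoo’s JavaScript assertions (an output string and a context with test variables in, a pass/score/reason result out), but the local type definitions, the `issueKey` variable, and the completion-word list are our own assumptions; how the function is exported and referenced from `promptfooconfig.yaml` should be taken from the Promptfoo docs.

```ts
// Local stand-ins for the assertion context and result; assumptions,
// not imported from Promptfoo.
type AssertionContext = { vars: Record<string, unknown> };
type GradingResult = { pass: boolean; score: number; reason: string };

// A first status comment should reference the assigned issue and
// should not claim the work is already finished.
function assertFirstStatusComment(
  output: string,
  context: AssertionContext,
): GradingResult {
  const issueKey = String(context.vars["issueKey"] ?? "");
  const problems: string[] = [];

  if (issueKey === "") {
    problems.push("test case is missing the issueKey variable");
  } else if (!output.includes(issueKey)) {
    problems.push(`comment does not mention ${issueKey}`);
  }
  if (/\b(done|completed|finished)\b/i.test(output)) {
    problems.push("comment claims completion in a first status update");
  }

  const pass = problems.length === 0;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? "first status comment looks valid" : problems.join("; "),
  };
}
```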

### Phase 1: Skeleton and core cases

Build:

- `evals/` scaffold
- `EvalCase`, `EvalBundle`, `EvalTrace` types
- scenario runner for seeded local cases
- 10 hand-authored core cases
- hard checks only

Target cases:

- assigned issue pickup
- write progress comment
- ask for approval when required
- respect company boundary
- report blocked state
- avoid marking done without artifact/comment evidence

Success criteria:

- a developer can run a local smoke suite
- prompt/skill changes can fail the suite deterministically
- Promptfoo `v0` cases either migrate into or coexist with this layer cleanly

### Phase 2: Pairwise and rubric layer

Build:

- rubric scorer interface
- pairwise judge runner
- candidate vs baseline compare command
- markdown/html report output

Success criteria:

- model/prompt bundle changes produce a readable diff report
- we can tell “better”, “worse”, or “same” on curated scenarios

### Phase 3: Efficiency integration

Build:

- normalized token/cost metrics recorded in eval traces
- cost and latency comparisons
- efficiency gates for token optimization work

Dependency:

- this should align with the telemetry normalization work in `doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`

Success criteria:

- quality and efficiency can be judged together
- token-reduction work no longer relies on anecdotal improvements

### Phase 4: Production-case ingestion

Build:

- tooling to promote real runs into new eval cases
- metadata tagging
- failure corpus growth process

Success criteria:

- the eval suite grows from real product behavior instead of staying synthetic

## Initial Case Categories

We should start with these categories:

1. `core.assignment_pickup`
2. `core.progress_update`
3. `core.blocked_reporting`
4. `governance.approval_required`
5. `governance.company_boundary`
6. `delegation.correct_report`
7. `threads.long_context_followup`
8. `efficiency.no_unnecessary_reloads`

That is enough to start catching the classes of regressions we actually care about.

## Important Guardrails

### 1. Do not rely on judge models alone

Every important scenario needs deterministic checks first.

### 2. Do not gate PRs on a single noisy score

Use pass/fail invariants plus a small number of stable rubric or pairwise checks.

### 3. Do not confuse benchmark score with product quality

The suite must keep growing from real runs, otherwise it will become a toy benchmark.

### 4. Do not evaluate only final output

Trajectory matters for agents:

- did they call the right Paperclip APIs?
- did they ask for approval?
- did they communicate progress?
- did they choose the right issue?

### 5. Do not make the framework vendor-shaped

Our eval model should survive changes in:

- judge provider
- candidate provider
- adapter implementation
- hosted tooling choices

## Open Questions

1. Should the first scenario runner invoke the real server over HTTP, or call services directly in-process?
   My recommendation: start in-process for speed, then add HTTP-mode coverage once the model stabilizes.

2. Should we support Python scorers in v1?
   My recommendation: no. Keep v1 all-TypeScript.

3. Should we commit baseline outputs?
   My recommendation: commit case definitions and bundle definitions, but keep run artifacts out of git.

4. Should we add hosted experiment tracking immediately?
   My recommendation: no. Revisit after the local harness proves useful.

## Final Recommendation

Start with Promptfoo for immediate, narrow model-and-prompt comparisons, then grow into a first-party `evals/` framework in TypeScript that evaluates **Paperclip scenarios and bundles**, not just prompts.

Use this structure:

- Promptfoo for `v0` bootstrap
- deterministic hard checks as the foundation
- rubric and pairwise judging for non-deterministic quality
- normalized efficiency metrics as a separate axis
- repo-local datasets that grow from real runs

Use external tools selectively:

- Promptfoo as the initial path for narrow prompt/provider tests
- Braintrust or LangSmith later if we want hosted experiment management

But keep the canonical eval model inside the Paperclip repo and aligned to Paperclip’s actual control-plane behaviors.