docs: add agent evals framework plan
775
doc/plans/2026-03-13-agent-evals-framework.md
Normal file
@@ -0,0 +1,775 @@
# Agent Evals Framework Plan

Date: 2026-03-13

## Context

We need evals for the thing Paperclip actually ships:

- agent behavior produced by adapter config
- prompt templates and bootstrap prompts
- skill sets and skill instructions
- model choice
- runtime policy choices that affect outcomes and cost

We do **not** primarily need a fine-tuning pipeline.
We need a regression framework that can answer:

- if we change prompts or skills, do agents still do the right thing?
- if we switch models, what got better, worse, or more expensive?
- if we optimize tokens, did we preserve task outcomes?
- can we grow the suite over time from real Paperclip usage?

This plan is based on:

- `doc/GOAL.md`
- `doc/PRODUCT.md`
- `doc/SPEC-implementation.md`
- `docs/agents-runtime.md`
- `doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`
- Discussion #449: <https://github.com/paperclipai/paperclip/discussions/449>
- OpenAI eval best practices: <https://developers.openai.com/api/docs/guides/evaluation-best-practices>
- Promptfoo docs: <https://www.promptfoo.dev/docs/configuration/test-cases/> and <https://www.promptfoo.dev/docs/providers/custom-api/>
- LangSmith complex agent eval docs: <https://docs.langchain.com/langsmith/evaluate-complex-agent>
- Braintrust dataset/scorer docs: <https://www.braintrust.dev/docs/annotate/datasets> and <https://www.braintrust.dev/docs/evaluate/write-scorers>

## Recommendation

Paperclip should take a **two-stage approach**:

1. **Start with Promptfoo now** for narrow, prompt-and-skill behavior evals across models.
2. **Grow toward a first-party, repo-local eval harness in TypeScript** for full Paperclip scenario evals.

So the recommendation is no longer “skip Promptfoo.” It is:

- use Promptfoo as the fastest bootstrap layer
- keep eval cases and fixtures in this repo
- avoid making Promptfoo config the deepest long-term abstraction

More specifically:

1. The canonical eval definitions should live in this repo under a top-level `evals/` directory.
2. `v0` should use Promptfoo to run focused test cases across models and providers.
3. The longer-term harness should run **real Paperclip scenarios** against seeded companies/issues/agents, not just raw prompt completions.
4. The scoring model should combine:
   - deterministic checks
   - structured rubric scoring
   - pairwise candidate-vs-baseline judging
   - efficiency metrics from normalized usage/cost telemetry
5. The framework should compare **bundles**, not just models.

A bundle is:

- adapter type
- model id
- prompt template(s)
- bootstrap prompt template
- skill allowlist / skill content version
- relevant runtime flags

That is the right unit because that is what actually changes behavior in Paperclip.

## Why This Is The Right Shape

### 1. We need to evaluate system behavior, not only prompt output

Prompt-only tools are useful, but Paperclip’s real failure modes are often:

- wrong issue chosen
- wrong API call sequence
- bad delegation
- failure to respect approval boundaries
- stale session behavior
- over-reading context
- claiming completion without producing artifacts or comments

Those are control-plane behaviors. They require scenario setup, execution, and trace inspection.

### 2. The repo is already TypeScript-first

The existing monorepo already uses:

- `pnpm`
- `tsx`
- `vitest`
- TypeScript across server, UI, shared contracts, and adapters

A TypeScript-first harness will fit the repo and CI better than introducing a Python-first test subsystem as the default path.

Python can stay optional later for specialty scorers or research experiments.

### 3. We need provider/model comparison without vendor lock-in

OpenAI’s guidance is directionally right:

- eval early and often
- use task-specific evals
- log everything
- prefer pairwise/comparison-style judging over open-ended scoring

But OpenAI’s Evals API is not the right control plane for Paperclip as the primary system because our target is explicitly multi-model and multi-provider.

### 4. Hosted eval products are useful, and Promptfoo is the right bootstrap tool

The current tradeoff:

- Promptfoo is very good for local, repo-based prompt/provider matrices and CI integration.
- LangSmith is strong on trajectory-style agent evals.
- Braintrust has a clean dataset + scorer + experiment model and strong TypeScript support.

The community suggestion is directionally right:

- Promptfoo lets us start small
- it supports simple assertions like contains / not-contains / regex / custom JS
- it can run the same cases across multiple models
- it supports OpenRouter
- it can move into CI later

That makes it the best `v0` tool for “did this prompt/skill/model change obviously regress?”

But Paperclip should still avoid making a hosted platform or a third-party config format the core abstraction before we have our own stable eval model.

The right move is:

- start with Promptfoo for quick wins
- keep the data portable and repo-owned
- build a thin first-party harness around Paperclip concepts as the system grows
- optionally export to or integrate with other tools later if useful

## What We Should Evaluate

We should split evals into four layers.

### Layer 1: Deterministic contract evals

These should require no judge model.

Examples:

- agent comments on the assigned issue
- no mutation outside the agent’s company
- approval-required actions do not bypass approval flow
- task transitions are legal
- output contains required structured fields
- artifact links exist when the task required an artifact
- no full-thread refetch on delta-only cases once the API supports it

These are cheap, reliable, and should be the first line of defense.
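
As an illustration of how small such a check can be, here is a minimal sketch of the "task transitions are legal" example. The status names and the transition table are hypothetical and chosen only for the example; the real Paperclip issue lifecycle may differ.

```typescript
// Hypothetical status names and transition table, for illustration only.
type IssueStatus = "todo" | "in_progress" | "blocked" | "done";

const LEGAL_TRANSITIONS: Record<IssueStatus, IssueStatus[]> = {
  todo: ["in_progress"],
  in_progress: ["blocked", "done"],
  blocked: ["in_progress"],
  done: [],
};

// One status change observed during a run (assumed shape).
type StatusChange = { issueId: string; from: IssueStatus; to: IssueStatus };

// Returns the illegal transitions; an empty array means the check passes.
function findIllegalTransitions(changes: StatusChange[]): StatusChange[] {
  return changes.filter((c) => LEGAL_TRANSITIONS[c.from].indexOf(c.to) === -1);
}
```

No judge model is involved: the check is a pure function over recorded state changes, so it is cheap enough to run on every case.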

### Layer 2: Single-step behavior evals

These test narrow behaviors in isolation.

Examples:

- chooses the correct issue from inbox
- writes a reasonable first status comment
- decides to ask for approval instead of acting directly
- delegates to the correct report
- recognizes blocked state and reports it clearly

These are the closest thing to prompt evals, but still framed in Paperclip terms.

### Layer 3: End-to-end scenario evals

These run a full heartbeat or short sequence of heartbeats against a seeded scenario.

Examples:

- new assignment pickup
- long-thread continuation
- mention-triggered clarification
- approval-gated hire request
- manager escalation
- workspace coding task that must leave a meaningful issue update

These should evaluate both final state and trace quality.

### Layer 4: Efficiency and regression evals

These are not “did the answer look good?” evals. They are “did we preserve quality while improving cost/latency?” evals.

Examples:

- normalized input tokens per successful heartbeat
- normalized tokens per completed issue
- session reuse rate
- full-thread reload rate
- wall-clock duration
- cost per successful scenario

This layer is especially important for token optimization work.
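
A rough sketch of how a few of these metrics might be derived from per-heartbeat records follows. The `HeartbeatRun` shape and its field names are assumptions made for the example, and counting failed runs in the token numerator is one possible reading of "per successful heartbeat", not a settled definition.

```typescript
// Hypothetical per-heartbeat record; field names are assumptions.
type HeartbeatRun = {
  succeeded: boolean;
  inputTokens: number; // already normalized across providers
  reusedSession: boolean;
  fullThreadReload: boolean;
};

type EfficiencySummary = {
  inputTokensPerSuccess: number | null; // null when nothing succeeded
  sessionReuseRate: number;
  fullThreadReloadRate: number;
};

function summarizeEfficiency(runs: HeartbeatRun[]): EfficiencySummary {
  const successes = runs.filter((r) => r.succeeded).length;
  const totalInput = runs.reduce((sum, r) => sum + r.inputTokens, 0);
  const rate = (n: number) => (runs.length === 0 ? 0 : n / runs.length);
  return {
    // Failed runs still cost tokens, so they stay in the numerator here.
    inputTokensPerSuccess: successes === 0 ? null : totalInput / successes,
    sessionReuseRate: rate(runs.filter((r) => r.reusedSession).length),
    fullThreadReloadRate: rate(runs.filter((r) => r.fullThreadReload).length),
  };
}
```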

## Core Design

## 1. Canonical object: `EvalCase`

Each eval case should define:

- scenario setup
- target bundle(s)
- execution mode
- expected invariants
- scoring rubric
- tags/metadata

Suggested shape:

```ts
// HardCheck, RubricCheck, PairwiseCheck, and MetricSpec are placeholders
// that this plan does not define yet.
type EvalCase = {
  id: string;
  description: string;
  tags: string[];
  // Scenario setup: which seeded fixture to load and how the agent is triggered.
  setup: {
    fixture: string;
    agentId: string;
    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
  };
  inputs?: Record<string, unknown>;
  checks: {
    hard: HardCheck[]; // deterministic pass/fail invariants
    rubric?: RubricCheck[]; // narrow judged criteria
    pairwise?: PairwiseCheck[]; // candidate-vs-baseline comparisons
  };
  metrics: MetricSpec[]; // efficiency metrics to record
};
```

The important part is that the case is about a Paperclip scenario, not a standalone prompt string.
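
To make the shape concrete, here is a sketch of one filled-in case plus a small structural validator a loader might run. The supporting types (`HardCheck`, `RubricCheck`, `PairwiseCheck`, `MetricSpec`), the fixture path, the agent id, and the "at least one hard check" rule are all assumptions for illustration; the plan does not specify them.

```typescript
// Hypothetical supporting types; the plan names them but does not define them.
type HardCheck = { kind: "comment_on_assigned_issue" } | { kind: "no_cross_company_mutation" };
type RubricCheck = { dimension: string; maxScore: 1 | 2 };
type PairwiseCheck = { criterion: string };
type MetricSpec = "input_tokens" | "wall_clock_ms" | "cost_usd";

type EvalCase = {
  id: string;
  description: string;
  tags: string[];
  setup: {
    fixture: string;
    agentId: string;
    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
  };
  inputs?: Record<string, unknown>;
  checks: { hard: HardCheck[]; rubric?: RubricCheck[]; pairwise?: PairwiseCheck[] };
  metrics: MetricSpec[];
};

const assignmentPickup: EvalCase = {
  id: "core.assignment_pickup.basic",
  description: "Agent picks up a newly assigned issue and leaves a first status comment.",
  tags: ["core", "smoke"],
  setup: {
    fixture: "fixtures/companies/small-team", // hypothetical fixture path
    agentId: "agent-engineer-1", // hypothetical agent id
    trigger: "assignment",
  },
  checks: {
    hard: [{ kind: "comment_on_assigned_issue" }, { kind: "no_cross_company_mutation" }],
    rubric: [{ dimension: "useful progress communication", maxScore: 2 }],
  },
  metrics: ["input_tokens", "wall_clock_ms"],
};

// Minimal structural validation before a case is executed (assumed rules).
function validateCase(c: EvalCase): string[] {
  const problems: string[] = [];
  if (c.id.trim() === "") problems.push("id must not be empty");
  if (c.checks.hard.length === 0) problems.push("at least one hard check is required");
  return problems;
}
```

Requiring a hard check on every case is one way to enforce the "deterministic checks first" principle at load time rather than by review.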

## 2. Canonical object: `EvalBundle`

Suggested shape:

```ts
type EvalBundle = {
  id: string;
  adapter: string;
  model: string;
  promptTemplate: string;
  bootstrapPromptTemplate?: string;
  skills: string[];
  flags?: Record<string, string | number | boolean>;
};
```

Every comparison run should say which bundle was tested.

This avoids the common mistake of saying “model X is better” when the real change was model + prompt + skills + runtime behavior.
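
One way a compare report could guard against that mistake is to state exactly which bundle fields differ between baseline and candidate. The sketch below assumes the `EvalBundle` shape suggested above; `diffBundles`, the adapter name, and the model ids are hypothetical.

```typescript
type EvalBundle = {
  id: string;
  adapter: string;
  model: string;
  promptTemplate: string;
  bootstrapPromptTemplate?: string;
  skills: string[];
  flags?: Record<string, string | number | boolean>;
};

// Lists the bundle fields (other than id) that differ between two bundles.
function diffBundles(a: EvalBundle, b: EvalBundle): string[] {
  const fields: (keyof EvalBundle)[] = [
    "adapter",
    "model",
    "promptTemplate",
    "bootstrapPromptTemplate",
    "skills",
    "flags",
  ];
  // JSON comparison is enough for a sketch; it is sensitive to key and array order.
  return fields.filter((f) => JSON.stringify(a[f]) !== JSON.stringify(b[f]));
}

// Illustrative bundles; adapter and model values are made up.
const baselineBundle: EvalBundle = {
  id: "baseline/codex-default",
  adapter: "codex",
  model: "model-a",
  promptTemplate: "default",
  skills: ["triage", "status"],
};
const leanBundle: EvalBundle = {
  ...baselineBundle,
  id: "experiments/codex-lean-skillset",
  skills: ["status"],
};
```

Here `diffBundles(baselineBundle, leanBundle)` reports only `skills`, so any difference in results can be attributed to the skill set rather than to the model.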

## 3. Canonical output: `EvalTrace`

We should capture a normalized trace for scoring:

- run ids
- prompts actually sent
- session reuse metadata
- issue mutations
- comments created
- approvals requested
- artifacts created
- token/cost telemetry
- timing
- raw outputs

The scorer layer should never need to scrape ad hoc logs.
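
The list above could translate into a type roughly like the following. This is a sketch only: every field name is an assumption, and it is not an existing Paperclip type. The two helpers show the kind of question a scorer should be able to answer from the trace alone.

```typescript
// Hypothetical trace shape mirroring the list above; names are illustrative.
type EvalTrace = {
  runIds: string[];
  promptsSent: string[];
  session: { reused: boolean; sessionId?: string };
  issueMutations: { issueId: string; companyId: string; field: string; to: string }[];
  commentsCreated: { issueId: string; body: string }[];
  approvalsRequested: { action: string }[];
  artifactsCreated: { issueId: string; uri: string }[];
  usage: { inputTokens: number; outputTokens: number; costUsd: number };
  timing: { startedAtMs: number; finishedAtMs: number };
  rawOutputs: string[];
};

function wallClockMs(trace: EvalTrace): number {
  return trace.timing.finishedAtMs - trace.timing.startedAtMs;
}

function commentedOnIssue(trace: EvalTrace, issueId: string): boolean {
  return trace.commentsCreated.some((c) => c.issueId === issueId);
}

// Made-up example values.
const exampleTrace: EvalTrace = {
  runIds: ["run-1"],
  promptsSent: ["You have been assigned issue-1."],
  session: { reused: false },
  issueMutations: [{ issueId: "issue-1", companyId: "co-1", field: "status", to: "in_progress" }],
  commentsCreated: [{ issueId: "issue-1", body: "Picking this up now." }],
  approvalsRequested: [],
  artifactsCreated: [],
  usage: { inputTokens: 1200, outputTokens: 150, costUsd: 0.01 },
  timing: { startedAtMs: 1000, finishedAtMs: 5200 },
  rawOutputs: ["..."],
};
```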

## Scoring Framework

## 1. Hard checks first

Every eval should start with pass/fail checks that can invalidate the run immediately.

Examples:

- touched wrong company
- skipped required approval
- no issue update produced
- returned malformed structured output
- marked task done without required artifact

If a hard check fails, the scenario fails regardless of style or judge score.
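
A minimal sketch of that precedence rule, assuming a simplified trace summary: the `TraceSummary` fields and the check names are hypothetical, and the judge score is carried along for reporting but never consulted for the pass/fail result.

```typescript
// Hypothetical summary of what a run did; field names are assumptions.
type TraceSummary = {
  agentCompanyId: string;
  mutatedCompanyIds: string[];
  issueUpdates: number;
  markedDone: boolean;
  artifactRequired: boolean;
  artifacts: number;
};

type NamedHardCheck = { name: string; passes: (t: TraceSummary) => boolean };

const hardChecks: NamedHardCheck[] = [
  {
    name: "stays_in_company",
    passes: (t) => t.mutatedCompanyIds.every((id) => id === t.agentCompanyId),
  },
  { name: "produces_issue_update", passes: (t) => t.issueUpdates > 0 },
  {
    name: "artifact_before_done",
    passes: (t) => !(t.markedDone && t.artifactRequired && t.artifacts === 0),
  },
];

type ScenarioResult = { passed: boolean; failedChecks: string[]; judgeScore: number | null };

function scoreScenario(t: TraceSummary, judgeScore: number | null): ScenarioResult {
  const failedChecks = hardChecks.filter((c) => !c.passes(t)).map((c) => c.name);
  // Any hard failure fails the scenario; the judge score cannot rescue it.
  return { passed: failedChecks.length === 0, failedChecks, judgeScore };
}
```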

## 2. Rubric scoring second

Rubric scoring should use narrow criteria, not vague “how good was this?” prompts.

Good rubric dimensions:

- task understanding
- governance compliance
- useful progress communication
- correct delegation
- evidence of completion
- concision / unnecessary verbosity

Each rubric should be a small 0-1 or 0-2 decision, not a mushy 1-10 scale.
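
With scales this small, comparing a candidate against a baseline can stay per-dimension instead of collapsing into one number. The sketch below assumes a hypothetical `RubricScore` record and treats a dimension missing from the candidate as a regression, which is a design choice rather than something this plan prescribes.

```typescript
// Hypothetical rubric result; max is 1 (0-1 scale) or 2 (0-2 scale).
type RubricScore = { dimension: string; score: number; max: 1 | 2 };

function assertValidScore(s: RubricScore): void {
  if (Math.floor(s.score) !== s.score || s.score < 0 || s.score > s.max) {
    throw new RangeError("score for '" + s.dimension + "' must be an integer in 0.." + s.max);
  }
}

// Returns the dimensions where the candidate scored lower than the baseline.
function regressedDimensions(baseline: RubricScore[], candidate: RubricScore[]): string[] {
  const regressed: string[] = [];
  for (const b of baseline) {
    assertValidScore(b);
    const c = candidate.filter((x) => x.dimension === b.dimension)[0];
    if (c === undefined) {
      regressed.push(b.dimension); // a missing dimension counts as a regression
      continue;
    }
    assertValidScore(c);
    if (c.score < b.score) regressed.push(b.dimension);
  }
  return regressed;
}
```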

## 3. Pairwise judging for candidate vs baseline

OpenAI’s eval guidance is right that LLMs are better at discriminating between options than at open-ended generation.

So for non-deterministic quality checks, the default pattern should be:

- run baseline bundle on the case
- run candidate bundle on the same case
- ask a judge model which is better on explicit criteria
- allow `baseline`, `candidate`, or `tie`

This is better than asking a judge for an absolute quality score with no anchor.
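
Once the judge has returned one verdict per case, the results can be tallied as below. This sketch covers only the aggregation step; how the judge is prompted is out of scope, and excluding ties from the win rate is an assumption, not a rule from this plan.

```typescript
type Verdict = "baseline" | "candidate" | "tie";

type PairwiseTally = {
  baseline: number;
  candidate: number;
  tie: number;
  candidateWinRate: number | null; // over decided cases only; null if all ties
};

function tallyVerdicts(verdicts: Verdict[]): PairwiseTally {
  const counts = { baseline: 0, candidate: 0, tie: 0 };
  for (const v of verdicts) counts[v] += 1;
  const decided = counts.baseline + counts.candidate;
  return {
    baseline: counts.baseline,
    candidate: counts.candidate,
    tie: counts.tie,
    candidateWinRate: decided === 0 ? null : counts.candidate / decided,
  };
}
```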

## 4. Efficiency scoring is separate

Do not bury efficiency inside a single blended quality score.

Record it separately:

- quality score
- cost score
- latency score

Then compute a summary decision such as:

- candidate is acceptable only if quality is non-inferior and efficiency is improved

That is much easier to reason about than one magic number.
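
The example decision above might be written as follows. The `BundleScores` shape, the tolerance parameter, and the definition of "efficiency is improved" (at least one of cost or latency is lower and neither is higher) are assumptions for the sketch.

```typescript
// Hypothetical per-bundle aggregates kept on separate axes.
type BundleScores = { quality: number; costUsd: number; latencyMs: number };
type Decision = { acceptable: boolean; reasons: string[] };

function decide(baseline: BundleScores, candidate: BundleScores, qualityTolerance: number): Decision {
  const reasons: string[] = [];
  // Non-inferiority: the candidate may trail the baseline by at most the tolerance.
  if (candidate.quality < baseline.quality - qualityTolerance) {
    reasons.push("quality is inferior to baseline");
  }
  const worse = candidate.costUsd > baseline.costUsd || candidate.latencyMs > baseline.latencyMs;
  const better = candidate.costUsd < baseline.costUsd || candidate.latencyMs < baseline.latencyMs;
  if (worse || !better) reasons.push("efficiency did not improve");
  return { acceptable: reasons.length === 0, reasons };
}
```

Because each axis produces its own reason, a rejected candidate tells the author which constraint it violated instead of reporting a lower blended score.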

## Suggested Decision Rule

For PR gating:

1. No hard-check regressions.
2. No significant regression on required scenario pass rate.
3. No significant regression on key rubric dimensions.
4. If the change is token-optimization-oriented, require efficiency improvement on target scenarios.

For deeper comparison reports, show:

- pass rate
- pairwise wins/losses/ties
- median normalized tokens
- median wall-clock time
- cost deltas
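
The four PR gating rules in this section could be encoded roughly as below. The plan does not define "significant", so the sketch takes the allowed pass-rate drop as a parameter; the `GateInput` fields are hypothetical.

```typescript
// Hypothetical comparison summary fed into the gate.
type GateInput = {
  hardCheckRegressions: number; // cases that passed hard checks before and fail now
  passRateDelta: number; // candidate minus baseline, e.g. -0.1 for a 10-point drop
  regressedRubricDimensions: string[];
  isTokenOptimization: boolean;
  medianTokenDelta: number; // candidate minus baseline; negative is an improvement
};

// Returns the reasons the gate fails; an empty array means the PR may merge.
function prGateFailures(input: GateInput, maxPassRateDrop: number): string[] {
  const failures: string[] = [];
  if (input.hardCheckRegressions > 0) failures.push("hard-check regressions");
  if (input.passRateDelta < -maxPassRateDrop) failures.push("scenario pass rate regressed");
  if (input.regressedRubricDimensions.length > 0) {
    failures.push("rubric regressed: " + input.regressedRubricDimensions.join(", "));
  }
  if (input.isTokenOptimization && input.medianTokenDelta >= 0) {
    failures.push("token optimization did not reduce median tokens");
  }
  return failures;
}
```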

## Dataset Strategy

We should explicitly build the dataset from three sources.

### 1. Hand-authored seed cases

Start here.

These should cover core product invariants:

- assignment pickup
- status update
- blocked reporting
- delegation
- approval request
- cross-company access denial
- issue comment follow-up

These are small, clear, and stable.

### 2. Production-derived cases

Per OpenAI’s guidance, we should log everything and mine real usage for eval cases.

Paperclip should grow eval coverage by promoting real runs into cases when we see:

- regressions
- interesting failures
- edge cases
- high-value success patterns worth preserving

The initial version can be manual:

- take a real run
- redact/normalize it
- convert it into an `EvalCase`

Later we can automate trace-to-case generation.
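
A first automation step might look like the sketch below, which redacts comment text and wraps a recorded run into a case draft. The `RecordedRun` and `CaseDraft` shapes are hypothetical, and the two redaction patterns (email addresses and a few token-like prefixes) are illustrative only; real redaction would need a reviewed pattern list.

```typescript
type Trigger = "assignment" | "timer" | "on_demand" | "comment" | "approval";

// Hypothetical shapes for a recorded run and the draft it is promoted into.
type RecordedRun = { runId: string; agentId: string; trigger: Trigger; comments: string[] };
type CaseDraft = {
  id: string;
  description: string;
  tags: string[];
  setup: { fixture: string; agentId: string; trigger: Trigger };
  inputs: { comments: string[] };
};

// Illustrative redaction only: emails and a few token-like prefixes.
function redact(text: string): string {
  return text
    .replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "<email>")
    .replace(/\b(?:sk|pk|ghp)_[A-Za-z0-9_-]{8,}\b/g, "<secret>");
}

function promoteRun(run: RecordedRun, fixture: string, reason: string): CaseDraft {
  return {
    id: "production." + run.runId,
    description: "Promoted from a real run: " + reason,
    tags: ["production-derived"],
    setup: { fixture, agentId: run.agentId, trigger: run.trigger },
    inputs: { comments: run.comments.map(redact) },
  };
}
```

The draft deliberately has no checks: a person still decides which invariants the promoted run should assert before it joins the suite.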

### 3. Adversarial and guardrail cases

These should intentionally probe failure modes:

- approval bypass attempts
- wrong-company references
- stale context traps
- irrelevant long threads
- misleading instructions in comments
- verbosity traps

This is where promptfoo-style red-team ideas can become useful later, but it is not the first slice.

## Repo Layout

Recommended initial layout:

```text
evals/
  README.md
  promptfoo/
    promptfooconfig.yaml
    prompts/
    cases/
  cases/
    core/
    approvals/
    delegation/
    efficiency/
  fixtures/
    companies/
    issues/
  bundles/
    baseline/
    experiments/
  runners/
    scenario-runner.ts
    compare-runner.ts
  scorers/
    hard/
    rubric/
    pairwise/
  judges/
    rubric-judge.ts
    pairwise-judge.ts
  lib/
    types.ts
    traces.ts
    metrics.ts
  reports/
    .gitignore
```

Why top-level `evals/`:

- it makes evals feel first-class
- it avoids hiding them inside `server/` even though they span adapters and runtime behavior
- it leaves room for both TS and optional Python helpers later
- it gives us a clean place for Promptfoo `v0` config plus the later first-party runner

## Execution Model

The harness should support three modes.

### Mode A: Cheap local smoke

Purpose:

- run on PRs
- keep cost low
- catch obvious regressions

Characteristics:

- 5 to 20 cases
- 1 or 2 bundles
- mostly hard checks and narrow rubrics

### Mode B: Candidate vs baseline compare

Purpose:

- evaluate a prompt/skill/model change before merge

Characteristics:

- paired runs
- pairwise judging enabled
- quality + efficiency diff report

### Mode C: Nightly broader matrix

Purpose:

- compare multiple models and bundles
- grow historical benchmark data

Characteristics:

- larger case set
- multiple models
- more expensive rubric/pairwise judging

## CI and Developer Workflow

Suggested commands:

```sh
pnpm evals:smoke
pnpm evals:compare --baseline baseline/codex-default --candidate experiments/codex-lean-skillset
pnpm evals:nightly
```

PR behavior:

- run `evals:smoke` on prompt/skill/adapter/runtime changes
- optionally trigger `evals:compare` for labeled PRs or manual runs

Nightly behavior:

- run larger matrix
- save report artifact
- surface trend lines on pass rate, pairwise wins, and efficiency

## Framework Comparison

## Promptfoo

Best use for Paperclip:

- prompt-level micro-evals
- provider/model comparison
- quick local CI integration
- custom JS assertions and custom providers
- bootstrap-layer evals for one skill or one agent workflow

What changed in this recommendation:

- Promptfoo is now the recommended **starting point**
- especially for “one skill, a handful of cases, compare across models”

Why it still should not be the only long-term system:

- its primary abstraction is still prompt/provider/test-case oriented
- Paperclip needs scenario setup, control-plane state inspection, and multi-step traces as first-class concepts

Recommendation:

- use Promptfoo first
- store Promptfoo config and cases in-repo under `evals/promptfoo/`
- use custom JS/TS assertions and, if needed later, a custom provider that calls Paperclip scenario runners
- do not make Promptfoo YAML the only canonical Paperclip eval format once we outgrow prompt-level evals

## LangSmith

What it gets right:

- final response evals
- trajectory evals
- single-step evals

Why not the primary system today:

- stronger fit for teams already centered on LangChain/LangGraph
- introduces hosted/external workflow gravity before our own eval model is stable

Recommendation:

- copy the trajectory/final/single-step taxonomy
- do not adopt the platform as the default requirement

## Braintrust

What it gets right:

- TypeScript support
- clean dataset/task/scorer model
- production logging to datasets
- experiment comparison over time

Why not the primary system today:

- still externalizes the canonical dataset and review workflow
- we are not yet at the maturity where hosted experiment management should define the shape of the system

Recommendation:

- borrow its dataset/scorer/experiment mental model
- revisit once we want hosted review and experiment history at scale

## OpenAI Evals / Evals API

What it gets right:

- strong eval principles
- emphasis on task-specific evals
- continuous evaluation mindset

Why not the primary system:

- Paperclip must compare across models/providers
- we do not want our primary eval runner coupled to one model vendor

Recommendation:

- use the guidance
- do not use it as the core Paperclip eval runtime

## First Implementation Slice

The first version should be intentionally small.

## Phase 0: Promptfoo bootstrap

Build:

- `evals/promptfoo/promptfooconfig.yaml`
- 5 to 10 focused cases for one skill or one agent workflow
- model matrix using the providers we care about most
- mostly deterministic assertions:
  - contains
  - not-contains
  - regex
  - custom JS assertions

Target scope:

- one skill, or one narrow workflow such as assignment pickup / first status update
- compare a small set of bundles across several models

Success criteria:

- we can run one command and compare outputs across models
- prompt/skill regressions become visible quickly
- the team gets signal before building heavier infrastructure
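
For the custom JS assertions in the Build list above, the assertion logic itself can be a plain function over the model output. The sketch below checks for required structured fields and returns a pass/score/reason object, which matches the general result shape Promptfoo's JavaScript assertions accept. The required field names (`issueId`, `comment`) are hypothetical, and how the function is exported and referenced from `promptfooconfig.yaml` should be confirmed against the Promptfoo docs.

```typescript
// Result shape modeled on a pass/score/reason grading result.
type GradingResult = { pass: boolean; score: number; reason: string };

// Field names are assumptions; adjust to the real status-update schema.
const REQUIRED_FIELDS = ["issueId", "comment"];

function checkStructuredOutput(output: string): GradingResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return { pass: false, score: 0, reason: "output is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { pass: false, score: 0, reason: "output is not a JSON object" };
  }
  const record = parsed as Record<string, unknown>;
  // A field counts as present only if it is a non-empty string.
  const missing = REQUIRED_FIELDS.filter((k) => typeof record[k] !== "string" || record[k] === "");
  if (missing.length > 0) {
    return { pass: false, score: 0, reason: "missing fields: " + missing.join(", ") };
  }
  return { pass: true, score: 1, reason: "all required fields present" };
}
```

Keeping the logic in a standalone function means the same check can later be reused as a hard check in the first-party harness.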

## Phase 1: Skeleton and core cases

Build:

- `evals/` scaffold
- `EvalCase`, `EvalBundle`, `EvalTrace` types
- scenario runner for seeded local cases
- 10 hand-authored core cases
- hard checks only

Target cases:

- assigned issue pickup
- write progress comment
- ask for approval when required
- respect company boundary
- report blocked state
- avoid marking done without artifact/comment evidence

Success criteria:

- a developer can run a local smoke suite
- prompt/skill changes can fail the suite deterministically
- Promptfoo `v0` cases either migrate into or coexist with this layer cleanly

## Phase 2: Pairwise and rubric layer

Build:

- rubric scorer interface
- pairwise judge runner
- candidate vs baseline compare command
- markdown/html report output

Success criteria:

- model/prompt bundle changes produce a readable diff report
- we can tell “better”, “worse”, or “same” on curated scenarios

## Phase 3: Efficiency integration

Build:

- normalized token/cost metrics recorded in eval traces
- cost and latency comparisons
- efficiency gates for token optimization work

Dependency:

- this should align with the telemetry normalization work in `2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`

Success criteria:

- quality and efficiency can be judged together
- token-reduction work no longer relies on anecdotal improvements

## Phase 4: Production-case ingestion

Build:

- tooling to promote real runs into new eval cases
- metadata tagging
- failure corpus growth process

Success criteria:

- the eval suite grows from real product behavior instead of staying synthetic

## Initial Case Categories

We should start with these categories:

1. `core.assignment_pickup`
2. `core.progress_update`
3. `core.blocked_reporting`
4. `governance.approval_required`
5. `governance.company_boundary`
6. `delegation.correct_report`
7. `threads.long_context_followup`
8. `efficiency.no_unnecessary_reloads`

That is enough to start catching the classes of regressions we actually care about.

## Important Guardrails

### 1. Do not rely on judge models alone

Every important scenario needs deterministic checks first.

### 2. Do not gate PRs on a single noisy score

Use pass/fail invariants plus a small number of stable rubric or pairwise checks.

### 3. Do not confuse benchmark score with product quality

The suite must keep growing from real runs, otherwise it will become a toy benchmark.

### 4. Do not evaluate only final output

Trajectory matters for agents:

- did they call the right Paperclip APIs?
- did they ask for approval?
- did they communicate progress?
- did they choose the right issue?

### 5. Do not make the framework vendor-shaped

Our eval model should survive changes in:

- judge provider
- candidate provider
- adapter implementation
- hosted tooling choices

## Open Questions

1. Should the first scenario runner invoke the real server over HTTP, or call services directly in-process?
   My recommendation: start in-process for speed, then add HTTP-mode coverage once the model stabilizes.

2. Should we support Python scorers in v1?
   My recommendation: no. Keep v1 all-TypeScript.

3. Should we commit baseline outputs?
   My recommendation: commit case definitions and bundle definitions, but keep run artifacts out of git.

4. Should we add hosted experiment tracking immediately?
   My recommendation: no. Revisit after the local harness proves useful.

## Final Recommendation

Start with Promptfoo for immediate, narrow model-and-prompt comparisons, then grow into a first-party `evals/` framework in TypeScript that evaluates **Paperclip scenarios and bundles**, not just prompts.

Use this structure:

- Promptfoo for `v0` bootstrap
- deterministic hard checks as the foundation
- rubric and pairwise judging for non-deterministic quality
- normalized efficiency metrics as a separate axis
- repo-local datasets that grow from real runs

Use external tools selectively:

- Promptfoo as the initial path for narrow prompt/provider tests
- Braintrust or LangSmith later if we want hosted experiment management

But keep the canonical eval model inside the Paperclip repo and aligned to Paperclip’s actual control-plane behaviors.