Merge public-gh/master into feature/plugin-runtime-instance-cleanup
This commit is contained in:

@@ -142,7 +142,7 @@ This command:

- creates an isolated instance under `~/.paperclip-worktrees/instances/<worktree-id>/`
- when run inside a linked git worktree, mirrors the effective git hooks into that worktree's private git dir
- picks a free app port and embedded PostgreSQL port
- by default seeds the isolated DB in `minimal` mode from your main instance via a logical SQL snapshot
- by default seeds the isolated DB in `minimal` mode from the current effective Paperclip instance/config (repo-local worktree config when present, otherwise the default instance) via a logical SQL snapshot

Seed modes:
@@ -330,6 +330,34 @@ Operational policy:

- `asset_id` uuid fk not null
- `issue_comment_id` uuid fk null

## 7.15 `documents` + `document_revisions` + `issue_documents`

- `documents` stores editable text-first documents:
  - `id` uuid pk
  - `company_id` uuid fk not null
  - `title` text null
  - `format` text not null (`markdown`)
  - `latest_body` text not null
  - `latest_revision_id` uuid null
  - `latest_revision_number` int not null
  - `created_by_agent_id` uuid fk null
  - `created_by_user_id` uuid/text fk null
  - `updated_by_agent_id` uuid fk null
  - `updated_by_user_id` uuid/text fk null
- `document_revisions` stores append-only history:
  - `id` uuid pk
  - `company_id` uuid fk not null
  - `document_id` uuid fk not null
  - `revision_number` int not null
  - `body` text not null
  - `change_summary` text null
- `issue_documents` links documents to issues with a stable workflow key:
  - `id` uuid pk
  - `company_id` uuid fk not null
  - `issue_id` uuid fk not null
  - `document_id` uuid fk not null
  - `key` text not null (`plan`, `design`, `notes`, etc.)

## 8. State Machines

## 8.1 Agent Status
@@ -441,6 +469,11 @@ All endpoints are under `/api` and return JSON.

- `POST /companies/:companyId/issues`
- `GET /issues/:issueId`
- `PATCH /issues/:issueId`
- `GET /issues/:issueId/documents`
- `GET /issues/:issueId/documents/:key`
- `PUT /issues/:issueId/documents/:key`
- `GET /issues/:issueId/documents/:key/revisions`
- `DELETE /issues/:issueId/documents/:key`
- `POST /issues/:issueId/checkout`
- `POST /issues/:issueId/release`
- `POST /issues/:issueId/comments`
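As a small sketch of how a client might address the keyed document routes above (only the route shapes come from this spec; the helper names are made up for illustration):

```typescript
// Hypothetical path builders for the keyed issue-document endpoints.
// The route shapes match the endpoint list; the function names are illustrative.
export const issueDocumentPath = (issueId: string, key: string): string =>
  `/api/issues/${encodeURIComponent(issueId)}/documents/${encodeURIComponent(key)}`;

export const issueDocumentRevisionsPath = (issueId: string, key: string): string =>
  `${issueDocumentPath(issueId, key)}/revisions`;
```

Encoding the `key` segment keeps workflow keys like `plan` or `design` safe even if a future key contains reserved characters.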
@@ -118,10 +118,18 @@ Result:

Local adapters inject repo skills into runtime skill directories.

Important `codex_local` nuance:

- Codex does not read skills directly from the active worktree.
- Paperclip discovers repo skills from the current checkout, then symlinks them into `$CODEX_HOME/skills` or `~/.codex/skills`.
- If an existing Paperclip skill symlink already points at another live checkout, the current implementation skips it instead of repointing it.
- This can leave Codex using stale skill content from a different worktree even after Paperclip-side skill changes land.
- That is both a correctness risk and a token-analysis risk, because runtime behavior may not reflect the instructions in the checkout being tested.
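A minimal sketch of the repoint-instead-of-skip behavior argued for above, using Node's `fs` API (the helper name and return values are hypothetical, not current Paperclip code; targets are assumed to be absolute paths):

```typescript
import { readlinkSync, symlinkSync, unlinkSync } from "node:fs";
import { resolve } from "node:path";

// Hypothetical helper: make a Paperclip-owned skill symlink point at the
// active checkout, repointing stale links instead of skipping them.
export function ensureSkillSymlink(
  linkPath: string,
  desiredTarget: string,
): "kept" | "repointed" | "created" {
  try {
    const current = resolve(readlinkSync(linkPath));
    if (current === resolve(desiredTarget)) return "kept";
    unlinkSync(linkPath); // stale: points at a different checkout
    symlinkSync(desiredTarget, linkPath);
    return "repointed";
  } catch {
    // no readable symlink yet: create it fresh
    symlinkSync(desiredTarget, linkPath);
    return "created";
  }
}
```

A real implementation would also verify the link is Paperclip-owned before unlinking, so user-managed skills are never touched.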
Current repo skill sizes:

- `skills/paperclip/SKILL.md`: 17,441 bytes
- `skills/create-agent-adapter/SKILL.md`: 31,832 bytes
- `.agents/skills/create-agent-adapter/SKILL.md`: 31,832 bytes
- `skills/paperclip-create-agent/SKILL.md`: 4,718 bytes
- `skills/para-memory-files/SKILL.md`: 3,978 bytes
@@ -215,6 +223,8 @@ This is the right version of the discussion’s bootstrap idea.

Static instructions and dynamic wake context have different cache behavior and should be modeled separately.

For `codex_local`, this also requires isolating the Codex skill home per worktree or teaching Paperclip to repoint its own skill symlinks when the source checkout changes. Otherwise prompt and skill improvements in the active worktree may not reach the running agent.

### Success criteria

- fresh-session prompts can remain richer without inflating every resumed heartbeat

@@ -305,6 +315,9 @@ Even when reuse is desirable, some sessions become too expensive to keep alive i

- `para-memory-files`
- `create-agent-adapter`
- Expose active skill set in agent config and run metadata.
- For `codex_local`, either:
  - run with a worktree-specific `CODEX_HOME`, or
  - treat Paperclip-owned Codex skill symlinks as repairable when they point at a different checkout

### Why

@@ -363,6 +376,7 @@ Initial targets:

6. Rewrite `skills/paperclip/SKILL.md` around delta-fetch behavior.
7. Add session rotation with carry-forward summaries.
8. Replace global skill injection with explicit allowlists.
9. Fix `codex_local` skill resolution so worktree-local skill changes reliably reach the runtime.

## Recommendation
775 doc/plans/2026-03-13-agent-evals-framework.md Normal file
@@ -0,0 +1,775 @@
# Agent Evals Framework Plan

Date: 2026-03-13

## Context

We need evals for the thing Paperclip actually ships:

- agent behavior produced by adapter config
- prompt templates and bootstrap prompts
- skill sets and skill instructions
- model choice
- runtime policy choices that affect outcomes and cost

We do **not** primarily need a fine-tuning pipeline.
We need a regression framework that can answer:

- if we change prompts or skills, do agents still do the right thing?
- if we switch models, what got better, worse, or more expensive?
- if we optimize tokens, did we preserve task outcomes?
- can we grow the suite over time from real Paperclip usage?

This plan is based on:

- `doc/GOAL.md`
- `doc/PRODUCT.md`
- `doc/SPEC-implementation.md`
- `docs/agents-runtime.md`
- `doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`
- Discussion #449: <https://github.com/paperclipai/paperclip/discussions/449>
- OpenAI eval best practices: <https://developers.openai.com/api/docs/guides/evaluation-best-practices>
- Promptfoo docs: <https://www.promptfoo.dev/docs/configuration/test-cases/> and <https://www.promptfoo.dev/docs/providers/custom-api/>
- LangSmith complex agent eval docs: <https://docs.langchain.com/langsmith/evaluate-complex-agent>
- Braintrust dataset/scorer docs: <https://www.braintrust.dev/docs/annotate/datasets> and <https://www.braintrust.dev/docs/evaluate/write-scorers>
## Recommendation

Paperclip should take a **two-stage approach**:

1. **Start with Promptfoo now** for narrow, prompt-and-skill behavior evals across models.
2. **Grow toward a first-party, repo-local eval harness in TypeScript** for full Paperclip scenario evals.

So the recommendation is no longer “skip Promptfoo.” It is:

- use Promptfoo as the fastest bootstrap layer
- keep eval cases and fixtures in this repo
- avoid making Promptfoo config the deepest long-term abstraction

More specifically:

1. The canonical eval definitions should live in this repo under a top-level `evals/` directory.
2. `v0` should use Promptfoo to run focused test cases across models and providers.
3. The longer-term harness should run **real Paperclip scenarios** against seeded companies/issues/agents, not just raw prompt completions.
4. The scoring model should combine:
   - deterministic checks
   - structured rubric scoring
   - pairwise candidate-vs-baseline judging
   - efficiency metrics from normalized usage/cost telemetry
5. The framework should compare **bundles**, not just models.

A bundle is:

- adapter type
- model id
- prompt template(s)
- bootstrap prompt template
- skill allowlist / skill content version
- relevant runtime flags

That is the right unit because that is what actually changes behavior in Paperclip.
## Why This Is The Right Shape

### 1. We need to evaluate system behavior, not only prompt output

Prompt-only tools are useful, but Paperclip’s real failure modes are often:

- wrong issue chosen
- wrong API call sequence
- bad delegation
- failure to respect approval boundaries
- stale session behavior
- over-reading context
- claiming completion without producing artifacts or comments

Those are control-plane behaviors. They require scenario setup, execution, and trace inspection.

### 2. The repo is already TypeScript-first

The existing monorepo already uses:

- `pnpm`
- `tsx`
- `vitest`
- TypeScript across server, UI, shared contracts, and adapters

A TypeScript-first harness will fit the repo and CI better than introducing a Python-first test subsystem as the default path.

Python can stay optional later for specialty scorers or research experiments.

### 3. We need provider/model comparison without vendor lock-in

OpenAI’s guidance is directionally right:

- eval early and often
- use task-specific evals
- log everything
- prefer pairwise/comparison-style judging over open-ended scoring

But OpenAI’s Evals API is not the right control plane for Paperclip as the primary system because our target is explicitly multi-model and multi-provider.

### 4. Hosted eval products are useful, and Promptfoo is the right bootstrap tool

The current tradeoff:

- Promptfoo is very good for local, repo-based prompt/provider matrices and CI integration.
- LangSmith is strong on trajectory-style agent evals.
- Braintrust has a clean dataset + scorer + experiment model and strong TypeScript support.

The community suggestion is directionally right:

- Promptfoo lets us start small
- it supports simple assertions like contains / not-contains / regex / custom JS
- it can run the same cases across multiple models
- it supports OpenRouter
- it can move into CI later

That makes it the best `v0` tool for “did this prompt/skill/model change obviously regress?”

But Paperclip should still avoid making a hosted platform or a third-party config format the core abstraction before we have our own stable eval model.

The right move is:

- start with Promptfoo for quick wins
- keep the data portable and repo-owned
- build a thin first-party harness around Paperclip concepts as the system grows
- optionally export to or integrate with other tools later if useful
## What We Should Evaluate

We should split evals into four layers.

### Layer 1: Deterministic contract evals

These should require no judge model.

Examples:

- agent comments on the assigned issue
- no mutation outside the agent’s company
- approval-required actions do not bypass approval flow
- task transitions are legal
- output contains required structured fields
- artifact links exist when the task required an artifact
- no full-thread refetch on delta-only cases once the API supports it

These are cheap, reliable, and should be the first line of defense.

### Layer 2: Single-step behavior evals

These test narrow behaviors in isolation.

Examples:

- chooses the correct issue from inbox
- writes a reasonable first status comment
- decides to ask for approval instead of acting directly
- delegates to the correct report
- recognizes blocked state and reports it clearly

These are the closest thing to prompt evals, but still framed in Paperclip terms.

### Layer 3: End-to-end scenario evals

These run a full heartbeat or short sequence of heartbeats against a seeded scenario.

Examples:

- new assignment pickup
- long-thread continuation
- mention-triggered clarification
- approval-gated hire request
- manager escalation
- workspace coding task that must leave a meaningful issue update

These should evaluate both final state and trace quality.

### Layer 4: Efficiency and regression evals

These are not “did the answer look good?” evals. They are “did we preserve quality while improving cost/latency?” evals.

Examples:

- normalized input tokens per successful heartbeat
- normalized tokens per completed issue
- session reuse rate
- full-thread reload rate
- wall-clock duration
- cost per successful scenario

This layer is especially important for token optimization work.
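To make one of the efficiency metrics concrete, here is a minimal sketch of “normalized input tokens per successful heartbeat” (the `HeartbeatRun` field names are assumptions, not an existing Paperclip telemetry schema):

```typescript
// Hypothetical per-run telemetry slice; real fields may differ.
type HeartbeatRun = { inputTokens: number; succeeded: boolean };

// Mean input tokens across successful heartbeats only, so failed runs
// cannot make a bundle look artificially cheap.
export function tokensPerSuccessfulHeartbeat(runs: HeartbeatRun[]): number {
  const successes = runs.filter((r) => r.succeeded);
  if (successes.length === 0) return NaN; // undefined with no successes
  const total = successes.reduce((sum, r) => sum + r.inputTokens, 0);
  return total / successes.length;
}
```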
## Core Design

## 1. Canonical object: `EvalCase`

Each eval case should define:

- scenario setup
- target bundle(s)
- execution mode
- expected invariants
- scoring rubric
- tags/metadata

Suggested shape:

```ts
type EvalCase = {
  id: string;
  description: string;
  tags: string[];
  setup: {
    fixture: string;
    agentId: string;
    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
  };
  inputs?: Record<string, unknown>;
  checks: {
    hard: HardCheck[];
    rubric?: RubricCheck[];
    pairwise?: PairwiseCheck[];
  };
  metrics: MetricSpec[];
};
```

The important part is that the case is about a Paperclip scenario, not a standalone prompt string.
## 2. Canonical object: `EvalBundle`

Suggested shape:

```ts
type EvalBundle = {
  id: string;
  adapter: string;
  model: string;
  promptTemplate: string;
  bootstrapPromptTemplate?: string;
  skills: string[];
  flags?: Record<string, string | number | boolean>;
};
```

Every comparison run should say which bundle was tested.

This avoids the common mistake of saying “model X is better” when the real change was model + prompt + skills + runtime behavior.
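For illustration, a baseline bundle under this shape might look like the following (the id and skill names echo examples used elsewhere in this plan; the model and template names are made up):

```typescript
// Hypothetical baseline bundle instance; model and template names are examples only.
const baseline = {
  id: "baseline/codex-default",
  adapter: "codex_local",
  model: "example-model-id",
  promptTemplate: "heartbeat-v1",
  bootstrapPromptTemplate: "bootstrap-v1",
  skills: ["paperclip", "para-memory-files"],
  flags: { sessionReuse: true },
};
```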
## 3. Canonical output: `EvalTrace`

We should capture a normalized trace for scoring:

- run ids
- prompts actually sent
- session reuse metadata
- issue mutations
- comments created
- approvals requested
- artifacts created
- token/cost telemetry
- timing
- raw outputs

The scorer layer should never need to scrape ad hoc logs.
## Scoring Framework

## 1. Hard checks first

Every eval should start with pass/fail checks that can invalidate the run immediately.

Examples:

- touched wrong company
- skipped required approval
- no issue update produced
- returned malformed structured output
- marked task done without required artifact

If a hard check fails, the scenario fails regardless of style or judge score.
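A minimal sketch of what one such hard check could look like against a normalized trace (the trace fields used here are assumptions drawn from the trace contents listed in this plan, not a finished `EvalTrace` definition):

```typescript
// Hypothetical trace slice: only the fields this particular check needs.
type EvalTrace = { companyId: string; mutations: { companyId: string }[] };
type HardCheck = (trace: EvalTrace) => { pass: boolean; reason?: string };

// Fails the run immediately if any mutation touched another company.
const staysInOwnCompany: HardCheck = (trace) => {
  const offender = trace.mutations.find((m) => m.companyId !== trace.companyId);
  return offender
    ? { pass: false, reason: `mutation touched company ${offender.companyId}` }
    : { pass: true };
};
```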
## 2. Rubric scoring second

Rubric scoring should use narrow criteria, not vague “how good was this?” prompts.

Good rubric dimensions:

- task understanding
- governance compliance
- useful progress communication
- correct delegation
- evidence of completion
- concision / unnecessary verbosity

Each rubric should be a small 0-1 or 0-2 decision, not a mushy 1-10 scale.
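Under that constraint, a rubric result can stay this small (a sketch; the type and function names are assumptions):

```typescript
// Each rubric dimension yields a tiny discrete decision, never a 1-10 score.
type RubricResult = { criterion: string; score: 0 | 1 | 2; rationale: string };

// Aggregate as a fraction of available points, keeping per-criterion detail around.
export function rubricFraction(results: RubricResult[], maxPerCriterion: 1 | 2): number {
  if (results.length === 0) return 0;
  const earned = results.reduce((sum, r) => sum + r.score, 0);
  return earned / (results.length * maxPerCriterion);
}
```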
## 3. Pairwise judging for candidate vs baseline

OpenAI’s eval guidance is right that LLMs are better at discrimination than open-ended generation.

So for non-deterministic quality checks, the default pattern should be:

- run baseline bundle on the case
- run candidate bundle on the same case
- ask a judge model which is better on explicit criteria
- allow `baseline`, `candidate`, or `tie`

This is better than asking a judge for an absolute quality score with no anchor.
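Aggregating those three-way verdicts is straightforward; a sketch (the `Verdict` type mirrors the allowed outcomes above, everything else is illustrative):

```typescript
type Verdict = "baseline" | "candidate" | "tie";

// Tally judge verdicts across cases into a wins/losses/ties summary,
// counted from the candidate's point of view.
export function tallyVerdicts(verdicts: Verdict[]) {
  const counts = { wins: 0, losses: 0, ties: 0 };
  for (const v of verdicts) {
    if (v === "candidate") counts.wins++;
    else if (v === "baseline") counts.losses++;
    else counts.ties++;
  }
  return counts;
}
```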
## 4. Efficiency scoring is separate

Do not bury efficiency inside a single blended quality score.

Record it separately:

- quality score
- cost score
- latency score

Then compute a summary decision such as:

- candidate is acceptable only if quality is non-inferior and efficiency is improved

That is much easier to reason about than one magic number.
## Suggested Decision Rule

For PR gating:

1. No hard-check regressions.
2. No significant regression on required scenario pass rate.
3. No significant regression on key rubric dimensions.
4. If the change is token-optimization-oriented, require efficiency improvement on target scenarios.
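A sketch of that gate as code (field names and thresholds are assumptions; "significant" would need a real tolerance-based or statistical definition in practice):

```typescript
// Hypothetical per-bundle summary; a real one would come from eval reports.
type RunSummary = { hardCheckFailures: number; passRate: number; medianTokens: number };

export function gatePR(
  baseline: RunSummary,
  candidate: RunSummary,
  opts: { passRateTolerance: number; requireEfficiencyGain: boolean },
): boolean {
  // Rule 1: no new hard-check failures at all.
  if (candidate.hardCheckFailures > baseline.hardCheckFailures) return false;
  // Rule 2: pass rate may not drop beyond the configured tolerance.
  if (candidate.passRate < baseline.passRate - opts.passRateTolerance) return false;
  // Rule 4: token-optimization changes must actually reduce median tokens.
  if (opts.requireEfficiencyGain && candidate.medianTokens >= baseline.medianTokens) return false;
  return true;
}
```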
For deeper comparison reports, show:

- pass rate
- pairwise wins/losses/ties
- median normalized tokens
- median wall-clock time
- cost deltas
## Dataset Strategy

We should explicitly build the dataset from three sources.

### 1. Hand-authored seed cases

Start here.

These should cover core product invariants:

- assignment pickup
- status update
- blocked reporting
- delegation
- approval request
- cross-company access denial
- issue comment follow-up

These are small, clear, and stable.

### 2. Production-derived cases

Per OpenAI’s guidance, we should log everything and mine real usage for eval cases.

Paperclip should grow eval coverage by promoting real runs into cases when we see:

- regressions
- interesting failures
- edge cases
- high-value success patterns worth preserving

The initial version can be manual:

- take a real run
- redact/normalize it
- convert it into an `EvalCase`

Later we can automate trace-to-case generation.

### 3. Adversarial and guardrail cases

These should intentionally probe failure modes:

- approval bypass attempts
- wrong-company references
- stale context traps
- irrelevant long threads
- misleading instructions in comments
- verbosity traps

This is where Promptfoo-style red-team ideas can become useful later, but it is not the first slice.
## Repo Layout

Recommended initial layout:

```text
evals/
  README.md
  promptfoo/
    promptfooconfig.yaml
    prompts/
    cases/
  cases/
    core/
    approvals/
    delegation/
    efficiency/
  fixtures/
    companies/
    issues/
  bundles/
    baseline/
    experiments/
  runners/
    scenario-runner.ts
    compare-runner.ts
  scorers/
    hard/
    rubric/
    pairwise/
  judges/
    rubric-judge.ts
    pairwise-judge.ts
  lib/
    types.ts
    traces.ts
    metrics.ts
  reports/
    .gitignore
```

Why top-level `evals/`:

- it makes evals feel first-class
- it avoids hiding them inside `server/` even though they span adapters and runtime behavior
- it leaves room for both TS and optional Python helpers later
- it gives us a clean place for Promptfoo `v0` config plus the later first-party runner
## Execution Model

The harness should support three modes.

### Mode A: Cheap local smoke

Purpose:

- run on PRs
- keep cost low
- catch obvious regressions

Characteristics:

- 5 to 20 cases
- 1 or 2 bundles
- mostly hard checks and narrow rubrics

### Mode B: Candidate vs baseline compare

Purpose:

- evaluate a prompt/skill/model change before merge

Characteristics:

- paired runs
- pairwise judging enabled
- quality + efficiency diff report

### Mode C: Nightly broader matrix

Purpose:

- compare multiple models and bundles
- grow historical benchmark data

Characteristics:

- larger case set
- multiple models
- more expensive rubric/pairwise judging
## CI and Developer Workflow

Suggested commands:

```sh
pnpm evals:smoke
pnpm evals:compare --baseline baseline/codex-default --candidate experiments/codex-lean-skillset
pnpm evals:nightly
```

PR behavior:

- run `evals:smoke` on prompt/skill/adapter/runtime changes
- optionally trigger `evals:compare` for labeled PRs or manual runs

Nightly behavior:

- run larger matrix
- save report artifact
- surface trend lines on pass rate, pairwise wins, and efficiency
## Framework Comparison

## Promptfoo

Best use for Paperclip:

- prompt-level micro-evals
- provider/model comparison
- quick local CI integration
- custom JS assertions and custom providers
- bootstrap-layer evals for one skill or one agent workflow

What changed in this recommendation:

- Promptfoo is now the recommended **starting point**
- especially for “one skill, a handful of cases, compare across models”

Why it still should not be the only long-term system:

- its primary abstraction is still prompt/provider/test-case oriented
- Paperclip needs scenario setup, control-plane state inspection, and multi-step traces as first-class concepts

Recommendation:

- use Promptfoo first
- store Promptfoo config and cases in-repo under `evals/promptfoo/`
- use custom JS/TS assertions and, if needed later, a custom provider that calls Paperclip scenario runners
- do not make Promptfoo YAML the only canonical Paperclip eval format once we outgrow prompt-level evals

## LangSmith

What it gets right:

- final response evals
- trajectory evals
- single-step evals

Why not the primary system today:

- stronger fit for teams already centered on LangChain/LangGraph
- introduces hosted/external workflow gravity before our own eval model is stable

Recommendation:

- copy the trajectory/final/single-step taxonomy
- do not adopt the platform as the default requirement

## Braintrust

What it gets right:

- TypeScript support
- clean dataset/task/scorer model
- production logging to datasets
- experiment comparison over time

Why not the primary system today:

- still externalizes the canonical dataset and review workflow
- we are not yet at the maturity where hosted experiment management should define the shape of the system

Recommendation:

- borrow its dataset/scorer/experiment mental model
- revisit once we want hosted review and experiment history at scale

## OpenAI Evals / Evals API

What it gets right:

- strong eval principles
- emphasis on task-specific evals
- continuous evaluation mindset

Why not the primary system:

- Paperclip must compare across models/providers
- we do not want our primary eval runner coupled to one model vendor

Recommendation:

- use the guidance
- do not use it as the core Paperclip eval runtime
## First Implementation Slice

The first version should be intentionally small.

## Phase 0: Promptfoo bootstrap

Build:

- `evals/promptfoo/promptfooconfig.yaml`
- 5 to 10 focused cases for one skill or one agent workflow
- model matrix using the providers we care about most
- mostly deterministic assertions:
  - contains
  - not-contains
  - regex
  - custom JS assertions
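As an example of the custom-assertion layer, a Promptfoo-style custom assertion can be a plain exported function; this sketch assumes an `(output, context)` signature and a pass/score/reason result shape, and the check itself (mention the issue id from test vars) is illustrative:

```typescript
// Sketch of a custom assertion: the agent's reply must mention the issue id
// passed in via the test case's vars. Shapes here follow Promptfoo's custom
// JS assertion pattern, but field names should be checked against its docs.
type GradingResult = { pass: boolean; score: number; reason: string };

export function assertMentionsIssue(
  output: string,
  context: { vars: Record<string, string> },
): GradingResult {
  const issueId = context.vars.issueId;
  const pass = output.includes(issueId);
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? "mentions assigned issue" : `missing issue id ${issueId}`,
  };
}
```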
Target scope:

- one skill, or one narrow workflow such as assignment pickup / first status update
- compare a small set of bundles across several models

Success criteria:

- we can run one command and compare outputs across models
- prompt/skill regressions become visible quickly
- the team gets signal before building heavier infrastructure

## Phase 1: Skeleton and core cases

Build:

- `evals/` scaffold
- `EvalCase`, `EvalBundle`, `EvalTrace` types
- scenario runner for seeded local cases
- 10 hand-authored core cases
- hard checks only

Target cases:

- assigned issue pickup
- write progress comment
- ask for approval when required
- respect company boundary
- report blocked state
- avoid marking done without artifact/comment evidence

Success criteria:

- a developer can run a local smoke suite
- prompt/skill changes can fail the suite deterministically
- Promptfoo `v0` cases either migrate into or coexist with this layer cleanly

## Phase 2: Pairwise and rubric layer

Build:

- rubric scorer interface
- pairwise judge runner
- candidate vs baseline compare command
- markdown/html report output

Success criteria:

- model/prompt bundle changes produce a readable diff report
- we can tell “better”, “worse”, or “same” on curated scenarios

## Phase 3: Efficiency integration

Build:

- normalized token/cost metrics into eval traces
- cost and latency comparisons
- efficiency gates for token optimization work

Dependency:

- this should align with the telemetry normalization work in `2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`

Success criteria:

- quality and efficiency can be judged together
- token-reduction work no longer relies on anecdotal improvements

## Phase 4: Production-case ingestion

Build:

- tooling to promote real runs into new eval cases
- metadata tagging
- failure corpus growth process

Success criteria:

- the eval suite grows from real product behavior instead of staying synthetic
## Initial Case Categories

We should start with these categories:

1. `core.assignment_pickup`
2. `core.progress_update`
3. `core.blocked_reporting`
4. `governance.approval_required`
5. `governance.company_boundary`
6. `delegation.correct_report`
7. `threads.long_context_followup`
8. `efficiency.no_unnecessary_reloads`

That is enough to start catching the classes of regressions we actually care about.

## Important Guardrails

### 1. Do not rely on judge models alone

Every important scenario needs deterministic checks first.

### 2. Do not gate PRs on a single noisy score

Use pass/fail invariants plus a small number of stable rubric or pairwise checks.

### 3. Do not confuse benchmark score with product quality

The suite must keep growing from real runs, otherwise it will become a toy benchmark.

### 4. Do not evaluate only final output

Trajectory matters for agents:

- did they call the right Paperclip APIs?
- did they ask for approval?
- did they communicate progress?
- did they choose the right issue?

### 5. Do not make the framework vendor-shaped

Our eval model should survive changes in:

- judge provider
- candidate provider
- adapter implementation
- hosted tooling choices
## Open Questions

1. Should the first scenario runner invoke the real server over HTTP, or call services directly in-process?
   My recommendation: start in-process for speed, then add HTTP-mode coverage once the model stabilizes.

2. Should we support Python scorers in v1?
   My recommendation: no. Keep v1 all-TypeScript.

3. Should we commit baseline outputs?
   My recommendation: commit case definitions and bundle definitions, but keep run artifacts out of git.

4. Should we add hosted experiment tracking immediately?
   My recommendation: no. Revisit after the local harness proves useful.
## Final Recommendation

Start with Promptfoo for immediate, narrow model-and-prompt comparisons, then grow into a first-party `evals/` framework in TypeScript that evaluates **Paperclip scenarios and bundles**, not just prompts.

Use this structure:

- Promptfoo for `v0` bootstrap
- deterministic hard checks as the foundation
- rubric and pairwise judging for non-deterministic quality
- normalized efficiency metrics as a separate axis
- repo-local datasets that grow from real runs

Use external tools selectively:

- Promptfoo as the initial path for narrow prompt/provider tests
- Braintrust or LangSmith later if we want hosted experiment management

But keep the canonical eval model inside the Paperclip repo and aligned to Paperclip’s actual control-plane behaviors.
186 doc/plans/2026-03-13-paperclip-skill-tightening-plan.md Normal file
@@ -0,0 +1,186 @@
# Paperclip Skill Tightening Plan

## Status

Deferred follow-up. Do not include in the current token-optimization PR beyond documenting the plan.

## Why This Is Deferred

The `paperclip` skill is part of the critical control-plane safety surface. Tightening it may reduce fresh-session token use, but it also carries prompt-regression risk. We do not yet have evals that would let us safely prove behavior preservation across assignment handling, checkout rules, comment etiquette, approval workflows, and escalation paths.

The current PR should ship the lower-risk infrastructure wins first:

- telemetry normalization
- safe session reuse
- incremental issue/comment context
- bootstrap versus heartbeat prompt separation
- Codex worktree isolation

## Current Problem

Fresh runs still spend substantial input tokens even after the context-path fixes. The remaining large startup cost appears to come from loading the full `paperclip` skill and related instruction surface into context at run start.

The skill currently mixes three kinds of content in one file:

- hot-path heartbeat procedure used on nearly every run
- critical policy and safety invariants
- rare workflow/reference material that most runs do not need

That structure is safe but expensive.

## Goals

- reduce first-run instruction tokens without weakening agent safety
- preserve all current Paperclip control-plane capabilities
- keep common heartbeat behavior explicit and easy for agents to follow
- move rare workflows and reference material out of the hot path
- create a structure that can later be evaluated systematically

## Non-Goals

- changing Paperclip API semantics
- removing required governance rules
- deleting rare workflows
- changing agent defaults in the current PR

## Recommended Direction

### 1. Split Hot Path From Lookup Material

Restructure the skill into:

- an always-loaded core section for the common heartbeat loop
- on-demand material for infrequent workflows and deep reference

The core should cover only what is needed on nearly every wake:

- auth and required headers
- inbox-first assignment retrieval
- mandatory checkout behavior
- `heartbeat-context` first
- incremental comment retrieval rules
- mention/self-assign exception
- blocked-task dedup
- status/comment/release expectations before exit
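The split can be pictured with a small TypeScript sketch. This is purely illustrative of the core/on-demand model; the types, the `trigger` mechanism, and the loader are assumptions, not the actual skill-loading machinery:

```typescript
// Illustrative model: core sections always load; triggered sections load
// only when their trigger condition is active for this wake.

type SectionTier = "core" | "triggered";

interface SkillSection {
  name: string;
  tier: SectionTier;
  trigger?: string; // condition that causes a triggered section to load
  body: string;
}

function sectionsToLoad(sections: SkillSection[], activeTriggers: Set<string>): SkillSection[] {
  return sections.filter(
    (s) => s.tier === "core" || (s.trigger !== undefined && activeTriggers.has(s.trigger))
  );
}
```

Under this model, the hot-path items listed above would all be `core`, and everything in section 4 below would be `triggered`.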
### 2. Normalize The Skill Around One Canonical Procedure

The same rules are currently expressed multiple times across:

- heartbeat steps
- critical rules
- endpoint reference
- workflow examples

Refactor so each operational fact has one primary home:

- procedure
- invariant list
- appendix/reference

This reduces prompt weight and lowers the chance of internal instruction drift.

### 3. Compress Prose Into High-Signal Instruction Forms

Rewrite the hot path using compact operational forms:

- short ordered checklist
- flat invariant list
- minimal examples only where ambiguity would be risky

Reduce:

- narrative explanation
- repeated warnings already covered elsewhere
- large example payloads for common operations
- long endpoint matrices in the main body

### 4. Move Rare Workflows Behind Explicit Triggers

These workflows should remain available but should not dominate fresh-run context:

- OpenClaw invite flow
- project setup flow
- planning `<plan/>` writeback flow
- instructions-path update flow
- detailed link-formatting examples

Recommended approach:

- keep a short pointer in the main skill
- move detailed procedures into sibling skills or referenced docs that agents read only when needed

### 5. Separate Policy From Reference

The skill should distinguish:

- mandatory operating rules
- endpoint lookup/reference
- business-process playbooks

That separation makes it easier to evaluate prompt changes later and lets adapters or orchestration choose what must always be loaded.

## Proposed Target Structure

1. Purpose and authentication
2. Compact heartbeat procedure
3. Hard invariants
4. Required comment/update style
5. Triggered workflow index
6. Appendix/reference

## Rollout Plan

### Phase 1. Inventory And Measure

- annotate the current skill by section and estimate token weight
- identify which sections are truly hot-path versus rare
- capture representative runs to compare before/after prompt size and behavior
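The section-by-section token estimate from the first step could be produced with a small script along these lines. This is a sketch: the function name is made up, and the chars/4 rule is a rough stand-in for running the skill text through a real tokenizer:

```typescript
// Rough per-section token-weight estimate for a markdown skill file,
// keyed by heading. Uses the common chars/4 heuristic, not a real tokenizer.

function sectionTokenEstimates(markdown: string): Record<string, number> {
  const result: Record<string, number> = {};
  let current = "(preamble)"; // text before the first heading
  for (const line of markdown.split("\n")) {
    const m = line.match(/^#{1,6}\s+(.*)$/);
    if (m) current = m[1]; // new section starts at each heading
    result[current] = (result[current] ?? 0) + Math.ceil(line.length / 4);
  }
  return result;
}
```

Even this crude estimate should be enough to rank sections and separate the genuinely hot-path material from the rare workflows.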
### Phase 2. Structural Refactor Without Semantic Changes

- rewrite the main skill into the target structure
- preserve all existing rules and capabilities
- move rare workflow details into referenced companion material
- keep wording changes conservative

### Phase 3. Validate Against Real Scenarios

Run scenario checks for:

- normal assigned heartbeat
- comment-triggered wake
- blocked-task dedup behavior
- approval-resolution wake
- delegation/subtask creation
- board handoff back to user
- plan-request handling

### Phase 4. Decide Default Loading Strategy

After validation, decide whether:

- the entire main skill still loads by default, or
- only the compact core loads by default and rare sections are fetched on demand

Do not change this loading policy without validation.

## Risks

- prompt degradation on control-plane safety rules
- agents forgetting rare but important workflows
- accidental removal of repeated wording that was carrying useful behavior
- introducing ambiguous instruction precedence between the core skill and companion materials

## Preconditions Before Implementation

- define acceptance scenarios for control-plane correctness
- add at least lightweight eval or scripted scenario coverage for key Paperclip flows
- confirm how adapter/bootstrap layering should load skill content versus references

## Success Criteria

- materially lower first-run input tokens for Paperclip-coordinated agents
- no regression in checkout discipline, issue updates, blocked handling, or delegation
- no increase in malformed API usage or ownership mistakes
- agents still complete rare workflows correctly when explicitly asked