docs: add token optimization plan
Co-Authored-By: Paperclip <noreply@paperclip.ing>
doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md
@@ -0,0 +1,383 @@
# Token Optimization Plan

Date: 2026-03-13

Related discussion: https://github.com/paperclipai/paperclip/discussions/449

## Goal

Reduce token consumption materially without reducing agent capability, control-plane visibility, or task completion quality.

This plan is based on:

- the current V1 control-plane design
- the current adapter and heartbeat implementation
- the linked user discussion
- local runtime data from the default Paperclip instance on 2026-03-13

## Executive Summary

The discussion is directionally right about two things:

1. We should preserve session and prompt-cache locality more aggressively.
2. We should separate stable startup instructions from per-heartbeat dynamic context.

But that is not enough on its own.

After reviewing the code and local run data, the token problem appears to have four distinct causes:

1. **Measurement inflation on sessioned adapters.** Some token counters, especially for `codex_local`, appear to be recorded as cumulative session totals instead of per-heartbeat deltas.
2. **Avoidable session resets.** Task sessions are intentionally reset on timer wakes and manual wakes, which destroys cache locality for common heartbeat paths.
3. **Repeated context reacquisition.** The `paperclip` skill tells agents to re-fetch assignments, issue details, ancestors, and full comment threads on every heartbeat. The API does not currently offer efficient delta-oriented alternatives.
4. **Large static instruction surfaces.** Agent instruction files and globally injected skills are reintroduced at startup even when most of that content is unchanged and not needed for the current task.

The correct approach is:

1. fix telemetry so we can trust the numbers
2. preserve reuse where it is safe
3. make context retrieval incremental
4. add session compaction/rotation so long-lived sessions do not become progressively more expensive

## Validated Findings

### 1. Token telemetry is at least partly overstated today

Observed from the local default instance:

- `heartbeat_runs`: 11,360 runs between 2026-02-18 and 2026-03-13
- summed `usage_json.inputTokens`: `2,272,142,368,952`
- summed `usage_json.cachedInputTokens`: `2,217,501,559,420`

Those totals are not credible as true per-heartbeat usage for the observed prompt sizes. Averaged over 11,360 runs, the summed input figure works out to roughly 200 million input tokens per run.

Supporting evidence:

- `adapter.invoke.payload.prompt` averages were small:
  - `codex_local`: ~193 chars average, 6,067 chars max
  - `claude_local`: ~160 chars average, 1,160 chars max
- despite that, many `codex_local` runs report millions of input tokens
- one reused Codex session in local data spans 3,607 runs and recorded `inputTokens` growing up to `1,155,283,166`

Interpretation:

- for sessioned adapters, especially Codex, we are likely storing usage reported by the runtime as a **session total**, not a **per-run delta**
- this undermines trend reporting, optimization work, and customer trust

This does **not** mean there is no real token problem. It means we need a trustworthy baseline before we can judge optimization impact.

### 2. Timer wakes currently throw away reusable task sessions

In `server/src/services/heartbeat.ts`, `shouldResetTaskSessionForWake(...)` returns `true` for:

- `wakeReason === "issue_assigned"`
- `wakeSource === "timer"`
- manual on-demand wakes

That means many normal heartbeats skip saved task-session resume even when the workspace is stable.

Local data supports the impact:

- `timer/system` runs: 6,587 total
- only 976 (about 15%) had a previous session
- only 963 ended with the same session

So timer wakes are the largest heartbeat path, and roughly 85% of them start without any prior task state to resume.

### 3. We repeatedly ask agents to reload the same task context

The `paperclip` skill currently tells agents to do this on essentially every heartbeat:

- fetch assignments
- fetch issue details
- fetch ancestor chain
- fetch full issue comments

Current API shape reinforces that pattern:

- `GET /api/issues/:id/comments` returns the full thread
- there is no `since`, cursor, digest, or summary endpoint for heartbeat consumption
- `GET /api/issues/:id` returns full enriched issue context, not a minimal delta payload

This is safe but expensive. It forces the model to repeatedly consume unchanged information.

### 4. Static instruction payloads are not separated cleanly from dynamic heartbeat prompts

The user discussion suggested a bootstrap prompt. That is the right direction.

Current state:

- the UI exposes `bootstrapPromptTemplate`
- adapter execution paths do not currently use it
- several adapters prepend `instructionsFilePath` content directly into the per-run prompt or system prompt

Result:

- stable instructions are re-sent or re-applied in the same path as dynamic heartbeat content
- we are not deliberately optimizing for provider prompt caching

### 5. We inject more skill surface than most agents need

Local adapters inject repo skills into runtime skill directories.

Current repo skill sizes:

- `skills/paperclip/SKILL.md`: 17,441 bytes
- `skills/create-agent-adapter/SKILL.md`: 31,832 bytes
- `skills/paperclip-create-agent/SKILL.md`: 4,718 bytes
- `skills/para-memory-files/SKILL.md`: 3,978 bytes

Together that is 57,969 bytes, nearly 58 KB, of skill markdown before any company-specific instructions.

Not all of that is necessarily loaded into model context every run, but it increases startup surface area and should be treated as a token budget concern.

## Principles

We should optimize tokens under these rules:

1. **Do not lose functionality.** Agents must still be able to resume work safely, understand why tasks exist, and act within governance rules.
2. **Prefer stable context over repeated context.** Unchanged instructions should not be resent through the most expensive path.
3. **Prefer deltas over full reloads.** Heartbeats should consume only what changed since the last useful run.
4. **Measure normalized deltas, not raw adapter claims.** Especially for sessioned CLIs.
5. **Keep escape hatches.** Board/manual runs may still want a forced fresh session.

## Plan

## Phase 1: Make token telemetry trustworthy

This should happen first.

### Changes

- Store both:
  - raw adapter-reported usage
  - Paperclip-normalized per-run usage
- For sessioned adapters, compute normalized deltas against prior usage for the same persisted session.
- Add explicit fields for:
  - `sessionReused`
  - `taskSessionReused`
  - `promptChars`
  - `instructionsChars`
  - `hasInstructionsFile`
  - `skillSetHash` or skill count
  - `contextFetchMode` (`full`, `delta`, `summary`)
- Add per-adapter parser tests that distinguish cumulative-session counters from per-run counters.

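The delta computation above could look roughly like the following. This is a minimal sketch, not the existing adapter code: the `UsageTotals` shape, the `CounterMode` flag, and the `normalizeUsage` name are assumptions made for illustration, and the fallback for counters that move backwards is one possible policy.

```typescript
// Hypothetical shapes: field names mirror usage_json, but the types are assumptions.
interface UsageTotals {
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
}

// Whether an adapter reports running session totals or per-run numbers.
type CounterMode = "cumulative" | "perRun";

/**
 * Convert adapter-reported usage into per-run usage.
 * For cumulative counters, subtract the totals stored for the previous run
 * of the same persisted session. If a counter moved backwards (for example
 * the runtime reset or compacted the session), keep the raw value instead
 * of emitting a negative delta.
 */
function normalizeUsage(
  raw: UsageTotals,
  previous: UsageTotals | null,
  mode: CounterMode,
): UsageTotals {
  if (mode === "perRun" || previous === null) {
    return { ...raw };
  }
  const delta = (current: number, prior: number): number =>
    current >= prior ? current - prior : current;
  return {
    inputTokens: delta(raw.inputTokens, previous.inputTokens),
    cachedInputTokens: delta(raw.cachedInputTokens, previous.cachedInputTokens),
    outputTokens: delta(raw.outputTokens, previous.outputTokens),
  };
}
```

Storing `raw` alongside the normalized result keeps the migration reversible: if the cumulative-versus-per-run classification of an adapter turns out to be wrong, the normalized column can be recomputed from the raw one.
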
### Why

Without this, we cannot tell whether a reduction came from a real optimization or a reporting artifact.

### Success criteria

- per-run token totals stop exploding on long-lived sessions
- a resumed session’s usage curve is believable and monotonic at the session level, but not double-counted at the run level
- cost pages can show both raw and normalized numbers while we migrate

## Phase 2: Preserve safe session reuse by default

This is the highest-leverage behavior change.

### Changes

- Stop resetting task sessions on ordinary timer wakes.
- Keep resetting on:
  - explicit manual “fresh run” invocations
  - assignment changes
  - workspace mismatch
  - model mismatch / invalid resume errors
- Add an explicit wake flag like `forceFreshSession: true` when the board wants a reset.
- Record why a session was reused or reset in run metadata.

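A minimal sketch of the revised decision follows. It is deliberately not named `shouldResetTaskSessionForWake`, because the real signature in `heartbeat.ts` is not reproduced here. The `WakeContext` fields, the `decideTaskSessionReset` name, and the reason strings are all assumptions for illustration.

```typescript
// Hypothetical wake context; the real inputs in heartbeat.ts may differ.
interface WakeContext {
  wakeSource: "timer" | "manual" | "event";
  wakeReason?: string;
  forceFreshSession?: boolean; // proposed explicit flag from the board
  workspaceMatches: boolean;   // saved session workspace equals the current one
  modelMatches: boolean;       // saved session model equals the configured one
}

interface SessionDecision {
  reset: boolean;
  reason: string; // recorded in run metadata
}

// Reset only for the listed cases; ordinary timer wakes fall through to reuse.
function decideTaskSessionReset(ctx: WakeContext): SessionDecision {
  if (ctx.forceFreshSession === true) {
    return { reset: true, reason: "force_fresh_session" };
  }
  if (ctx.wakeReason === "issue_assigned") {
    return { reset: true, reason: "assignment_changed" };
  }
  if (!ctx.workspaceMatches) {
    return { reset: true, reason: "workspace_mismatch" };
  }
  if (!ctx.modelMatches) {
    return { reset: true, reason: "model_mismatch" };
  }
  return { reset: false, reason: `reuse_${ctx.wakeSource}` };
}
```

Invalid resume errors are not covered by this function because they are only known after the adapter attempts the resume; a caller would presumably retry with a fresh session and record that as the reason.
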
### Why

Timer wakes are the dominant heartbeat path. Resetting them destroys both session continuity and prompt cache reuse.

### Success criteria

- timer wakes resume the prior task session in the large majority of stable-workspace cases
- no increase in stale-session failures
- lower normalized input tokens per timer heartbeat

## Phase 3: Separate static bootstrap context from per-heartbeat context

This is the right version of the discussion’s bootstrap idea.

### Changes

- Implement `bootstrapPromptTemplate` in adapter execution paths.
- Use it only when starting a fresh session, not on resumed sessions.
- Keep `promptTemplate` intentionally small and stable:
  - who I am
  - what triggered this wake
  - which task/comment/approval to prioritize
- Move long-lived setup text out of recurring per-run prompts where possible.
- Add UI guidance and warnings when `promptTemplate` contains high-churn or large inline content.

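One way the fresh-versus-resumed split could be wired is sketched below. `bootstrapPromptTemplate` and `promptTemplate` come from the plan; the `{{name}}` placeholder syntax, the `WakeInfo` fields, and the `render` and `buildPrompt` helpers are assumptions for illustration only.

```typescript
interface PromptConfig {
  bootstrapPromptTemplate?: string; // long-lived setup text, sent once per session
  promptTemplate: string;           // small per-heartbeat prompt
}

// Hypothetical per-wake values for the small recurring prompt.
interface WakeInfo {
  agentName: string;
  trigger: string;
  focus: string;
}

// Minimal {{name}} substitution; unknown placeholders are left unchanged.
function render(template: string, vars: Map<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    vars.get(key) ?? match,
  );
}

// Prepend the bootstrap text only when a new session is being started.
function buildPrompt(config: PromptConfig, wake: WakeInfo, isFreshSession: boolean): string {
  const vars = new Map<string, string>([
    ["agentName", wake.agentName],
    ["trigger", wake.trigger],
    ["focus", wake.focus],
  ]);
  const perRun = render(config.promptTemplate, vars);
  if (isFreshSession && config.bootstrapPromptTemplate) {
    return render(config.bootstrapPromptTemplate, vars) + "\n\n" + perRun;
  }
  return perRun;
}
```

Because the resumed prompt contains only the rendered `promptTemplate`, its structure stays identical from one heartbeat to the next, which is the property provider-side prompt caches tend to reward.
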
### Why

Static instructions and dynamic wake context have different cache behavior and should be modeled separately.

### Success criteria

- fresh-session prompts can remain richer without inflating every resumed heartbeat
- resumed prompts become short and structurally stable
- cache hit rates improve for session-preserving adapters

## Phase 4: Make issue/task context incremental

This is the biggest product change and likely the biggest real token saver after session reuse.

### Changes

Add heartbeat-oriented endpoints and skill behavior:

- `GET /api/agents/me/inbox-lite`
  - minimal assignment list
  - issue id, identifier, status, priority, updatedAt, lastExternalCommentAt
- `GET /api/issues/:id/heartbeat-context`
  - compact issue state
  - parent-chain summary
  - latest execution summary
  - change markers
- `GET /api/issues/:id/comments?after=<cursor>` or `?since=<timestamp>`
  - return only new comments
- optional `GET /api/issues/:id/context-digest`
  - server-generated compact summary for heartbeat use

Update the `paperclip` skill so the default pattern becomes:

1. fetch compact inbox
2. fetch compact task context
3. fetch only new comments unless this is the first read, a mention-triggered wake, or a cache miss
4. fetch full thread only on demand

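The cursor semantics and the skill-side choice between delta and full reads could be sketched as follows. Nothing here is an existing Paperclip API: the `IssueComment` and `CommentPage` shapes, the use of a comment id as the cursor, and both function names are assumptions.

```typescript
// Hypothetical comment shape; the thread is assumed to be ordered oldest-first.
interface IssueComment {
  id: string;
  createdAt: string; // ISO-8601 timestamp
  body: string;
}

interface CommentPage {
  comments: IssueComment[];
  nextCursor: string | null; // id of the last comment the caller has now seen
}

// Return only comments after the cursor. A null or unknown cursor yields the
// full thread, which doubles as the cache-miss fallback.
function commentsAfter(thread: IssueComment[], cursor: string | null): CommentPage {
  const start = cursor === null ? 0 : thread.findIndex((c) => c.id === cursor) + 1;
  const comments = thread.slice(start);
  const last = comments.length > 0 ? comments[comments.length - 1] : undefined;
  return { comments, nextCursor: last ? last.id : cursor };
}

type CommentFetchMode = "full" | "delta";

// Mirrors step 3 of the default pattern: delta unless first read,
// mention-triggered wake, or no usable cursor.
function chooseCommentFetchMode(opts: {
  firstRead: boolean;
  mentionTriggered: boolean;
  hasCursor: boolean;
}): CommentFetchMode {
  return opts.firstRead || opts.mentionTriggered || !opts.hasCursor ? "full" : "delta";
}
```

Treating an unknown cursor as "return everything" keeps the endpoint safe when a comment is deleted or a session loses its state, at the cost of one full read.
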
### Why

Today we are using full-fidelity board APIs as heartbeat APIs. That is convenient but token-inefficient.

### Success criteria

- after first task acquisition, most heartbeats consume only deltas
- repeated blocked-task or long-thread work no longer replays the whole comment history
- mention-triggered wakes still have enough context to respond correctly

## Phase 5: Add session compaction and controlled rotation

This protects against long-lived session bloat.

### Changes

- Add rotation thresholds per adapter/session:
  - turns
  - normalized input tokens
  - age
  - cache hit degradation
- Before rotating, produce a structured carry-forward summary:
  - current objective
  - work completed
  - open decisions
  - blockers
  - files/artifacts touched
  - next recommended action
- Persist that summary in task session state or runtime state.
- Start the next session with:
  - bootstrap prompt
  - compact carry-forward summary
  - current wake trigger

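The threshold check and the shape of the carry-forward summary might look like the sketch below. Every name, field, and reason string is an assumption for illustration, and no threshold values are proposed here because they would need to come from normalized telemetry.

```typescript
// All names below are illustrative assumptions, not existing Paperclip types.
interface SessionStats {
  turns: number;
  normalizedInputTokens: number;
  ageMs: number;
  recentCacheHitRatio: number; // 0..1 over the most recent runs
}

interface RotationThresholds {
  maxTurns: number;
  maxNormalizedInputTokens: number;
  maxAgeMs: number;
  minCacheHitRatio: number;
}

interface CarryForwardSummary {
  currentObjective: string;
  workCompleted: string[];
  openDecisions: string[];
  blockers: string[];
  artifactsTouched: string[];
  nextRecommendedAction: string;
}

// Returns every threshold that was crossed; an empty array means keep the session.
function rotationReasons(stats: SessionStats, limits: RotationThresholds): string[] {
  const reasons: string[] = [];
  if (stats.turns >= limits.maxTurns) reasons.push("turns");
  if (stats.normalizedInputTokens >= limits.maxNormalizedInputTokens) reasons.push("input_tokens");
  if (stats.ageMs >= limits.maxAgeMs) reasons.push("age");
  if (stats.recentCacheHitRatio < limits.minCacheHitRatio) reasons.push("cache_hit_degradation");
  return reasons;
}

// Assemble the first prompt of the rotated session:
// bootstrap prompt, compact carry-forward summary, current wake trigger.
function buildRotatedSessionPrompt(
  bootstrap: string,
  summary: CarryForwardSummary,
  wakeTrigger: string,
): string {
  const list = (items: string[]): string => (items.length > 0 ? items.join("; ") : "none");
  return [
    bootstrap,
    "Carry-forward summary:",
    `- Objective: ${summary.currentObjective}`,
    `- Completed: ${list(summary.workCompleted)}`,
    `- Open decisions: ${list(summary.openDecisions)}`,
    `- Blockers: ${list(summary.blockers)}`,
    `- Artifacts touched: ${list(summary.artifactsTouched)}`,
    `- Next action: ${summary.nextRecommendedAction}`,
    `Wake trigger: ${wakeTrigger}`,
  ].join("\n");
}
```

Returning the list of crossed thresholds, instead of a boolean, lets run metadata record why a rotation happened, which matches the Phase 2 goal of recording reuse and reset reasons.
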
### Why

Even when reuse is desirable, some sessions become too expensive to keep alive indefinitely.

### Success criteria

- very long sessions stop growing without bound
- rotating a session does not cause loss of task continuity
- successful task completion rate stays flat or improves

## Phase 6: Reduce unnecessary skill surface

### Changes

- Move from “inject all repo skills” to an allowlist per agent or per adapter.
- Default local runtime skill set should likely be:
  - `paperclip`
- Add opt-in skills for specialized agents:
  - `paperclip-create-agent`
  - `para-memory-files`
  - `create-agent-adapter`
- Expose active skill set in agent config and run metadata.

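Allowlist resolution could be as small as the following sketch. The skill names come from the repo skill list above; `DEFAULT_SKILLS`, `resolveSkillSet`, and the idea of passing opt-ins as a plain string array are assumptions.

```typescript
// Default set proposed in this phase; everything else is opt-in.
const DEFAULT_SKILLS: readonly string[] = ["paperclip"];

/**
 * Resolve the skills to inject for one agent.
 * Only skills that exist in the repo are returned, and the result is sorted
 * so the same set always serializes (and hashes) the same way.
 */
function resolveSkillSet(
  available: readonly string[],
  optIn: readonly string[] = [],
): string[] {
  const wanted = new Set<string>([...DEFAULT_SKILLS, ...optIn]);
  return available.filter((name) => wanted.has(name)).sort();
}
```

A sorted result makes it straightforward to derive the `skillSetHash` field from Phase 1 and to show the active skill set in run metadata without ordering noise.
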
### Why

Most agents do not need adapter-authoring or memory-system skills on every run.

### Success criteria

- smaller startup instruction surface
- no loss of capability for specialist agents that explicitly need extra skills

## Rollout Order

Recommended order:

1. telemetry normalization
2. timer-wake session reuse
3. bootstrap prompt implementation
4. heartbeat delta APIs + `paperclip` skill rewrite
5. session compaction/rotation
6. skill allowlists

## Acceptance Metrics

We should treat this plan as successful only if we improve both efficiency and task outcomes.

Primary metrics:

- normalized input tokens per successful heartbeat
- normalized input tokens per completed issue
- cache-hit ratio for sessioned adapters
- session reuse rate by invocation source
- fraction of heartbeats that fetch full comment threads

Guardrail metrics:

- task completion rate
- blocked-task rate
- stale-session failure rate
- manual intervention rate
- issue reopen rate after agent completion

Initial targets:

- 30% to 50% reduction in normalized input tokens per successful resumed heartbeat
- 80%+ session reuse on stable timer wakes
- 80%+ reduction in full-thread comment reloads after first task read
- no statistically meaningful regression in completion rate or failure rate

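Two of the primary metrics can be computed directly from per-run records once Phase 1 lands. The sketch below assumes a hypothetical `RunRecord` shape and interprets "per successful heartbeat" as the mean over successful runs only; both are assumptions, and the real analytics query may define them differently.

```typescript
// Hypothetical per-run record built from the Phase 1 telemetry fields.
interface RunRecord {
  source: string; // invocation source, e.g. "timer"
  succeeded: boolean;
  sessionReused: boolean;
  normalizedInputTokens: number;
}

// Mean normalized input tokens over successful runs; null when there are none.
function tokensPerSuccessfulHeartbeat(runs: RunRecord[]): number | null {
  const ok = runs.filter((r) => r.succeeded);
  if (ok.length === 0) return null;
  const total = ok.reduce((sum, r) => sum + r.normalizedInputTokens, 0);
  return total / ok.length;
}

// Fraction of runs per invocation source that reused a session.
function reuseRateBySource(runs: RunRecord[]): Map<string, number> {
  const counts = new Map<string, { reused: number; total: number }>();
  for (const r of runs) {
    const c = counts.get(r.source) ?? { reused: 0, total: 0 };
    c.total += 1;
    if (r.sessionReused) c.reused += 1;
    counts.set(r.source, c);
  }
  const rates = new Map<string, number>();
  counts.forEach((c, source) => rates.set(source, c.reused / c.total));
  return rates;
}
```

As a reference point, applying the reuse-rate definition to the local data in Finding 2 (976 of 6,587 timer runs with a previous session) gives about 0.15, against the 0.80 target for stable timer wakes.
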
## Concrete Engineering Tasks

1. Add normalized usage fields and migration support for run analytics.
2. Patch sessioned adapter accounting to compute deltas from prior session totals.
3. Change `shouldResetTaskSessionForWake(...)` so timer wakes do not reset by default.
4. Implement `bootstrapPromptTemplate` end-to-end in adapter execution.
5. Add compact heartbeat context and incremental comment APIs.
6. Rewrite `skills/paperclip/SKILL.md` around delta-fetch behavior.
7. Add session rotation with carry-forward summaries.
8. Replace global skill injection with explicit allowlists.

## Recommendation

Treat this as a two-track effort:

- **Track A: correctness and no-regret wins**
  - telemetry normalization
  - timer-wake session reuse
  - bootstrap prompt implementation
- **Track B: structural token reduction**
  - delta APIs
  - skill rewrite
  - session compaction
  - skill allowlists

If we only do Track A, we will improve things, but agents will still re-read too much unchanged task context.

If we only do Track B without fixing telemetry first, we will not be able to prove the gains cleanly.