From 84fc6d4a87f262c7c1bc4e1f3bf8c5baa13c202b Mon Sep 17 00:00:00 2001 From: Dotta Date: Fri, 13 Mar 2026 08:04:57 -0500 Subject: [PATCH] docs: add token optimization plan Co-Authored-By: Paperclip --- .../2026-03-13-TOKEN-OPTIMIZATION-PLAN.md | 383 ++++++++++++++++++ 1 file changed, 383 insertions(+) create mode 100644 doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md diff --git a/doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md b/doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md new file mode 100644 index 00000000..678444ac --- /dev/null +++ b/doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md @@ -0,0 +1,383 @@ +# Token Optimization Plan + +Date: 2026-03-13 +Related discussion: https://github.com/paperclipai/paperclip/discussions/449 + +## Goal + +Reduce token consumption materially without reducing agent capability, control-plane visibility, or task completion quality. + +This plan is based on: + +- the current V1 control-plane design +- the current adapter and heartbeat implementation +- the linked user discussion +- local runtime data from the default Paperclip instance on 2026-03-13 + +## Executive Summary + +The discussion is directionally right about two things: + +1. We should preserve session and prompt-cache locality more aggressively. +2. We should separate stable startup instructions from per-heartbeat dynamic context. + +But that is not enough on its own. + +After reviewing the code and local run data, the token problem appears to have four distinct causes: + +1. **Measurement inflation on sessioned adapters.** Some token counters, especially for `codex_local`, appear to be recorded as cumulative session totals instead of per-heartbeat deltas. +2. **Avoidable session resets.** Task sessions are intentionally reset on timer wakes and manual wakes, which destroys cache locality for common heartbeat paths. +3. **Repeated context reacquisition.** The `paperclip` skill tells agents to re-fetch assignments, issue details, ancestors, and full comment threads on every heartbeat. The API does not currently offer efficient delta-oriented alternatives. +4. **Large static instruction surfaces.** Agent instruction files and globally injected skills are reintroduced at startup even when most of that content is unchanged and not needed for the current task. + +The correct approach is: + +1. fix telemetry so we can trust the numbers +2. preserve reuse where it is safe +3. make context retrieval incremental +4. add session compaction/rotation so long-lived sessions do not become progressively more expensive + +## Validated Findings + +### 1. Token telemetry is at least partly overstated today + +Observed from the local default instance: + +- `heartbeat_runs`: 11,360 runs between 2026-02-18 and 2026-03-13 +- summed `usage_json.inputTokens`: `2,272,142,368,952` +- summed `usage_json.cachedInputTokens`: `2,217,501,559,420` + +Those totals are not credible as true per-heartbeat usage for the observed prompt sizes. + +Supporting evidence: + +- `adapter.invoke.payload.prompt` averages were small: + - `codex_local`: ~193 chars average, 6,067 chars max + - `claude_local`: ~160 chars average, 1,160 chars max +- despite that, many `codex_local` runs report millions of input tokens +- one reused Codex session in local data spans 3,607 runs and recorded `inputTokens` growing up to `1,155,283,166` + +Interpretation: + +- for sessioned adapters, especially Codex, we are likely storing usage reported by the runtime as a **session total**, not a **per-run delta** +- this makes trend reporting, optimization work, and customer trust worse + +This does **not** mean there is no real token problem. It means we need a trustworthy baseline before we can judge optimization impact. + +### 2. Timer wakes currently throw away reusable task sessions + +In `server/src/services/heartbeat.ts`, `shouldResetTaskSessionForWake(...)` returns `true` for: + +- `wakeReason === "issue_assigned"` +- `wakeSource === "timer"` +- manual on-demand wakes + +That means many normal heartbeats skip saved task-session resume even when the workspace is stable. + +Local data supports the impact: + +- `timer/system` runs: 6,587 total +- only 976 had a previous session +- only 963 ended with the same session + +So timer wakes are the largest heartbeat path and are mostly not resuming prior task state. + +### 3. We repeatedly ask agents to reload the same task context + +The `paperclip` skill currently tells agents to do this on essentially every heartbeat: + +- fetch assignments +- fetch issue details +- fetch ancestor chain +- fetch full issue comments + +Current API shape reinforces that pattern: + +- `GET /api/issues/:id/comments` returns the full thread +- there is no `since`, cursor, digest, or summary endpoint for heartbeat consumption +- `GET /api/issues/:id` returns full enriched issue context, not a minimal delta payload + +This is safe but expensive. It forces the model to repeatedly consume unchanged information. + +### 4. Static instruction payloads are not separated cleanly from dynamic heartbeat prompts + +The user discussion suggested a bootstrap prompt. That is the right direction. + +Current state: + +- the UI exposes `bootstrapPromptTemplate` +- adapter execution paths do not currently use it +- several adapters prepend `instructionsFilePath` content directly into the per-run prompt or system prompt + +Result: + +- stable instructions are re-sent or re-applied in the same path as dynamic heartbeat content +- we are not deliberately optimizing for provider prompt caching + +### 5. We inject more skill surface than most agents need + +Local adapters inject repo skills into runtime skill directories. + +Current repo skill sizes: + +- `skills/paperclip/SKILL.md`: 17,441 bytes +- `skills/create-agent-adapter/SKILL.md`: 31,832 bytes +- `skills/paperclip-create-agent/SKILL.md`: 4,718 bytes +- `skills/para-memory-files/SKILL.md`: 3,978 bytes + +That is nearly 58 KB of skill markdown before any company-specific instructions. + +Not all of that is necessarily loaded into model context every run, but it increases startup surface area and should be treated as a token budget concern. + +## Principles + +We should optimize tokens under these rules: + +1. **Do not lose functionality.** Agents must still be able to resume work safely, understand why tasks exist, and act within governance rules. +2. **Prefer stable context over repeated context.** Unchanged instructions should not be resent through the most expensive path. +3. **Prefer deltas over full reloads.** Heartbeats should consume only what changed since the last useful run. +4. **Measure normalized deltas, not raw adapter claims.** Especially for sessioned CLIs. +5. **Keep escape hatches.** Board/manual runs may still want a forced fresh session. + +## Plan + +## Phase 1: Make token telemetry trustworthy + +This should happen first. + +### Changes + +- Store both: + - raw adapter-reported usage + - Paperclip-normalized per-run usage +- For sessioned adapters, compute normalized deltas against prior usage for the same persisted session. +- Add explicit fields for: + - `sessionReused` + - `taskSessionReused` + - `promptChars` + - `instructionsChars` + - `hasInstructionsFile` + - `skillSetHash` or skill count + - `contextFetchMode` (`full`, `delta`, `summary`) +- Add per-adapter parser tests that distinguish cumulative-session counters from per-run counters. + +### Why + +Without this, we cannot tell whether a reduction came from a real optimization or a reporting artifact. + +### Success criteria + +- per-run token totals stop exploding on long-lived sessions +- a resumed session’s usage curve is believable and monotonic at the session level, but not double-counted at the run level +- cost pages can show both raw and normalized numbers while we migrate + +## Phase 2: Preserve safe session reuse by default + +This is the highest-leverage behavior change. + +### Changes + +- Stop resetting task sessions on ordinary timer wakes. +- Keep resetting on: + - explicit manual “fresh run” invocations + - assignment changes + - workspace mismatch + - model mismatch / invalid resume errors +- Add an explicit wake flag like `forceFreshSession: true` when the board wants a reset. +- Record why a session was reused or reset in run metadata. + +### Why + +Timer wakes are the dominant heartbeat path. Resetting them destroys both session continuity and prompt cache reuse. + +### Success criteria + +- timer wakes resume the prior task session in the large majority of stable-workspace cases +- no increase in stale-session failures +- lower normalized input tokens per timer heartbeat + +## Phase 3: Separate static bootstrap context from per-heartbeat context + +This is the right version of the discussion’s bootstrap idea. + +### Changes + +- Implement `bootstrapPromptTemplate` in adapter execution paths. +- Use it only when starting a fresh session, not on resumed sessions. +- Keep `promptTemplate` intentionally small and stable: + - who I am + - what triggered this wake + - which task/comment/approval to prioritize +- Move long-lived setup text out of recurring per-run prompts where possible. +- Add UI guidance and warnings when `promptTemplate` contains high-churn or large inline content. + +### Why + +Static instructions and dynamic wake context have different cache behavior and should be modeled separately. + +### Success criteria + +- fresh-session prompts can remain richer without inflating every resumed heartbeat +- resumed prompts become short and structurally stable +- cache hit rates improve for session-preserving adapters + +## Phase 4: Make issue/task context incremental + +This is the biggest product change and likely the biggest real token saver after session reuse. + +### Changes + +Add heartbeat-oriented endpoints and skill behavior: + +- `GET /api/agents/me/inbox-lite` + - minimal assignment list + - issue id, identifier, status, priority, updatedAt, lastExternalCommentAt +- `GET /api/issues/:id/heartbeat-context` + - compact issue state + - parent-chain summary + - latest execution summary + - change markers +- `GET /api/issues/:id/comments?after=` or `?since=` + - return only new comments +- optional `GET /api/issues/:id/context-digest` + - server-generated compact summary for heartbeat use + +Update the `paperclip` skill so the default pattern becomes: + +1. fetch compact inbox +2. fetch compact task context +3. fetch only new comments unless this is the first read, a mention-triggered wake, or a cache miss +4. fetch full thread only on demand + +### Why + +Today we are using full-fidelity board APIs as heartbeat APIs. That is convenient but token-inefficient. + +### Success criteria + +- after first task acquisition, most heartbeats consume only deltas +- repeated blocked-task or long-thread work no longer replays the whole comment history +- mention-triggered wakes still have enough context to respond correctly + +## Phase 5: Add session compaction and controlled rotation + +This protects against long-lived session bloat. + +### Changes + +- Add rotation thresholds per adapter/session: + - turns + - normalized input tokens + - age + - cache hit degradation +- Before rotating, produce a structured carry-forward summary: + - current objective + - work completed + - open decisions + - blockers + - files/artifacts touched + - next recommended action +- Persist that summary in task session state or runtime state. +- Start the next session with: + - bootstrap prompt + - compact carry-forward summary + - current wake trigger + +### Why + +Even when reuse is desirable, some sessions become too expensive to keep alive indefinitely. + +### Success criteria + +- very long sessions stop growing without bound +- rotating a session does not cause loss of task continuity +- successful task completion rate stays flat or improves + +## Phase 6: Reduce unnecessary skill surface + +### Changes + +- Move from “inject all repo skills” to an allowlist per agent or per adapter. +- Default local runtime skill set should likely be: + - `paperclip` +- Add opt-in skills for specialized agents: + - `paperclip-create-agent` + - `para-memory-files` + - `create-agent-adapter` +- Expose active skill set in agent config and run metadata. + +### Why + +Most agents do not need adapter-authoring or memory-system skills on every run. + +### Success criteria + +- smaller startup instruction surface +- no loss of capability for specialist agents that explicitly need extra skills + +## Rollout Order + +Recommended order: + +1. telemetry normalization +2. timer-wake session reuse +3. bootstrap prompt implementation +4. heartbeat delta APIs + `paperclip` skill rewrite +5. session compaction/rotation +6. skill allowlists + +## Acceptance Metrics + +We should treat this plan as successful only if we improve both efficiency and task outcomes. + +Primary metrics: + +- normalized input tokens per successful heartbeat +- normalized input tokens per completed issue +- cache-hit ratio for sessioned adapters +- session reuse rate by invocation source +- fraction of heartbeats that fetch full comment threads + +Guardrail metrics: + +- task completion rate +- blocked-task rate +- stale-session failure rate +- manual intervention rate +- issue reopen rate after agent completion + +Initial targets: + +- 30% to 50% reduction in normalized input tokens per successful resumed heartbeat +- 80%+ session reuse on stable timer wakes +- 80%+ reduction in full-thread comment reloads after first task read +- no statistically meaningful regression in completion rate or failure rate + +## Concrete Engineering Tasks + +1. Add normalized usage fields and migration support for run analytics. +2. Patch sessioned adapter accounting to compute deltas from prior session totals. +3. Change `shouldResetTaskSessionForWake(...)` so timer wakes do not reset by default. +4. Implement `bootstrapPromptTemplate` end-to-end in adapter execution. +5. Add compact heartbeat context and incremental comment APIs. +6. Rewrite `skills/paperclip/SKILL.md` around delta-fetch behavior. +7. Add session rotation with carry-forward summaries. +8. Replace global skill injection with explicit allowlists. + +## Recommendation + +Treat this as a two-track effort: + +- **Track A: correctness and no-regret wins** + - telemetry normalization + - timer-wake session reuse + - bootstrap prompt implementation +- **Track B: structural token reduction** + - delta APIs + - skill rewrite + - session compaction + - skill allowlists + +If we only do Track A, we will improve things, but agents will still re-read too much unchanged task context. + +If we only do Track B without fixing telemetry first, we will not be able to prove the gains cleanly.