docs: add token optimization plan

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-03-13 08:04:57 -05:00
parent 9c7d9ded1e
commit 84fc6d4a87
1 changed files with 383 additions and 0 deletions
--- a/doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md
+++ b/doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md
@@ -0,0 +1,383 @@
+# Token Optimization Plan
+
+Date: 2026-03-13  
+Related discussion: https://github.com/paperclipai/paperclip/discussions/449
+
+## Goal
+
+Reduce token consumption materially without reducing agent capability, control-plane visibility, or task completion quality.
+
+This plan is based on:
+
+- the current V1 control-plane design
+- the current adapter and heartbeat implementation
+- the linked user discussion
+- local runtime data from the default Paperclip instance on 2026-03-13
+
+## Executive Summary
+
+The discussion is directionally right about two things:
+
+1. We should preserve session and prompt-cache locality more aggressively.
+2. We should separate stable startup instructions from per-heartbeat dynamic context.
+
+But that is not enough on its own.
+
+After reviewing the code and local run data, the token problem appears to have four distinct causes:
+
+1. **Measurement inflation on sessioned adapters.** Some token counters, especially for `codex_local`, appear to be recorded as cumulative session totals instead of per-heartbeat deltas.
+2. **Avoidable session resets.** Task sessions are intentionally reset on timer wakes and manual wakes, which destroys cache locality for common heartbeat paths.
+3. **Repeated context reacquisition.** The `paperclip` skill tells agents to re-fetch assignments, issue details, ancestors, and full comment threads on every heartbeat. The API does not currently offer efficient delta-oriented alternatives.
+4. **Large static instruction surfaces.** Agent instruction files and globally injected skills are reintroduced at startup even when most of that content is unchanged and not needed for the current task.
+
+The correct approach is:
+
+1. fix telemetry so we can trust the numbers
+2. preserve reuse where it is safe
+3. make context retrieval incremental
+4. add session compaction/rotation so long-lived sessions do not become progressively more expensive
+
+## Validated Findings
+
+### 1. Token telemetry is at least partly overstated today
+
+Observed from the local default instance:
+
+- `heartbeat_runs`: 11,360 runs between 2026-02-18 and 2026-03-13
+- summed `usage_json.inputTokens`: `2,272,142,368,952`
+- summed `usage_json.cachedInputTokens`: `2,217,501,559,420`
+
+Those totals are not credible as true per-heartbeat usage for the observed prompt sizes: they imply an average of roughly 200 million input tokens per run.
+
+Supporting evidence:
+
+- `adapter.invoke.payload.prompt` averages were small:
+  - `codex_local`: ~193 chars average, 6,067 chars max
+  - `claude_local`: ~160 chars average, 1,160 chars max
+- despite that, many `codex_local` runs report millions of input tokens
+- one reused Codex session in local data spans 3,607 runs and recorded `inputTokens` growing up to `1,155,283,166`
+
+Interpretation:
+
+- for sessioned adapters, especially Codex, we are likely storing usage reported by the runtime as a **session total**, not a **per-run delta**
+- this distorts trend reporting, misdirects optimization work, and erodes customer trust
+
+This does **not** mean there is no real token problem. It means we need a trustworthy baseline before we can judge optimization impact.
+
+### 2. Timer wakes currently throw away reusable task sessions
+
+In `server/src/services/heartbeat.ts`, `shouldResetTaskSessionForWake(...)` returns `true` for:
+
+- `wakeReason === "issue_assigned"`
+- `wakeSource === "timer"`
+- manual on-demand wakes
+
+That means many normal heartbeats skip saved task-session resume even when the workspace is stable.
+
+Local data supports the impact:
+
+- `timer/system` runs: 6,587 total
+- only 976 had a previous session
+- only 963 ended with the same session
+
+So timer wakes are the largest heartbeat path, and about 85% of them (5,611 of 6,587) start without a previous session to resume.
+
+### 3. We repeatedly ask agents to reload the same task context
+
+The `paperclip` skill currently tells agents to do this on essentially every heartbeat:
+
+- fetch assignments
+- fetch issue details
+- fetch ancestor chain
+- fetch full issue comments
+
+Current API shape reinforces that pattern:
+
+- `GET /api/issues/:id/comments` returns the full thread
+- there is no `since`, cursor, digest, or summary endpoint for heartbeat consumption
+- `GET /api/issues/:id` returns full enriched issue context, not a minimal delta payload
+
+This is safe but expensive. It forces the model to repeatedly consume unchanged information.
+
+### 4. Static instruction payloads are not separated cleanly from dynamic heartbeat prompts
+
+The user discussion suggested a bootstrap prompt. That is the right direction.
+
+Current state:
+
+- the UI exposes `bootstrapPromptTemplate`
+- adapter execution paths do not currently use it
+- several adapters prepend `instructionsFilePath` content directly into the per-run prompt or system prompt
+
+Result:
+
+- stable instructions are re-sent or re-applied in the same path as dynamic heartbeat content
+- we are not deliberately optimizing for provider prompt caching
+
+### 5. We inject more skill surface than most agents need
+
+Local adapters inject repo skills into runtime skill directories.
+
+Current repo skill sizes:
+
+- `skills/paperclip/SKILL.md`: 17,441 bytes
+- `skills/create-agent-adapter/SKILL.md`: 31,832 bytes
+- `skills/paperclip-create-agent/SKILL.md`: 4,718 bytes
+- `skills/para-memory-files/SKILL.md`: 3,978 bytes
+
+That is 57,969 bytes (about 58 KB) of skill markdown before any company-specific instructions.
+
+Not all of that is necessarily loaded into model context every run, but it increases startup surface area and should be treated as a token budget concern.
+
+## Principles
+
+We should optimize tokens under these rules:
+
+1. **Do not lose functionality.** Agents must still be able to resume work safely, understand why tasks exist, and act within governance rules.
+2. **Prefer stable context over repeated context.** Unchanged instructions should not be resent through the most expensive path.
+3. **Prefer deltas over full reloads.** Heartbeats should consume only what changed since the last useful run.
+4. **Measure normalized deltas, not raw adapter claims.** Especially for sessioned CLIs.
+5. **Keep escape hatches.** Board/manual runs may still want a forced fresh session.
+
+## Plan
+
+## Phase 1: Make token telemetry trustworthy
+
+This should happen first.
+
+### Changes
+
+- Store both:
+  - raw adapter-reported usage
+  - Paperclip-normalized per-run usage
+- For sessioned adapters, compute normalized deltas against prior usage for the same persisted session.
+- Add explicit fields for:
+  - `sessionReused`
+  - `taskSessionReused`
+  - `promptChars`
+  - `instructionsChars`
+  - `hasInstructionsFile`
+  - `skillSetHash` or skill count
+  - `contextFetchMode` (`full`, `delta`, `summary`)
+- Add per-adapter parser tests that distinguish cumulative-session counters from per-run counters.
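+
+A minimal sketch of the per-run normalization described above, assuming each adapter can be tagged as reporting either cumulative session totals or per-run numbers. The `Usage` shape, the `CounterMode` tag, and `normalizeUsage` are hypothetical names for illustration, not existing Paperclip code.
+
+```typescript
+// Hypothetical shape; field names loosely mirror usage_json and are assumptions.
+interface Usage {
+  inputTokens: number;
+  cachedInputTokens: number;
+  outputTokens: number;
+}
+
+// Assumed per-adapter tag: does the runtime report session totals or per-run numbers?
+type CounterMode = "cumulative" | "per_run";
+
+// Returns the usage attributable to this run only.
+// `prior` is the raw usage stored for the previous run of the same session,
+// or null when the session is fresh.
+function normalizeUsage(raw: Usage, prior: Usage | null, mode: CounterMode): Usage {
+  if (mode === "per_run" || prior === null) return { ...raw };
+  // If a counter went backwards, assume the runtime reset it and keep the raw value.
+  const delta = (now: number, before: number): number =>
+    now >= before ? now - before : now;
+  return {
+    inputTokens: delta(raw.inputTokens, prior.inputTokens),
+    cachedInputTokens: delta(raw.cachedInputTokens, prior.cachedInputTokens),
+    outputTokens: delta(raw.outputTokens, prior.outputTokens),
+  };
+}
+```
+
+Storing the raw value next to the normalized one keeps the subtraction reproducible if the mode assigned to an adapter later turns out to be wrong.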
+
+### Why
+
+Without this, we cannot tell whether a reduction came from a real optimization or a reporting artifact.
+
+### Success criteria
+
+- per-run token totals stop exploding on long-lived sessions
+- a resumed session’s cumulative usage is plausible and grows monotonically, while each run records only its own delta rather than the running total
+- cost pages can show both raw and normalized numbers while we migrate
+
+## Phase 2: Preserve safe session reuse by default
+
+This is the highest-leverage behavior change.
+
+### Changes
+
+- Stop resetting task sessions on ordinary timer wakes.
+- Keep resetting on:
+  - explicit manual “fresh run” invocations
+  - assignment changes
+  - workspace mismatch
+  - model mismatch / invalid resume errors
+- Add an explicit wake flag like `forceFreshSession: true` when the board wants a reset.
+- Record why a session was reused or reset in run metadata.
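+
+The reset rules above could be sketched as follows. This is an illustration only: the real signature of `shouldResetTaskSessionForWake(...)` is not shown in this plan, so `WakeContext`, `SessionDecision`, `decideTaskSessionReset`, and the reason strings are hypothetical.
+
+```typescript
+// Hypothetical inputs; only wakeSource, wakeReason and forceFreshSession
+// are named in this plan, the rest are assumptions.
+interface WakeContext {
+  wakeSource: string;
+  wakeReason?: string;
+  forceFreshSession?: boolean;
+  workspaceMatches: boolean;
+  modelMatches: boolean;
+}
+
+interface SessionDecision {
+  reset: boolean;
+  reason: string; // recorded in run metadata
+}
+
+function decideTaskSessionReset(ctx: WakeContext): SessionDecision {
+  if (ctx.forceFreshSession === true) return { reset: true, reason: "force_fresh_session" };
+  if (ctx.wakeReason === "issue_assigned") return { reset: true, reason: "assignment_changed" };
+  if (!ctx.workspaceMatches) return { reset: true, reason: "workspace_mismatch" };
+  if (!ctx.modelMatches) return { reset: true, reason: "model_mismatch" };
+  // Ordinary timer wakes fall through and keep the saved task session.
+  return { reset: false, reason: "reused" };
+}
+```
+
+Returning a reason alongside the boolean lets reuse rates be broken down by cause without a second code path.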
+
+### Why
+
+Timer wakes are the dominant heartbeat path. Resetting them destroys both session continuity and prompt cache reuse.
+
+### Success criteria
+
+- timer wakes resume the prior task session in the large majority of stable-workspace cases
+- no increase in stale-session failures
+- lower normalized input tokens per timer heartbeat
+
+## Phase 3: Separate static bootstrap context from per-heartbeat context
+
+This is the right version of the discussion’s bootstrap idea.
+
+### Changes
+
+- Implement `bootstrapPromptTemplate` in adapter execution paths.
+- Use it only when starting a fresh session, not on resumed sessions.
+- Keep `promptTemplate` intentionally small and stable:
+  - who I am
+  - what triggered this wake
+  - which task/comment/approval to prioritize
+- Move long-lived setup text out of recurring per-run prompts where possible.
+- Add UI guidance and warnings when `promptTemplate` contains high-churn or large inline content.
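+
+One way to picture the split, assuming both templates have already been rendered to strings before the adapter is invoked. `PromptInputs` and `composePrompt` are hypothetical names; the actual adapter interfaces may look different.
+
+```typescript
+// Hypothetical inputs: rendered output of bootstrapPromptTemplate and promptTemplate.
+interface PromptInputs {
+  sessionResumed: boolean;
+  bootstrapPrompt: string;
+  heartbeatPrompt: string;
+}
+
+// Fresh sessions receive bootstrap text followed by the heartbeat prompt.
+// Resumed sessions receive only the short heartbeat prompt.
+function composePrompt(p: PromptInputs): string {
+  if (p.sessionResumed || p.bootstrapPrompt.trim() === "") {
+    return p.heartbeatPrompt;
+  }
+  return `${p.bootstrapPrompt}\n\n${p.heartbeatPrompt}`;
+}
+```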
+
+### Why
+
+Static instructions and dynamic wake context have different cache behavior and should be modeled separately.
+
+### Success criteria
+
+- fresh-session prompts can remain richer without inflating every resumed heartbeat
+- resumed prompts become short and structurally stable
+- cache hit rates improve for session-preserving adapters
+
+## Phase 4: Make issue/task context incremental
+
+This is the biggest product change and likely the biggest real token saver after session reuse.
+
+### Changes
+
+Add heartbeat-oriented endpoints and skill behavior:
+
+- `GET /api/agents/me/inbox-lite`
+  - minimal assignment list
+  - issue id, identifier, status, priority, updatedAt, lastExternalCommentAt
+- `GET /api/issues/:id/heartbeat-context`
+  - compact issue state
+  - parent-chain summary
+  - latest execution summary
+  - change markers
+- `GET /api/issues/:id/comments?after=<cursor>` or `?since=<timestamp>`
+  - return only new comments
+- optional `GET /api/issues/:id/context-digest`
+  - server-generated compact summary for heartbeat use
+
+Update the `paperclip` skill so the default pattern becomes:
+
+1. fetch compact inbox
+2. fetch compact task context
+3. fetch only new comments unless this is the first read, a mention-triggered wake, or a cache miss
+4. fetch full thread only on demand
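+
+As a rough illustration of the `?since=<timestamp>` variant, the server-side filter could look like the sketch below. The `IssueComment` shape and `commentsSince` are assumptions for this example, and a cursor-based `?after=` variant would compare comment ids or sequence numbers instead.
+
+```typescript
+// Hypothetical comment shape; createdAt is assumed to be an ISO-8601 string.
+interface IssueComment {
+  id: string;
+  createdAt: string;
+  body: string;
+}
+
+// Returns only comments created strictly after `since`.
+// A missing or unparseable `since` falls back to the full thread (first read).
+function commentsSince(thread: IssueComment[], since: string | null): IssueComment[] {
+  if (since === null) return thread;
+  const cutoff = Date.parse(since);
+  if (Number.isNaN(cutoff)) return thread;
+  return thread.filter((c) => Date.parse(c.createdAt) > cutoff);
+}
+```
+
+Falling back to the full thread on a bad cursor trades tokens for safety, which matches the "first read or cache miss" exception in the list above.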
+
+### Why
+
+Today we are using full-fidelity board APIs as heartbeat APIs. That is convenient but token-inefficient.
+
+### Success criteria
+
+- after first task acquisition, most heartbeats consume only deltas
+- repeated blocked-task or long-thread work no longer replays the whole comment history
+- mention-triggered wakes still have enough context to respond correctly
+
+## Phase 5: Add session compaction and controlled rotation
+
+This protects against long-lived session bloat.
+
+### Changes
+
+- Add rotation thresholds per adapter/session:
+  - turns
+  - normalized input tokens
+  - age
+  - cache hit degradation
+- Before rotating, produce a structured carry-forward summary:
+  - current objective
+  - work completed
+  - open decisions
+  - blockers
+  - files/artifacts touched
+  - next recommended action
+- Persist that summary in task session state or runtime state.
+- Start the next session with:
+  - bootstrap prompt
+  - compact carry-forward summary
+  - current wake trigger
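+
+A sketch of the threshold check and the carry-forward record, under the assumption that per-session statistics are already tracked. All type names, field names, and reason strings below are hypothetical, and no threshold values are recommended here.
+
+```typescript
+// Hypothetical per-session statistics.
+interface SessionStats {
+  turns: number;
+  normalizedInputTokens: number;
+  ageMs: number;
+  cacheHitRatio: number; // 0..1 over recent runs
+}
+
+// Hypothetical limits, configurable per adapter or session.
+interface RotationLimits {
+  maxTurns: number;
+  maxNormalizedInputTokens: number;
+  maxAgeMs: number;
+  minCacheHitRatio: number;
+}
+
+// Structured summary persisted before rotating; mirrors the list above.
+interface CarryForwardSummary {
+  currentObjective: string;
+  workCompleted: string[];
+  openDecisions: string[];
+  blockers: string[];
+  artifactsTouched: string[];
+  nextRecommendedAction: string;
+}
+
+// Returns every limit the session has crossed; an empty array means keep the session.
+function rotationReasons(stats: SessionStats, limits: RotationLimits): string[] {
+  const reasons: string[] = [];
+  if (stats.turns >= limits.maxTurns) reasons.push("turns");
+  if (stats.normalizedInputTokens >= limits.maxNormalizedInputTokens) reasons.push("input_tokens");
+  if (stats.ageMs >= limits.maxAgeMs) reasons.push("age");
+  if (stats.cacheHitRatio < limits.minCacheHitRatio) reasons.push("cache_hit_degradation");
+  return reasons;
+}
+```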
+
+### Why
+
+Even when reuse is desirable, some sessions become too expensive to keep alive indefinitely.
+
+### Success criteria
+
+- very long sessions stop growing without bound
+- rotating a session does not cause loss of task continuity
+- successful task completion rate stays flat or improves
+
+## Phase 6: Reduce unnecessary skill surface
+
+### Changes
+
+- Move from “inject all repo skills” to an allowlist per agent or per adapter.
+- Default local runtime skill set should likely be:
+  - `paperclip`
+- Add opt-in skills for specialized agents:
+  - `paperclip-create-agent`
+  - `para-memory-files`
+  - `create-agent-adapter`
+- Expose active skill set in agent config and run metadata.
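+
+The allowlist resolution could be as small as the sketch below. `DEFAULT_SKILLS` and `resolveSkillSet` are hypothetical names, and the skill identifiers are the directory names listed earlier in this plan.
+
+```typescript
+// Assumed default: only the core skill is injected unless an agent opts in.
+const DEFAULT_SKILLS: readonly string[] = ["paperclip"];
+
+// Returns the skills to inject, preserving the order of `available`.
+// Opt-in names that are not present in the repo are ignored.
+function resolveSkillSet(available: string[], optIn: string[] = []): string[] {
+  const wanted = new Set<string>([...DEFAULT_SKILLS, ...optIn]);
+  return available.filter((name) => wanted.has(name));
+}
+
+// Example usage with the four repo skills named in this plan.
+const repoSkills = [
+  "paperclip",
+  "create-agent-adapter",
+  "paperclip-create-agent",
+  "para-memory-files",
+];
+const generalist = resolveSkillSet(repoSkills);
+const memoryAgent = resolveSkillSet(repoSkills, ["para-memory-files"]);
+```
+
+With the sizes measured above, the default set would cover 17,441 of the 57,969 bytes of skill markdown.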
+
+### Why
+
+Most agents do not need adapter-authoring or memory-system skills on every run.
+
+### Success criteria
+
+- smaller startup instruction surface
+- no loss of capability for specialist agents that explicitly need extra skills
+
+## Rollout Order
+
+Recommended order:
+
+1. telemetry normalization
+2. timer-wake session reuse
+3. bootstrap prompt implementation
+4. heartbeat delta APIs + `paperclip` skill rewrite
+5. session compaction/rotation
+6. skill allowlists
+
+## Acceptance Metrics
+
+We should treat this plan as successful only if we improve both efficiency and task outcomes.
+
+Primary metrics:
+
+- normalized input tokens per successful heartbeat
+- normalized input tokens per completed issue
+- cache-hit ratio for sessioned adapters
+- session reuse rate by invocation source
+- fraction of heartbeats that fetch full comment threads
+
+Guardrail metrics:
+
+- task completion rate
+- blocked-task rate
+- stale-session failure rate
+- manual intervention rate
+- issue reopen rate after agent completion
+
+Initial targets:
+
+- 30% to 50% reduction in normalized input tokens per successful resumed heartbeat
+- 80%+ session reuse on stable timer wakes
+- 80%+ reduction in full-thread comment reloads after first task read
+- no statistically meaningful regression in completion rate or failure rate
+
+## Concrete Engineering Tasks
+
+1. Add normalized usage fields and migration support for run analytics.
+2. Patch sessioned adapter accounting to compute deltas from prior session totals.
+3. Change `shouldResetTaskSessionForWake(...)` so timer wakes do not reset by default.
+4. Implement `bootstrapPromptTemplate` end-to-end in adapter execution.
+5. Add compact heartbeat context and incremental comment APIs.
+6. Rewrite `skills/paperclip/SKILL.md` around delta-fetch behavior.
+7. Add session rotation with carry-forward summaries.
+8. Replace global skill injection with explicit allowlists.
+
+## Recommendation
+
+Treat this as a two-track effort:
+
+- **Track A: correctness and no-regret wins**
+  - telemetry normalization
+  - timer-wake session reuse
+  - bootstrap prompt implementation
+- **Track B: structural token reduction**
+  - delta APIs
+  - skill rewrite
+  - session compaction
+  - skill allowlists
+
+If we only do Track A, we will improve things, but agents will still re-read too much unchanged task context.
+
+If we only do Track B without fixing telemetry first, we will not be able to prove the gains cleanly.