Token Optimization Plan
Date: 2026-03-13
Related discussion: https://github.com/paperclipai/paperclip/discussions/449
Goal
Reduce token consumption materially without reducing agent capability, control-plane visibility, or task completion quality.
This plan is based on:
- the current V1 control-plane design
- the current adapter and heartbeat implementation
- the linked user discussion
- local runtime data from the default Paperclip instance on 2026-03-13
Executive Summary
The discussion is directionally right about two things:
- We should preserve session and prompt-cache locality more aggressively.
- We should separate stable startup instructions from per-heartbeat dynamic context.
But that is not enough on its own.
After reviewing the code and local run data, the token problem appears to have four distinct causes:
- Measurement inflation on sessioned adapters. Some token counters, especially for codex_local, appear to be recorded as cumulative session totals instead of per-heartbeat deltas.
- Avoidable session resets. Task sessions are intentionally reset on timer wakes and manual wakes, which destroys cache locality for common heartbeat paths.
- Repeated context reacquisition. The paperclip skill tells agents to re-fetch assignments, issue details, ancestors, and full comment threads on every heartbeat. The API does not currently offer efficient delta-oriented alternatives.
- Large static instruction surfaces. Agent instruction files and globally injected skills are reintroduced at startup even when most of that content is unchanged and not needed for the current task.
The correct approach is:
- fix telemetry so we can trust the numbers
- preserve reuse where it is safe
- make context retrieval incremental
- add session compaction/rotation so long-lived sessions do not become progressively more expensive
Validated Findings
1. Token telemetry is at least partly overstated today
Observed from the local default instance:
- heartbeat_runs: 11,360 runs between 2026-02-18 and 2026-03-13
- summed usage_json.inputTokens: 2,272,142,368,952
- summed usage_json.cachedInputTokens: 2,217,501,559,420
Those totals are not credible as true per-heartbeat usage for the observed prompt sizes.
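As a quick sanity check, the arithmetic behind that judgment can be sketched as follows. The variable names are illustrative; only the three figures come from the local data above.

```typescript
// Figures copied from the local default instance (2026-02-18 to 2026-03-13).
const heartbeatRunCount = 11_360;
const summedInputTokens = 2_272_142_368_952;
const summedCachedInputTokens = 2_217_501_559_420;

// Mean input tokens recorded per heartbeat run.
const meanInputTokensPerRun = summedInputTokens / heartbeatRunCount;

// Share of recorded input tokens that were reported as cached.
const cachedInputShare = summedCachedInputTokens / summedInputTokens;

console.log(`mean input tokens per run: ${Math.round(meanInputTokensPerRun)}`);
console.log(`cached share: ${(cachedInputShare * 100).toFixed(1)}%`);
```

The recorded mean works out to roughly 200 million input tokens per run, with about 97.6% of recorded input reported as cached.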
Supporting evidence:
- adapter.invoke.payload.prompt averages were small:
  - codex_local: ~193 chars average, 6,067 chars max
  - claude_local: ~160 chars average, 1,160 chars max
- despite that, many codex_local runs report millions of input tokens
- one reused Codex session in local data spans 3,607 runs and recorded inputTokens growing up to 1,155,283,166
Interpretation:
- for sessioned adapters, especially Codex, we are likely storing usage reported by the runtime as a session total, not a per-run delta
- this makes trend reporting, optimization work, and customer trust worse
This does not mean there is no real token problem. It means we need a trustworthy baseline before we can judge optimization impact.
2. Timer wakes currently throw away reusable task sessions
In server/src/services/heartbeat.ts, shouldResetTaskSessionForWake(...) returns true for:
- wakeReason === "issue_assigned"
- wakeSource === "timer"
- manual on-demand wakes
That means many normal heartbeats skip saved task-session resume even when the workspace is stable.
Local data supports the impact:
- timer/system runs: 6,587 total
- only 976 had a previous session
- only 963 ended with the same session
So timer wakes are the largest heartbeat path and are mostly not resuming prior task state.
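To put those counts in proportion, here is a small sketch of the implied reuse rates. The helper name toPercent is illustrative only.

```typescript
// Counts copied from the local data above.
const timerSystemRuns = 6_587;
const runsWithPreviousSession = 976;
const runsEndingWithSameSession = 963;

// Formats a ratio as a percentage with one decimal place.
function toPercent(part: number, whole: number): string {
  return ((part / whole) * 100).toFixed(1) + "%";
}

console.log(toPercent(runsWithPreviousSession, timerSystemRuns));   // "14.8%"
console.log(toPercent(runsEndingWithSameSession, timerSystemRuns)); // "14.6%"
```

Under these numbers, fewer than 15% of timer/system runs started from a previous session.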
3. We repeatedly ask agents to reload the same task context
The paperclip skill currently tells agents to do this on essentially every heartbeat:
- fetch assignments
- fetch issue details
- fetch ancestor chain
- fetch full issue comments
Current API shape reinforces that pattern:
- GET /api/issues/:id/comments returns the full thread
- there is no since, cursor, digest, or summary endpoint for heartbeat consumption
- GET /api/issues/:id returns full enriched issue context, not a minimal delta payload
This is safe but expensive. It forces the model to repeatedly consume unchanged information.
4. Static instruction payloads are not separated cleanly from dynamic heartbeat prompts
The user discussion suggested a bootstrap prompt. That is the right direction.
Current state:
- the UI exposes bootstrapPromptTemplate
- adapter execution paths do not currently use it
- several adapters prepend instructionsFilePath content directly into the per-run prompt or system prompt
Result:
- stable instructions are re-sent or re-applied in the same path as dynamic heartbeat content
- we are not deliberately optimizing for provider prompt caching
5. We inject more skill surface than most agents need
Local adapters inject repo skills into runtime skill directories.
Current repo skill sizes:
- skills/paperclip/SKILL.md: 17,441 bytes
- .agents/skills/create-agent-adapter/SKILL.md: 31,832 bytes
- skills/paperclip-create-agent/SKILL.md: 4,718 bytes
- skills/para-memory-files/SKILL.md: 3,978 bytes
That is nearly 58 KB of skill markdown before any company-specific instructions.
Not all of that is necessarily loaded into model context every run, but it increases startup surface area and should be treated as a token budget concern.
Principles
We should optimize tokens under these rules:
- Do not lose functionality. Agents must still be able to resume work safely, understand why tasks exist, and act within governance rules.
- Prefer stable context over repeated context. Unchanged instructions should not be resent through the most expensive path.
- Prefer deltas over full reloads. Heartbeats should consume only what changed since the last useful run.
- Measure normalized deltas, not raw adapter claims. Especially for sessioned CLIs.
- Keep escape hatches. Board/manual runs may still want a forced fresh session.
Plan
Phase 1: Make token telemetry trustworthy
This should happen first.
Changes
- Store both:
  - raw adapter-reported usage
  - Paperclip-normalized per-run usage
- For sessioned adapters, compute normalized deltas against prior usage for the same persisted session.
- Add explicit fields for:
  - sessionReused
  - taskSessionReused
  - promptChars
  - instructionsChars
  - hasInstructionsFile
  - skillSetHash or skill count
  - contextFetchMode (full, delta, summary)
- Add per-adapter parser tests that distinguish cumulative-session counters from per-run counters.
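A minimal sketch of the delta normalization, assuming each adapter can be classified by how its runtime reports usage. The names RawUsage, CounterKind, and normalizeRunUsage are hypothetical and do not exist in the codebase. The counter-reset fallback is also an assumption.

```typescript
// Only the two counters named in this document are modeled here.
interface RawUsage {
  inputTokens: number;
  cachedInputTokens: number;
}

// Assumption: adapters are tagged by whether they report session totals or per-run values.
type CounterKind = "cumulative_session" | "per_run";

// Hypothetical helper: turns raw adapter-reported usage into per-run usage.
function normalizeRunUsage(
  current: RawUsage,
  previousInSession: RawUsage | null,
  kind: CounterKind,
): RawUsage {
  // Per-run counters, and the first run of a session, are already deltas.
  if (kind === "per_run" || previousInSession === null) {
    return { ...current };
  }
  // If a counter went backwards, assume the runtime restarted its count
  // and treat the current value as the delta.
  const delta = (now: number, before: number): number =>
    now >= before ? now - before : now;
  return {
    inputTokens: delta(current.inputTokens, previousInSession.inputTokens),
    cachedInputTokens: delta(
      current.cachedInputTokens,
      previousInSession.cachedInputTokens,
    ),
  };
}

// Usage: the second run of a reused session that reports cumulative totals.
const normalizedSecondRun = normalizeRunUsage(
  { inputTokens: 1_500, cachedInputTokens: 1_200 },
  { inputTokens: 1_000, cachedInputTokens: 900 },
  "cumulative_session",
);
console.log(normalizedSecondRun); // { inputTokens: 500, cachedInputTokens: 300 }
```

Storing the raw value alongside the normalized one keeps the migration reversible if an adapter turns out to be misclassified.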
Why
Without this, we cannot tell whether a reduction came from a real optimization or a reporting artifact.
Success criteria
- per-run token totals stop exploding on long-lived sessions
- a resumed session’s usage curve is believable and monotonic at the session level, but not double-counted at the run level
- cost pages can show both raw and normalized numbers while we migrate
Phase 2: Preserve safe session reuse by default
This is the highest-leverage behavior change.
Changes
- Stop resetting task sessions on ordinary timer wakes.
- Keep resetting on:
  - explicit manual “fresh run” invocations
  - assignment changes
  - workspace mismatch
  - model mismatch / invalid resume errors
- Add an explicit wake flag like forceFreshSession: true when the board wants a reset.
- Record why a session was reused or reset in run metadata.
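The revised decision could look roughly like the sketch below. It is not the current shouldResetTaskSessionForWake(...) implementation. The WakeInput shape, the workspaceMatches and modelMatches inputs, and the reason strings are all assumptions made for illustration.

```typescript
// Hypothetical input shape; only wakeSource, wakeReason, and the proposed
// forceFreshSession flag are named in this plan.
interface WakeInput {
  wakeSource: string; // e.g. "timer"
  wakeReason?: string; // e.g. "issue_assigned"
  forceFreshSession?: boolean; // proposed explicit reset flag
  workspaceMatches: boolean; // assumption: saved session workspace equals current
  modelMatches: boolean; // assumption: saved session model equals current
}

interface SessionDecision {
  reset: boolean;
  reason: string; // recorded in run metadata
}

// Decides whether to resume the saved task session, and records why.
function decideTaskSession(wake: WakeInput): SessionDecision {
  if (wake.forceFreshSession === true) {
    return { reset: true, reason: "forced_fresh_session" };
  }
  if (wake.wakeReason === "issue_assigned") {
    return { reset: true, reason: "assignment_changed" };
  }
  if (!wake.workspaceMatches) {
    return { reset: true, reason: "workspace_mismatch" };
  }
  if (!wake.modelMatches) {
    return { reset: true, reason: "model_mismatch" };
  }
  // Ordinary timer wakes fall through to reuse. Invalid resume errors can
  // only be detected at execution time and are not modeled here.
  return { reset: false, reason: "reused" };
}

console.log(
  decideTaskSession({ wakeSource: "timer", workspaceMatches: true, modelMatches: true }),
); // { reset: false, reason: "reused" }
```

Returning the reason together with the decision means the run-metadata requirement above is met without a second code path.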
Why
Timer wakes are the dominant heartbeat path. Resetting them destroys both session continuity and prompt cache reuse.
Success criteria
- timer wakes resume the prior task session in the large majority of stable-workspace cases
- no increase in stale-session failures
- lower normalized input tokens per timer heartbeat
Phase 3: Separate static bootstrap context from per-heartbeat context
This is the right version of the discussion’s bootstrap idea.
Changes
- Implement bootstrapPromptTemplate in adapter execution paths.
- Use it only when starting a fresh session, not on resumed sessions.
- Keep promptTemplate intentionally small and stable:
  - who I am
  - what triggered this wake
  - which task/comment/approval to prioritize
- Move long-lived setup text out of recurring per-run prompts where possible.
- Add UI guidance and warnings when promptTemplate contains high-churn or large inline content.
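One way the fresh-versus-resumed split might be wired is sketched below. The {{name}} placeholder syntax and the helper names are assumptions, not the adapters' actual templating.

```typescript
interface AgentPromptConfig {
  bootstrapPromptTemplate?: string; // sent only when a session starts
  promptTemplate: string; // short per-heartbeat prompt
}

// Assumed placeholder syntax: {{name}}. Unknown placeholders are left as-is.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) => {
    const value = vars[key];
    return value === undefined ? match : value;
  });
}

// Builds the prompt for one run: bootstrap text is included only on fresh sessions.
function buildRunPrompt(
  config: AgentPromptConfig,
  isFreshSession: boolean,
  vars: Record<string, string>,
): string {
  const parts: string[] = [];
  if (isFreshSession && config.bootstrapPromptTemplate) {
    parts.push(renderTemplate(config.bootstrapPromptTemplate, vars));
  }
  parts.push(renderTemplate(config.promptTemplate, vars));
  return parts.join("\n\n");
}

const exampleConfig: AgentPromptConfig = {
  bootstrapPromptTemplate: "You are {{agentName}}. Follow the company instructions.",
  promptTemplate: "Wake reason: {{wakeReason}}. Prioritize issue {{issueId}}.",
};
const exampleVars = { agentName: "builder", wakeReason: "timer", issueId: "ABC-1" };

console.log(buildRunPrompt(exampleConfig, true, exampleVars)); // bootstrap + heartbeat prompt
console.log(buildRunPrompt(exampleConfig, false, exampleVars)); // heartbeat prompt only
```

Keeping the resumed prompt free of bootstrap text is what makes it short and structurally stable from one heartbeat to the next.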
Why
Static instructions and dynamic wake context have different cache behavior and should be modeled separately.
Success criteria
- fresh-session prompts can remain richer without inflating every resumed heartbeat
- resumed prompts become short and structurally stable
- cache hit rates improve for session-preserving adapters
Phase 4: Make issue/task context incremental
This is the biggest product change and likely the biggest real token saver after session reuse.
Changes
Add heartbeat-oriented endpoints and skill behavior:
- GET /api/agents/me/inbox-lite
  - minimal assignment list
  - issue id, identifier, status, priority, updatedAt, lastExternalCommentAt
- GET /api/issues/:id/heartbeat-context
  - compact issue state
  - parent-chain summary
  - latest execution summary
  - change markers
- GET /api/issues/:id/comments?after=<cursor> or ?since=<timestamp>
  - return only new comments
- optional GET /api/issues/:id/context-digest
  - server-generated compact summary for heartbeat use
Update the paperclip skill so the default pattern becomes:
- fetch compact inbox
- fetch compact task context
- fetch only new comments unless this is the first read, a mention-triggered wake, or a cache miss
- fetch full thread only on demand
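The default pattern above can be sketched as a small decision function plus a server-side filter for the proposed ?since=<timestamp> parameter. The record shape, state fields, and function names are hypothetical. The sketch also assumes ISO 8601 timestamps.

```typescript
// Hypothetical comment shape for illustration.
interface IssueCommentRecord {
  id: string;
  createdAt: string; // ISO 8601 timestamp (assumption)
  body: string;
}

// Hypothetical per-task state an agent would keep between heartbeats.
interface CommentReadState {
  lastSeenCommentAt: string | null; // null means the thread was never read
  mentionTriggered: boolean;
  cacheMiss: boolean;
}

type CommentFetchPlan = { mode: "full" } | { mode: "delta"; since: string };

// Full thread on first read, mention-triggered wake, or cache miss; otherwise deltas.
function planCommentFetch(state: CommentReadState): CommentFetchPlan {
  if (state.lastSeenCommentAt === null || state.mentionTriggered || state.cacheMiss) {
    return { mode: "full" };
  }
  return { mode: "delta", since: state.lastSeenCommentAt };
}

// What a ?since=<timestamp> filter could do on the server side.
function commentsSince(
  comments: IssueCommentRecord[],
  since: string,
): IssueCommentRecord[] {
  const cutoff = Date.parse(since);
  return comments.filter((comment) => Date.parse(comment.createdAt) > cutoff);
}

const exampleThread: IssueCommentRecord[] = [
  { id: "c1", createdAt: "2026-03-12T09:00:00Z", body: "Initial report" },
  { id: "c2", createdAt: "2026-03-13T10:30:00Z", body: "New details" },
];
console.log(commentsSince(exampleThread, "2026-03-12T12:00:00Z").map((c) => c.id)); // ["c2"]
```

A cursor-based variant (?after=<cursor>) would avoid edge cases with equal timestamps, at the cost of the server having to issue opaque cursors.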
Why
Today we are using full-fidelity board APIs as heartbeat APIs. That is convenient but token-inefficient.
Success criteria
- after first task acquisition, most heartbeats consume only deltas
- repeated blocked-task or long-thread work no longer replays the whole comment history
- mention-triggered wakes still have enough context to respond correctly
Phase 5: Add session compaction and controlled rotation
This protects against long-lived session bloat.
Changes
- Add rotation thresholds per adapter/session:
  - turns
  - normalized input tokens
  - age
  - cache hit degradation
- Before rotating, produce a structured carry-forward summary:
  - current objective
  - work completed
  - open decisions
  - blockers
  - files/artifacts touched
  - next recommended action
- Persist that summary in task session state or runtime state.
- Start the next session with:
  - bootstrap prompt
  - compact carry-forward summary
  - current wake trigger
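A rough sketch of the threshold check and the carry-forward prompt follows. All type names, field names, and threshold values are placeholders chosen for illustration, not tuned recommendations.

```typescript
interface SessionStats {
  turns: number;
  normalizedInputTokens: number;
  ageMs: number;
  cacheHitRatio: number; // 0..1 over recent runs (assumption)
}

interface RotationThresholds {
  maxTurns: number;
  maxNormalizedInputTokens: number;
  maxAgeMs: number;
  minCacheHitRatio: number;
}

interface CarryForwardSummary {
  currentObjective: string;
  workCompleted: string[];
  openDecisions: string[];
  blockers: string[];
  artifactsTouched: string[];
  nextRecommendedAction: string;
}

// Returns every threshold that was crossed; an empty list means keep the session.
function rotationReasons(stats: SessionStats, limits: RotationThresholds): string[] {
  const reasons: string[] = [];
  if (stats.turns >= limits.maxTurns) reasons.push("turns");
  if (stats.normalizedInputTokens >= limits.maxNormalizedInputTokens) {
    reasons.push("normalized_input_tokens");
  }
  if (stats.ageMs >= limits.maxAgeMs) reasons.push("age");
  if (stats.cacheHitRatio < limits.minCacheHitRatio) reasons.push("cache_hit_degradation");
  return reasons;
}

// Assembles the first prompt of the rotated session in the order listed above.
function buildRotatedSessionPrompt(
  bootstrapPrompt: string,
  summary: CarryForwardSummary,
  wakeTrigger: string,
): string {
  const list = (items: string[]): string => (items.length > 0 ? items.join("; ") : "none");
  return [
    bootstrapPrompt,
    "Carry-forward summary:",
    `- Objective: ${summary.currentObjective}`,
    `- Completed: ${list(summary.workCompleted)}`,
    `- Open decisions: ${list(summary.openDecisions)}`,
    `- Blockers: ${list(summary.blockers)}`,
    `- Artifacts touched: ${list(summary.artifactsTouched)}`,
    `- Next action: ${summary.nextRecommendedAction}`,
    `Wake trigger: ${wakeTrigger}`,
  ].join("\n");
}

// Placeholder thresholds for the example only.
const exampleLimits: RotationThresholds = {
  maxTurns: 200,
  maxNormalizedInputTokens: 2_000_000,
  maxAgeMs: 7 * 24 * 60 * 60 * 1000,
  minCacheHitRatio: 0.5,
};
console.log(
  rotationReasons(
    { turns: 250, normalizedInputTokens: 10_000, ageMs: 1_000, cacheHitRatio: 0.9 },
    exampleLimits,
  ),
); // ["turns"]
```

Recording which threshold triggered a rotation should make it easier to tune the limits per adapter later.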
Why
Even when reuse is desirable, some sessions become too expensive to keep alive indefinitely.
Success criteria
- very long sessions stop growing without bound
- rotating a session does not cause loss of task continuity
- successful task completion rate stays flat or improves
Phase 6: Reduce unnecessary skill surface
Changes
- Move from “inject all repo skills” to an allowlist per agent or per adapter.
- Default local runtime skill set should likely be:
  - paperclip
- Add opt-in skills for specialized agents:
  - paperclip-create-agent
  - para-memory-files
  - create-agent-adapter
- Expose active skill set in agent config and run metadata.
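A minimal sketch of allowlist resolution, assuming skills are identified by their directory names. DEFAULT_SKILL_ALLOWLIST and resolveActiveSkills are hypothetical names.

```typescript
// Proposed default: only the core paperclip skill.
const DEFAULT_SKILL_ALLOWLIST: readonly string[] = ["paperclip"];

// Returns the repo skills to inject for one agent: defaults plus explicit opt-ins.
// Opt-ins that do not exist in the repo are ignored.
function resolveActiveSkills(
  repoSkills: readonly string[],
  optInSkills: readonly string[] = [],
): string[] {
  const allowed = new Set<string>([...DEFAULT_SKILL_ALLOWLIST, ...optInSkills]);
  return repoSkills.filter((skill) => allowed.has(skill));
}

const repoSkillNames = [
  "paperclip",
  "paperclip-create-agent",
  "para-memory-files",
  "create-agent-adapter",
];

console.log(resolveActiveSkills(repoSkillNames)); // ["paperclip"]
console.log(resolveActiveSkills(repoSkillNames, ["create-agent-adapter"]));
// ["paperclip", "create-agent-adapter"]
```

The resolved list is also what would be exposed in agent config and run metadata.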
Why
Most agents do not need adapter-authoring or memory-system skills on every run.
Success criteria
- smaller startup instruction surface
- no loss of capability for specialist agents that explicitly need extra skills
Rollout Order
Recommended order:
- telemetry normalization
- timer-wake session reuse
- bootstrap prompt implementation
- heartbeat delta APIs + paperclip skill rewrite
- session compaction/rotation
- skill allowlists
Acceptance Metrics
We should treat this plan as successful only if we improve both efficiency and task outcomes.
Primary metrics:
- normalized input tokens per successful heartbeat
- normalized input tokens per completed issue
- cache-hit ratio for sessioned adapters
- session reuse rate by invocation source
- fraction of heartbeats that fetch full comment threads
Guardrail metrics:
- task completion rate
- blocked-task rate
- stale-session failure rate
- manual intervention rate
- issue reopen rate after agent completion
Initial targets:
- 30% to 50% reduction in normalized input tokens per successful resumed heartbeat
- 80%+ session reuse on stable timer wakes
- 80%+ reduction in full-thread comment reloads after first task read
- no statistically meaningful regression in completion rate or failure rate
Concrete Engineering Tasks
- Add normalized usage fields and migration support for run analytics.
- Patch sessioned adapter accounting to compute deltas from prior session totals.
- Change shouldResetTaskSessionForWake(...) so timer wakes do not reset by default.
- Implement bootstrapPromptTemplate end-to-end in adapter execution.
- Add compact heartbeat context and incremental comment APIs.
- Rewrite skills/paperclip/SKILL.md around delta-fetch behavior.
- Add session rotation with carry-forward summaries.
- Replace global skill injection with explicit allowlists.
Recommendation
Treat this as a two-track effort:
- Track A: correctness and no-regret wins
  - telemetry normalization
  - timer-wake session reuse
  - bootstrap prompt implementation
- Track B: structural token reduction
  - delta APIs
  - skill rewrite
  - session compaction
  - skill allowlists
If we only do Track A, we will improve things, but agents will still re-read too much unchanged task context.
If we only do Track B without fixing telemetry first, we will not be able to prove the gains cleanly.