paperclip/doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md
2026-03-13 09:40:43 -05:00

Token Optimization Plan

Date: 2026-03-13
Related discussion: https://github.com/paperclipai/paperclip/discussions/449

Goal

Reduce token consumption materially without reducing agent capability, control-plane visibility, or task completion quality.

This plan is based on:

  • the current V1 control-plane design
  • the current adapter and heartbeat implementation
  • the linked user discussion
  • local runtime data from the default Paperclip instance on 2026-03-13

Executive Summary

The discussion is directionally right about two things:

  1. We should preserve session and prompt-cache locality more aggressively.
  2. We should separate stable startup instructions from per-heartbeat dynamic context.

But that is not enough on its own.

After reviewing the code and local run data, the token problem appears to have four distinct causes:

  1. Measurement inflation on sessioned adapters. Some token counters, especially for codex_local, appear to be recorded as cumulative session totals instead of per-heartbeat deltas.
  2. Avoidable session resets. Task sessions are intentionally reset on timer wakes and manual wakes, which destroys cache locality for common heartbeat paths.
  3. Repeated context reacquisition. The paperclip skill tells agents to re-fetch assignments, issue details, ancestors, and full comment threads on every heartbeat. The API does not currently offer efficient delta-oriented alternatives.
  4. Large static instruction surfaces. Agent instruction files and globally injected skills are reintroduced at startup even when most of that content is unchanged and not needed for the current task.

The correct approach is:

  1. fix telemetry so we can trust the numbers
  2. preserve reuse where it is safe
  3. make context retrieval incremental
  4. add session compaction/rotation so long-lived sessions do not become progressively more expensive

Validated Findings

1. Token telemetry is at least partly overstated today

Observed from the local default instance:

  • heartbeat_runs: 11,360 runs between 2026-02-18 and 2026-03-13
  • summed usage_json.inputTokens: 2,272,142,368,952
  • summed usage_json.cachedInputTokens: 2,217,501,559,420

Those totals average roughly 200 million input tokens per run (2,272,142,368,952 / 11,360), which is not credible as true per-heartbeat usage for the observed prompt sizes.

Supporting evidence:

  • adapter.invoke.payload.prompt averages were small:
    • codex_local: ~193 chars average, 6,067 chars max
    • claude_local: ~160 chars average, 1,160 chars max
  • despite that, many codex_local runs report millions of input tokens
  • one reused Codex session in local data spans 3,607 runs and recorded inputTokens growing up to 1,155,283,166

Interpretation:

  • for sessioned adapters, especially Codex, we are likely storing usage reported by the runtime as a session total, not a per-run delta
  • this undermines trend reporting, optimization work, and customer trust

This does not mean there is no real token problem. It means we need a trustworthy baseline before we can judge optimization impact.

2. Timer wakes currently throw away reusable task sessions

In server/src/services/heartbeat.ts, shouldResetTaskSessionForWake(...) returns true for:

  • wakeReason === "issue_assigned"
  • wakeSource === "timer"
  • manual on-demand wakes

That means many normal heartbeats skip saved task-session resume even when the workspace is stable.

Local data supports the impact:

  • timer/system runs: 6,587 total
  • only 976 had a previous session
  • only 963 ended with the same session

So timer wakes are the largest heartbeat path, and only about 15% of them (976 of 6,587) even had a previous session to resume. The rest started without prior task state.

3. We repeatedly ask agents to reload the same task context

The paperclip skill currently tells agents to do this on essentially every heartbeat:

  • fetch assignments
  • fetch issue details
  • fetch ancestor chain
  • fetch full issue comments

Current API shape reinforces that pattern:

  • GET /api/issues/:id/comments returns the full thread
  • there is no since, cursor, digest, or summary endpoint for heartbeat consumption
  • GET /api/issues/:id returns full enriched issue context, not a minimal delta payload

This is safe but expensive. It forces the model to repeatedly consume unchanged information.

4. Static instruction payloads are not separated cleanly from dynamic heartbeat prompts

The user discussion suggested a bootstrap prompt. That is the right direction.

Current state:

  • the UI exposes bootstrapPromptTemplate
  • adapter execution paths do not currently use it
  • several adapters prepend instructionsFilePath content directly into the per-run prompt or system prompt

Result:

  • stable instructions are re-sent or re-applied in the same path as dynamic heartbeat content
  • we are not deliberately optimizing for provider prompt caching

5. We inject more skill surface than most agents need

Local adapters inject repo skills into runtime skill directories.

Current repo skill sizes:

  • skills/paperclip/SKILL.md: 17,441 bytes
  • .agents/skills/create-agent-adapter/SKILL.md: 31,832 bytes
  • skills/paperclip-create-agent/SKILL.md: 4,718 bytes
  • skills/para-memory-files/SKILL.md: 3,978 bytes

That is nearly 58 KB of skill markdown before any company-specific instructions.

Not all of that is necessarily loaded into model context every run, but it increases startup surface area and should be treated as a token budget concern.

Principles

We should optimize tokens under these rules:

  1. Do not lose functionality. Agents must still be able to resume work safely, understand why tasks exist, and act within governance rules.
  2. Prefer stable context over repeated context. Unchanged instructions should not be resent through the most expensive path.
  3. Prefer deltas over full reloads. Heartbeats should consume only what changed since the last useful run.
  4. Measure normalized deltas, not raw adapter claims, especially for sessioned CLIs.
  5. Keep escape hatches. Board/manual runs may still want a forced fresh session.

Plan

Phase 1: Make token telemetry trustworthy

This should happen first.

Changes

  • Store both:
    • raw adapter-reported usage
    • Paperclip-normalized per-run usage
  • For sessioned adapters, compute normalized deltas against prior usage for the same persisted session.
  • Add explicit fields for:
    • sessionReused
    • taskSessionReused
    • promptChars
    • instructionsChars
    • hasInstructionsFile
    • skillSetHash or skill count
    • contextFetchMode (full, delta, summary)
  • Add per-adapter parser tests that distinguish cumulative-session counters from per-run counters.
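The delta computation for sessioned adapters could look roughly like the sketch below. It is illustrative only: `UsageCounters`, `CounterMode`, `NormalizedUsage`, and `normalizeUsage` are hypothetical names, `outputTokens` is an assumed field, and the sketch assumes each adapter's counter mode is already known. It keeps the raw adapter value alongside the normalized one, and falls back to the raw value when a cumulative counter goes backwards instead of emitting a negative delta.

```typescript
// Sketch only: these names are hypothetical, not Paperclip's actual schema.
interface UsageCounters {
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number; // assumed field, not named in this plan
}

// How an adapter reports usage; assumed to be known per adapter.
type CounterMode = "cumulative_session" | "per_run";

interface NormalizedUsage {
  raw: UsageCounters;        // exactly what the adapter reported
  normalized: UsageCounters; // per-run estimate derived from raw values
  counterReset: boolean;     // true when a cumulative counter went backwards
}

function normalizeUsage(
  mode: CounterMode,
  current: UsageCounters,
  previousRawInSession: UsageCounters | null,
): NormalizedUsage {
  // Per-run counters and the first run of a session need no adjustment.
  if (mode === "per_run" || previousRawInSession === null) {
    return { raw: current, normalized: { ...current }, counterReset: false };
  }

  const prev = previousRawInSession;
  // A shrinking cumulative counter suggests the runtime restarted its total.
  const counterReset =
    current.inputTokens < prev.inputTokens ||
    current.cachedInputTokens < prev.cachedInputTokens ||
    current.outputTokens < prev.outputTokens;
  if (counterReset) {
    return { raw: current, normalized: { ...current }, counterReset: true };
  }

  return {
    raw: current,
    normalized: {
      inputTokens: current.inputTokens - prev.inputTokens,
      cachedInputTokens: current.cachedInputTokens - prev.cachedInputTokens,
      outputTokens: current.outputTokens - prev.outputTokens,
    },
    counterReset: false,
  };
}
```

The previous value passed in must be the prior raw usage for the same persisted session, not the prior normalized value. Otherwise the deltas drift.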

Why

Without this, we cannot tell whether a reduction came from a real optimization or a reporting artifact.

Success criteria

  • per-run token totals stop exploding on long-lived sessions
  • a resumed session's usage curve is believable and monotonic at the session level, but not double-counted at the run level
  • cost pages can show both raw and normalized numbers while we migrate

Phase 2: Preserve safe session reuse by default

This is the highest-leverage behavior change.

Changes

  • Stop resetting task sessions on ordinary timer wakes.
  • Keep resetting on:
    • explicit manual “fresh run” invocations
    • assignment changes
    • workspace mismatch
    • model mismatch / invalid resume errors
  • Add an explicit wake flag like forceFreshSession: true when the board wants a reset.
  • Record why a session was reused or reset in run metadata.
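One way to express the revised reset rules is a decision function that returns both the outcome and the reason, so the reason can be written to run metadata. This is a sketch under stated assumptions, not the current body of shouldResetTaskSessionForWake(...): `WakeInput`, `SavedTaskSession`, `SessionDecision`, and `decideTaskSession` are hypothetical names, and the sketch assumes the saved session records the workspace path and model it was created with. Invalid-resume errors surface at runtime from the adapter and would need separate handling.

```typescript
// Sketch only: these shapes are hypothetical, not the existing heartbeat types.
interface SavedTaskSession {
  workspacePath: string;
  model: string;
}

interface WakeInput {
  wakeSource: string;         // e.g. "timer"
  wakeReason: string | null;  // e.g. "issue_assigned"
  forceFreshSession: boolean; // explicit board/manual request for a reset
  savedSession: SavedTaskSession | null;
  workspacePath: string;      // workspace for this run
  model: string;              // model configured for this run
}

type SessionDecisionReason =
  | "no_saved_session"
  | "forced_fresh"
  | "assignment_changed"
  | "workspace_mismatch"
  | "model_mismatch"
  | "reused";

interface SessionDecision {
  resetSession: boolean;
  reason: SessionDecisionReason; // intended for run metadata
}

function decideTaskSession(wake: WakeInput): SessionDecision {
  const saved = wake.savedSession;
  if (saved === null) {
    return { resetSession: true, reason: "no_saved_session" };
  }
  if (wake.forceFreshSession) {
    return { resetSession: true, reason: "forced_fresh" };
  }
  if (wake.wakeReason === "issue_assigned") {
    return { resetSession: true, reason: "assignment_changed" };
  }
  if (saved.workspacePath !== wake.workspacePath) {
    return { resetSession: true, reason: "workspace_mismatch" };
  }
  if (saved.model !== wake.model) {
    return { resetSession: true, reason: "model_mismatch" };
  }
  // Ordinary timer wakes fall through to reuse.
  return { resetSession: false, reason: "reused" };
}
```

Note that wakeSource no longer participates in the decision at all. A timer wake resets only if one of the explicit conditions holds.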

Why

Timer wakes are the dominant heartbeat path. Resetting them destroys both session continuity and prompt cache reuse.

Success criteria

  • timer wakes resume the prior task session in the large majority of stable-workspace cases
  • no increase in stale-session failures
  • lower normalized input tokens per timer heartbeat

Phase 3: Separate static bootstrap context from per-heartbeat context

This is the right version of the discussion's bootstrap idea.

Changes

  • Implement bootstrapPromptTemplate in adapter execution paths.
  • Use it only when starting a fresh session, not on resumed sessions.
  • Keep promptTemplate intentionally small and stable:
    • who I am
    • what triggered this wake
    • which task/comment/approval to prioritize
  • Move long-lived setup text out of recurring per-run prompts where possible.
  • Add UI guidance and warnings when promptTemplate contains high-churn or large inline content.
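The split between bootstrap and per-heartbeat prompts might be sketched as follows. The `{{name}}` placeholder syntax, `renderTemplate`, `buildPromptParts`, and the `PromptParts` shape are assumptions made for illustration. Only bootstrapPromptTemplate and promptTemplate come from the existing configuration surface.

```typescript
// Sketch only: the placeholder syntax and helper names are assumptions.
interface AgentPromptConfig {
  bootstrapPromptTemplate: string | null;
  promptTemplate: string;
}

interface PromptParts {
  bootstrap: string | null; // sent only when a fresh session starts
  heartbeat: string;        // small, structurally stable, sent on every wake
}

// Replace {{name}} placeholders; unknown names are left untouched.
function renderTemplate(
  template: string,
  vars: ReadonlyMap<string, string>,
): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (placeholder: string, key: string) => vars.get(key) ?? placeholder,
  );
}

function buildPromptParts(
  config: AgentPromptConfig,
  vars: ReadonlyMap<string, string>,
  isFreshSession: boolean,
): PromptParts {
  const heartbeat = renderTemplate(config.promptTemplate, vars);
  // Resumed sessions already hold the bootstrap content, so skip it.
  const bootstrap =
    isFreshSession && config.bootstrapPromptTemplate !== null
      ? renderTemplate(config.bootstrapPromptTemplate, vars)
      : null;
  return { bootstrap, heartbeat };
}
```

Keeping the two parts separate in the return value lets each adapter decide where the bootstrap text goes (system prompt, first user turn, or a file), which matters for how provider prompt caching treats it.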

Why

Static instructions and dynamic wake context have different cache behavior and should be modeled separately.

Success criteria

  • fresh-session prompts can remain richer without inflating every resumed heartbeat
  • resumed prompts become short and structurally stable
  • cache hit rates improve for session-preserving adapters

Phase 4: Make issue/task context incremental

This is the biggest product change and likely the biggest real token saver after session reuse.

Changes

Add heartbeat-oriented endpoints and skill behavior:

  • GET /api/agents/me/inbox-lite
    • minimal assignment list
    • issue id, identifier, status, priority, updatedAt, lastExternalCommentAt
  • GET /api/issues/:id/heartbeat-context
    • compact issue state
    • parent-chain summary
    • latest execution summary
    • change markers
  • GET /api/issues/:id/comments?after=<cursor> or ?since=<timestamp>
    • return only new comments
  • optional GET /api/issues/:id/context-digest
    • server-generated compact summary for heartbeat use

Update the paperclip skill so the default pattern becomes:

  1. fetch compact inbox
  2. fetch compact task context
  3. fetch only new comments unless this is the first read, a mention-triggered wake, or a cache miss
  4. fetch full thread only on demand
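The comment-fetch rule in step 3 can be made explicit as a small planning function. This is a sketch: `CommentReadState`, `CommentFetchPlan`, and `planCommentFetch` are hypothetical names, and the `?after=<cursor>` parameter is the endpoint proposed in this phase, not something the API offers today. A null cursor stands in for both "first read" and "cache miss".

```typescript
// Sketch only: the ?after= parameter is proposed in this plan, not yet implemented.
type CommentFetchMode = "full" | "delta";

interface CommentReadState {
  issueId: string;
  lastSeenCursor: string | null; // null on first read or after a cache miss
  mentionTriggered: boolean;     // this wake was caused by a mention
}

interface CommentFetchPlan {
  mode: CommentFetchMode;
  path: string;
  why: "first_read_or_cache_miss" | "mention_triggered" | "incremental";
}

function planCommentFetch(state: CommentReadState): CommentFetchPlan {
  const base = `/api/issues/${encodeURIComponent(state.issueId)}/comments`;

  if (state.lastSeenCursor === null) {
    return { mode: "full", path: base, why: "first_read_or_cache_miss" };
  }
  if (state.mentionTriggered) {
    // Mentions may refer to older comments, so read the whole thread.
    return { mode: "full", path: base, why: "mention_triggered" };
  }
  return {
    mode: "delta",
    path: `${base}?after=${encodeURIComponent(state.lastSeenCursor)}`,
    why: "incremental",
  };
}
```

The `mode` value maps directly onto the contextFetchMode telemetry field from Phase 1, so the fraction of heartbeats doing full-thread reloads can be measured without extra instrumentation.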

Why

Today we are using full-fidelity board APIs as heartbeat APIs. That is convenient but token-inefficient.

Success criteria

  • after first task acquisition, most heartbeats consume only deltas
  • repeated blocked-task or long-thread work no longer replays the whole comment history
  • mention-triggered wakes still have enough context to respond correctly

Phase 5: Add session compaction and controlled rotation

This protects against long-lived session bloat.

Changes

  • Add rotation thresholds per adapter/session:
    • turns
    • normalized input tokens
    • age
    • cache hit degradation
  • Before rotating, produce a structured carry-forward summary:
    • current objective
    • work completed
    • open decisions
    • blockers
    • files/artifacts touched
    • next recommended action
  • Persist that summary in task session state or runtime state.
  • Start the next session with:
    • bootstrap prompt
    • compact carry-forward summary
    • current wake trigger
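A rotation check and carry-forward summary could be shaped roughly like this. All names (`RotationThresholds`, `SessionStats`, `rotationReason`, `CarryForwardSummary`, `renderCarryForward`) are hypothetical, and the plain-text rendering is one arbitrary choice. The fields mirror the threshold and summary lists above.

```typescript
// Sketch only: names and the rendering format are assumptions.
interface RotationThresholds {
  maxTurns: number;
  maxNormalizedInputTokens: number;
  maxAgeMs: number;
  minCacheHitRatio: number; // rotate when the recent ratio drops below this
}

interface SessionStats {
  turns: number;
  normalizedInputTokens: number;
  startedAtMs: number;
  recentCacheHitRatio: number; // 0..1 over a recent window of runs
}

type RotationReason = "turns" | "input_tokens" | "age" | "cache_hit_degradation";

// Returns the first threshold exceeded, or null if the session can continue.
function rotationReason(
  stats: SessionStats,
  limits: RotationThresholds,
  nowMs: number,
): RotationReason | null {
  if (stats.turns > limits.maxTurns) return "turns";
  if (stats.normalizedInputTokens > limits.maxNormalizedInputTokens) return "input_tokens";
  if (nowMs - stats.startedAtMs > limits.maxAgeMs) return "age";
  if (stats.recentCacheHitRatio < limits.minCacheHitRatio) return "cache_hit_degradation";
  return null;
}

interface CarryForwardSummary {
  currentObjective: string;
  workCompleted: string[];
  openDecisions: string[];
  blockers: string[];
  artifactsTouched: string[];
  nextRecommendedAction: string;
}

// Render the summary as compact text for the first prompt of the next session.
function renderCarryForward(summary: CarryForwardSummary): string {
  const section = (title: string, items: string[]): string =>
    items.length === 0
      ? `${title}: none`
      : `${title}:\n${items.map((item) => `- ${item}`).join("\n")}`;

  return [
    `Current objective: ${summary.currentObjective}`,
    section("Work completed", summary.workCompleted),
    section("Open decisions", summary.openDecisions),
    section("Blockers", summary.blockers),
    section("Files/artifacts touched", summary.artifactsTouched),
    `Next recommended action: ${summary.nextRecommendedAction}`,
  ].join("\n");
}
```

Returning the reason instead of a boolean makes it possible to record why each rotation happened, which helps when tuning thresholds per adapter.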

Why

Even when reuse is desirable, some sessions become too expensive to keep alive indefinitely.

Success criteria

  • very long sessions stop growing without bound
  • rotating a session does not cause loss of task continuity
  • successful task completion rate stays flat or improves

Phase 6: Reduce unnecessary skill surface

Changes

  • Move from “inject all repo skills” to an allowlist per agent or per adapter.
  • Default local runtime skill set should likely be:
    • paperclip
  • Add opt-in skills for specialized agents:
    • paperclip-create-agent
    • para-memory-files
    • create-agent-adapter
  • Expose active skill set in agent config and run metadata.
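Resolving an agent's active skill set from a default plus opt-ins might look like the sketch below. `DEFAULT_SKILLS`, `SkillResolution`, and `resolveSkillSet` are hypothetical names. The skill identifiers are the ones listed above. Unknown opt-ins are reported back instead of being silently dropped, so a misconfigured agent is visible in run metadata.

```typescript
// Sketch only: names are hypothetical; the default mirrors the proposal above.
const DEFAULT_SKILLS: readonly string[] = ["paperclip"];

interface SkillResolution {
  active: string[];  // skills to inject, defaults first, no duplicates
  unknown: string[]; // opted-in names that do not exist in the repo
}

function resolveSkillSet(
  available: readonly string[],
  optIn: readonly string[],
): SkillResolution {
  const availableSet = new Set(available);
  const active: string[] = [];
  const unknown: string[] = [];

  for (const skill of [...DEFAULT_SKILLS, ...optIn]) {
    if (!availableSet.has(skill)) {
      if (unknown.indexOf(skill) === -1) unknown.push(skill);
    } else if (active.indexOf(skill) === -1) {
      active.push(skill);
    }
  }
  return { active, unknown };
}
```

For example, an agent that opts into `para-memory-files` would resolve to `paperclip` plus `para-memory-files`, and never receive `create-agent-adapter` unless it asks for it.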

Why

Most agents do not need adapter-authoring or memory-system skills on every run.

Success criteria

  • smaller startup instruction surface
  • no loss of capability for specialist agents that explicitly need extra skills

Rollout Order

Recommended order:

  1. telemetry normalization
  2. timer-wake session reuse
  3. bootstrap prompt implementation
  4. heartbeat delta APIs + paperclip skill rewrite
  5. session compaction/rotation
  6. skill allowlists

Acceptance Metrics

We should treat this plan as successful only if we improve both efficiency and task outcomes.

Primary metrics:

  • normalized input tokens per successful heartbeat
  • normalized input tokens per completed issue
  • cache-hit ratio for sessioned adapters
  • session reuse rate by invocation source
  • fraction of heartbeats that fetch full comment threads

Guardrail metrics:

  • task completion rate
  • blocked-task rate
  • stale-session failure rate
  • manual intervention rate
  • issue reopen rate after agent completion

Initial targets:

  • 30% to 50% reduction in normalized input tokens per successful resumed heartbeat
  • 80%+ session reuse on stable timer wakes
  • 80%+ reduction in full-thread comment reloads after first task read
  • no statistically meaningful regression in completion rate or failure rate
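The three numeric targets can be checked mechanically once normalized telemetry exists. The sketch below is illustrative: `TargetInputs`, `reductionPercent`, `ratePercent`, and `checkInitialTargets` are hypothetical names, and it uses the lower bound of the 30% to 50% range as the pass line. The regression guardrail needs a proper statistical comparison and is deliberately left out.

```typescript
// Sketch only: names are hypothetical; thresholds come from the targets above.
interface TargetInputs {
  baselineTokensPerResumedHeartbeat: number; // normalized, before changes
  currentTokensPerResumedHeartbeat: number;  // normalized, after changes
  stableTimerWakes: number;
  stableTimerWakesReused: number;
  baselineFullThreadReloads: number; // after first task read
  currentFullThreadReloads: number;
}

// Percentage drop from baseline to current; null when baseline is not positive.
function reductionPercent(baseline: number, current: number): number | null {
  if (baseline <= 0) return null;
  return ((baseline - current) * 100) / baseline;
}

// part / whole as a percentage; null when whole is not positive.
function ratePercent(part: number, whole: number): number | null {
  if (whole <= 0) return null;
  return (part * 100) / whole;
}

interface TargetReport {
  tokenReductionOk: boolean;
  sessionReuseOk: boolean;
  fullThreadReductionOk: boolean;
}

function checkInitialTargets(input: TargetInputs): TargetReport {
  const tokenReduction = reductionPercent(
    input.baselineTokensPerResumedHeartbeat,
    input.currentTokensPerResumedHeartbeat,
  );
  const reuseRate = ratePercent(input.stableTimerWakesReused, input.stableTimerWakes);
  const reloadReduction = reductionPercent(
    input.baselineFullThreadReloads,
    input.currentFullThreadReloads,
  );
  return {
    tokenReductionOk: tokenReduction !== null && tokenReduction >= 30,
    sessionReuseOk: reuseRate !== null && reuseRate >= 80,
    fullThreadReductionOk: reloadReduction !== null && reloadReduction >= 80,
  };
}
```

A missing baseline yields a failed check, not a vacuous pass, which matches the plan's position that gains must be provable against trustworthy numbers.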

Concrete Engineering Tasks

  1. Add normalized usage fields and migration support for run analytics.
  2. Patch sessioned adapter accounting to compute deltas from prior session totals.
  3. Change shouldResetTaskSessionForWake(...) so timer wakes do not reset by default.
  4. Implement bootstrapPromptTemplate end-to-end in adapter execution.
  5. Add compact heartbeat context and incremental comment APIs.
  6. Rewrite skills/paperclip/SKILL.md around delta-fetch behavior.
  7. Add session rotation with carry-forward summaries.
  8. Replace global skill injection with explicit allowlists.

Recommendation

Treat this as a two-track effort:

  • Track A: correctness and no-regret wins
    • telemetry normalization
    • timer-wake session reuse
    • bootstrap prompt implementation
  • Track B: structural token reduction
    • delta APIs
    • skill rewrite
    • session compaction
    • skill allowlists

If we only do Track A, we will improve things, but agents will still re-read too much unchanged task context.

If we only do Track B without fixing telemetry first, we will not be able to prove the gains cleanly.