Merge remote-tracking branch 'public-gh/master' into paperclip-subissues

* public-gh/master: (55 commits)
  fix(issue-documents): address greptile review
  Update packages/shared/src/validators/issue.ts
  feat(ui): add issue document copy and download actions
  fix(ui): unify new issue upload action
  feat(ui): stage issue files before create
  feat(ui): handle issue document edit conflicts
  fix(ui): refresh issue documents from live events
  feat(ui): deep link issue documents
  fix(ui): streamline issue document chrome
  fix(ui): collapse empty document and attachment states
  fix(ui): simplify document card body layout
  fix(issues): address document review comments
  feat(issues): add issue documents and inline editing
  docs: add agent evals framework plan
  fix(cli): quote env values with special characters
  Fix worktree seed source selection
  fix: address greptile follow-up
  docs: add paperclip skill tightening plan
  fix: isolate codex home in worktrees
  Add worktree UI branding
  ...

# Conflicts:
#	packages/db/src/migrations/meta/0028_snapshot.json
#	packages/db/src/migrations/meta/_journal.json
#	packages/shared/src/index.ts
#	server/src/routes/issues.ts
#	ui/src/api/issues.ts
#	ui/src/components/NewIssueDialog.tsx
#	ui/src/pages/IssueDetail.tsx
This commit is contained in:
Dotta
2026-03-14 12:24:40 -05:00
136 changed files with 17867 additions and 1511 deletions

View File

@@ -0,0 +1,643 @@
# Humans and Permissions Implementation (V1)
Status: Draft
Date: 2026-02-21
Owners: Server + UI + CLI + DB + Shared
Companion plan: `doc/plan/humans-and-permissions.md`
## 1. Document role
This document is the engineering implementation contract for the humans-and-permissions plan.
It translates product decisions into concrete schema, API, middleware, UI, CLI, and test work.
If this document conflicts with prior exploratory notes, this document wins for V1 execution.
## 2. Locked V1 decisions
1. Two deployment modes remain:
- `local_trusted`
- `cloud_hosted`
2. `local_trusted`:
- no login UX
- implicit local instance admin actor
- loopback-only server binding
- full admin/settings/invite/approval capabilities available locally
3. `cloud_hosted`:
- Better Auth for humans
- email/password only
- no email verification requirement in V1
4. Permissions:
- one shared authorization system for humans and agents
- normalized grants table (`principal_permission_grants`)
- no separate “agent permissions engine”
5. Invites:
- copy-link only (no outbound email sending in V1)
- unified `company_join` link that supports human or agent path
- acceptance creates `pending_approval` join request
- no access until admin approval
6. Join review metadata:
- source IP required
- no GeoIP/country lookup in V1
7. Agent API keys:
- indefinite by default
- hash at rest
- display once on claim
- revoke/regenerate supported
8. Local ingress:
- public/untrusted ingress is out of scope for V1
- no `--dangerous-agent-ingress` in V1
## 3. Current baseline and delta
Current baseline (repo as of this doc):
- server actor model defaults to `board` in `server/src/middleware/auth.ts`
- authorization is mostly `assertBoard` + company check in `server/src/routes/authz.ts`
- no human auth/session tables in local schema
- no principal membership or grants tables
- no invite or join-request lifecycle
Required delta:
- move from board-vs-agent authz to principal-based authz
- add Better Auth integration in cloud mode
- add membership/grants/invite/join-request persistence
- add approval inbox signals and actions
- preserve local no-login UX without weakening cloud security
## 4. Architecture
## 4.1 Deployment mode contract
Add explicit runtime mode:
- `deployment.mode = local_trusted | cloud_hosted`
Config behavior:
- mode stored in config file (`packages/shared/src/config-schema.ts`)
- loaded in server config (`server/src/config.ts`)
- surfaced in `/api/health`
Startup guardrails:
- `local_trusted`: fail startup if bind host is not loopback
- `cloud_hosted`: fail startup if Better Auth is not configured
## 4.2 Actor model
Replace implicit “board” semantics with explicit actors:
- `user` (session-authenticated human)
- `agent` (bearer API key)
- `local_implicit_admin` (local_trusted only)
Implementation note:
- keep `req.actor` shape backward-compatible during migration by introducing a normalizer helper
- remove hard-coded `"board"` checks route-by-route after new authz helpers are in place
## 4.3 Authorization model
Authorization input tuple:
- `(company_id, principal_type, principal_id, permission_key, scope_payload)`
Principal types:
- `user`
- `agent`
Role layers:
- `instance_admin` (instance-wide)
- company-scoped grants via `principal_permission_grants`
Evaluation order:
1. resolve principal from actor
2. resolve instance role (`instance_admin` short-circuit for admin-only actions)
3. resolve company membership (`active` required for company access)
4. resolve grant + scope for requested action
## 5. Data model
## 5.1 Better Auth tables
Managed by Better Auth adapter/migrations (expected minimum):
- `user`
- `session`
- `account`
- `verification`
Note:
- use Better Auth canonical table names/types to avoid custom forks
## 5.2 New Paperclip tables
1. `instance_user_roles`
- `id` uuid pk
- `user_id` text not null
- `role` text not null (`instance_admin`)
- `created_at`, `updated_at`
- unique index: `(user_id, role)`
2. `company_memberships`
- `id` uuid pk
- `company_id` uuid fk `companies.id` not null
- `principal_type` text not null (`user | agent`)
- `principal_id` text not null
- `status` text not null (`pending | active | suspended`)
- `membership_role` text null
- `created_at`, `updated_at`
- unique index: `(company_id, principal_type, principal_id)`
- index: `(principal_type, principal_id, status)`
3. `principal_permission_grants`
- `id` uuid pk
- `company_id` uuid fk `companies.id` not null
- `principal_type` text not null (`user | agent`)
- `principal_id` text not null
- `permission_key` text not null
- `scope` jsonb null
- `granted_by_user_id` text null
- `created_at`, `updated_at`
- unique index: `(company_id, principal_type, principal_id, permission_key)`
- index: `(company_id, permission_key)`
4. `invites`
- `id` uuid pk
- `company_id` uuid fk `companies.id` not null
- `invite_type` text not null (`company_join | bootstrap_ceo`)
- `token_hash` text not null
- `allowed_join_types` text not null (`human | agent | both`) for `company_join`
- `defaults_payload` jsonb null
- `expires_at` timestamptz not null
- `invited_by_user_id` text null
- `revoked_at` timestamptz null
- `accepted_at` timestamptz null
- `created_at` timestamptz not null default now()
- unique index: `(token_hash)`
- index: `(company_id, invite_type, revoked_at, expires_at)`
5. `join_requests`
- `id` uuid pk
- `invite_id` uuid fk `invites.id` not null
- `company_id` uuid fk `companies.id` not null
- `request_type` text not null (`human | agent`)
- `status` text not null (`pending_approval | approved | rejected`)
- `request_ip` text not null
- `requesting_user_id` text null
- `request_email_snapshot` text null
- `agent_name` text null
- `adapter_type` text null
- `capabilities` text null
- `agent_defaults_payload` jsonb null
- `created_agent_id` uuid fk `agents.id` null
- `approved_by_user_id` text null
- `approved_at` timestamptz null
- `rejected_by_user_id` text null
- `rejected_at` timestamptz null
- `created_at`, `updated_at`
- index: `(company_id, status, request_type, created_at desc)`
- unique index: `(invite_id)` to enforce one request per consumed invite
## 5.3 Existing table changes
1. `issues`
- add `assignee_user_id` text null
- enforce single-assignee invariant:
- at most one of `assignee_agent_id` and `assignee_user_id` is non-null
2. `agents`
- keep existing `permissions` JSON for transition only
- mark as deprecated in code path once principal grants are live
## 5.4 Migration strategy
Migration ordering:
1. add new tables/columns/indexes
2. backfill minimum memberships/grants for existing data:
- create local implicit admin membership context in local mode at runtime (not persisted as Better Auth user)
- for cloud, bootstrap creates first admin user role on acceptance
3. switch authz reads to new tables
4. remove legacy board-only checks
## 6. API contract (new/changed)
All under `/api`.
## 6.1 Health
`GET /api/health` response additions:
- `deploymentMode`
- `authReady`
- `bootstrapStatus` (`ready | bootstrap_pending`)
## 6.2 Invites
1. `POST /api/companies/:companyId/invites`
- create `company_join` invite
- copy-link value returned once
2. `GET /api/invites/:token`
- validate token
- return invite landing payload
- includes `allowedJoinTypes`
3. `POST /api/invites/:token/accept`
- body:
- `requestType: human | agent`
- human path: no extra payload beyond authenticated user
- agent path: `agentName`, `adapterType`, `capabilities`, optional adapter defaults
- consumes invite token
- creates `join_requests(status=pending_approval)`
4. `POST /api/invites/:inviteId/revoke`
- revokes non-consumed invite
## 6.3 Join requests
1. `GET /api/companies/:companyId/join-requests?status=pending_approval&requestType=...`
2. `POST /api/companies/:companyId/join-requests/:requestId/approve`
- human:
- create/activate `company_memberships`
- apply default grants
- agent:
- create `agents` row
- create pending claim context for API key
- create/activate agent membership
- apply default grants
3. `POST /api/companies/:companyId/join-requests/:requestId/reject`
4. `POST /api/join-requests/:requestId/claim-api-key`
- approved agent request only
- returns plaintext key once
- stores hash in `agent_api_keys`
## 6.4 Membership and grants
1. `GET /api/companies/:companyId/members`
- returns both principal types
2. `PATCH /api/companies/:companyId/members/:memberId/permissions`
- upsert/remove grants
3. `PUT /api/admin/users/:userId/company-access`
- instance admin only
4. `GET /api/admin/users/:userId/company-access`
5. `POST /api/admin/users/:userId/promote-instance-admin`
6. `POST /api/admin/users/:userId/demote-instance-admin`
## 6.5 Inbox
`GET /api/companies/:companyId/inbox` additions:
- pending join request alert items when actor can `joins:approve`
- each item includes inline action metadata:
- join request id
- request type
- source IP
- human email snapshot when applicable
## 7. Server implementation details
## 7.1 Config and startup
Files:
- `packages/shared/src/config-schema.ts`
- `server/src/config.ts`
- `server/src/index.ts`
- `server/src/startup-banner.ts`
Changes:
- add deployment mode + bind host settings
- enforce loopback-only for `local_trusted`
- enforce Better Auth readiness in `cloud_hosted`
- banner shows mode and bootstrap status
## 7.2 Better Auth integration
Files:
- `server/package.json` (dependency)
- `server/src/auth/*` (new)
- `server/src/app.ts` (mount auth handler endpoints + session middleware)
Changes:
- add Better Auth server instance
- cookie/session handling for cloud mode
- no-op session auth in local mode
## 7.3 Actor middleware
Files:
- `server/src/middleware/auth.ts`
- `server/src/routes/authz.ts`
- `server/src/middleware/board-mutation-guard.ts`
Changes:
- stop defaulting every request to board in cloud mode
- map local requests to `local_implicit_admin` actor in local mode
- map Better Auth session to `user` actor in cloud mode
- preserve agent bearer path
- replace `assertBoard` with permission-oriented helpers:
- `requireInstanceAdmin(req)`
- `requireCompanyAccess(req, companyId)`
- `requireCompanyPermission(req, companyId, permissionKey, scope?)`
## 7.4 Authorization services
Files:
- `server/src/services` (new modules)
- `memberships.ts`
- `permissions.ts`
- `invites.ts`
- `join-requests.ts`
- `instance-admin.ts`
Changes:
- centralized permission evaluation
- centralized membership resolution
- one place for principal-type branching
## 7.5 Routes
Files:
- `server/src/routes/index.ts` and new route modules:
- `auth.ts` (if needed)
- `invites.ts`
- `join-requests.ts`
- `members.ts`
- `instance-admin.ts`
- `inbox.ts` (or extension of existing inbox source)
Changes:
- add new endpoints listed above
- apply company and permission checks consistently
- log all mutations through activity log service
## 7.6 Activity log and audit
Files:
- `server/src/services/activity-log.ts`
- call sites in invite/join/member/admin routes
Required actions:
- `invite.created`
- `invite.revoked`
- `join.requested`
- `join.approved`
- `join.rejected`
- `membership.activated`
- `permission.granted`
- `permission.revoked`
- `instance_admin.promoted`
- `instance_admin.demoted`
- `agent_api_key.claimed`
- `agent_api_key.revoked`
## 7.7 Real-time and inbox propagation
Files:
- `server/src/services/live-events.ts`
- `server/src/realtime/live-events-ws.ts`
- inbox data source endpoint(s)
Changes:
- emit join-request events
- ensure inbox refresh path includes join alerts
## 8. CLI implementation
Files:
- `cli/src/index.ts`
- `cli/src/commands/onboard.ts`
- `cli/src/commands/configure.ts`
- `cli/src/prompts/server.ts`
Commands:
1. `paperclipai auth bootstrap-ceo`
- create bootstrap invite
- print one-time URL
2. `paperclipai onboard`
- in cloud mode with `bootstrap_pending`, print bootstrap URL and next steps
- in local mode, skip bootstrap requirement
Config additions:
- deployment mode
- bind host (validated against mode)
## 9. UI implementation
Files:
- routing: `ui/src/App.tsx`
- API clients: `ui/src/api/*`
- pages/components (new):
- `AuthLogin` / `AuthSignup` (cloud mode)
- `BootstrapPending` page
- `InviteLanding` page
- `InstanceSettings` page
- join approval components in `Inbox`
- member/grant management in company settings
Required UX:
1. Cloud unauthenticated user:
- redirect to login/signup
2. Cloud bootstrap pending:
- block app with setup command guidance
3. Invite landing:
- choose human vs agent path (respect `allowedJoinTypes`)
- submit join request
- show pending approval confirmation
4. Inbox:
- show join approval cards with approve/reject actions
- include source IP and human email snapshot when applicable
5. Local mode:
- no login prompts
- full settings/invite/approval UI available
## 10. Security controls
1. Token handling
- invite tokens hashed at rest
- API keys hashed at rest
- one-time plaintext key reveal only
2. Local mode isolation
- loopback bind enforcement
- startup hard-fail on non-loopback host
3. Cloud auth
- no implicit board fallback
- session auth mandatory for human mutations
4. Join workflow hardening
- one request per invite token
- pending request has no data access
- approval required before membership activation
5. Abuse controls
- rate limit invite accept and key claim endpoints
- structured logging for join and claim failures
## 11. Migration and compatibility
## 11.1 Runtime compatibility
- keep existing board-dependent routes functional while migrating authz helper usage
- phase out `assertBoard` calls only after permission helpers cover all routes
## 11.2 Data compatibility
- do not delete `agents.permissions` in V1
- stop reading it once grants are wired
- remove in post-V1 cleanup migration
## 11.3 Better Auth user ID handling
- treat `user.id` as text end-to-end
- existing `created_by_user_id` and similar text fields remain valid
## 12. Testing strategy
## 12.1 Unit tests
- permission evaluator:
- instance admin bypass
- grant checks
- scope checks
- join approval state machine
- invite token lifecycle
## 12.2 Integration tests
- cloud mode unauthenticated mutation -> `401`
- local mode implicit admin mutation -> success
- invite accept -> pending join -> no access
- join approve (human) -> membership/grants active
- join approve (agent) -> key claim once
- cross-company access denied for user and agent principals
- local mode non-loopback bind -> startup failure
## 12.3 UI tests
- login gate in cloud mode
- bootstrap pending screen
- invite landing choose-path UX
- inbox join alert approve/reject flows
## 12.4 Regression tests
- existing agent API key flows still work
- task assignment and checkout invariants unchanged
- activity logging still emitted for all mutations
## 13. Delivery plan
## Phase A: Foundations
- config mode/bind host support
- startup guardrails
- Better Auth integration skeleton
- actor type expansion
## Phase B: Schema and authz core
- add membership/grants/invite/join tables
- add permission service and helpers
- wire company/member/instance admin checks
## Phase C: Invite + join backend
- invite create/revoke
- invite accept -> pending request
- approve/reject + key claim
- activity log + live events
## Phase D: UI + CLI
- cloud login/bootstrap screens
- invite landing
- inbox join approval actions
- instance settings and member permissions
- bootstrap CLI command and onboarding updates
## Phase E: Hardening
- full integration/e2e coverage
- docs updates (`SPEC-implementation`, `DEVELOPING`, `CLI`)
- cleanup of legacy board-only codepaths
## 14. Verification gate
Before handoff:
```sh
pnpm -r typecheck
pnpm test:run
pnpm build
```
If any command is skipped, record exactly what was skipped and why.
## 15. Done criteria
1. Behavior matches locked V1 decisions in this doc and `doc/plan/humans-and-permissions.md`.
2. Cloud mode requires auth; local mode has no login UX.
3. Unified invite + pending approval flow works for both humans and agents.
4. Shared principal membership + permission system is live for users and agents.
5. Local mode remains loopback-only and fails otherwise.
6. Inbox shows actionable join approvals.
7. All new mutating paths are activity-logged.

View File

@@ -0,0 +1,421 @@
# Humans and Permissions Plan
Status: Draft
Date: 2026-02-21
Owner: Server + UI + Shared + DB
## Goal
Add first-class human users and permissions while preserving two deployment modes:
- local trusted single-user mode with no login friction
- cloud-hosted multi-user mode with mandatory authentication and authorization
## Why this plan
Current V1 assumptions are centered on one board operator. We now need:
- multi-human collaboration with per-user permissions
- safe cloud deployment defaults (no accidental loginless production)
- local mode that still feels instant (`npx paperclipai run` and go)
- agent-to-human task delegation, including a human inbox
- one user account with access to multiple companies in one deployment
- instance admins who can manage company access across the instance
- join approvals surfaced as actionable inbox alerts, not buried in admin-only pages
- a symmetric invite-and-approve onboarding path for both humans and agents
- one shared membership and permission model for both humans and agents
## Product constraints
1. Keep company scoping strict for every new table, endpoint, and permission check.
2. Preserve existing control-plane invariants:
- single-assignee task model
- approval gates
- budget hard-stop behavior
- mutation activity logging
3. Keep local mode easy and trusted, but prevent unsafe cloud posture.
## Deployment modes
## Mode A: `local_trusted`
Behavior:
- no login UI
- browser opens directly into board context
- embedded DB and local storage defaults remain
- a local implicit human actor exists for attribution
- local implicit actor has effective `instance_admin` authority for that instance
- full invite/approval/permission settings flows remain available in local mode (including agent enrollment)
Guardrails:
- server binds to loopback by default
- fail startup if mode is `local_trusted` with non-loopback bind
- UI shows a persistent "Local trusted mode" badge
## Mode B: `cloud_hosted`
Behavior:
- login required for all human endpoints
- Better Auth for human auth
- initial auth method: email + password
- email verification is not required for initial release
- hosted DB and remote deployment supported
- multi-user sessions and role/permission enforcement
Guardrails:
- fail startup if auth provider/session config is missing
- fail startup if insecure auth bypass flag is set
- health payload includes mode and auth readiness
## Authentication choice
- use Better Auth for human users
- start with email/password login only
- no email confirmation requirement in V1
- keep implementation structured so social/SSO providers can be added later without changing membership/permission semantics
## Auth and actor model
Unify request actors into a single model:
- `user` (authenticated human)
- `agent` (API key)
- `local_board_implicit` (local trusted mode only)
Rules:
- in `cloud_hosted`, only `user` and `agent` are valid actors
- in `local_trusted`, unauthenticated browser/API requests resolve to `local_board_implicit`
- `local_board_implicit` is authorized as an instance admin principal for local operations
- all mutating actions continue writing `activity_log` with actor type/id
## First admin bootstrap
Problem:
- new cloud deployments need a safe, explicit first human admin path
- app cannot assume a pre-existing admin account
- `local_trusted` does not use bootstrap flow because implicit local instance admin already exists
Bootstrap flow:
1. If no `instance_admin` user exists for the deployment, instance is in `bootstrap_pending` state.
2. CLI command `pnpm paperclipai auth bootstrap-ceo` creates a one-time CEO onboarding invite URL for that instance.
3. `pnpm paperclipai onboard` runs this bootstrap check and prints the invite URL automatically when `bootstrap_pending`.
4. Visiting the app while `bootstrap_pending` shows a blocking setup page with the exact CLI command to run (`pnpm paperclipai onboard`).
5. Accepting that CEO invite creates the first admin user and exits bootstrap mode.
Security rules:
- bootstrap invite is single-use, short-lived, and token-hash stored at rest
- only one active bootstrap invite at a time per instance (regeneration revokes prior token)
- bootstrap actions are audited in `activity_log`
## Data model additions
## New tables
1. `users`
- identity record for human users (email-based)
- optional instance-level role field (or companion table) for admin rights
2. `company_memberships`
- `company_id`, `principal_type` (`user | agent`), `principal_id`
- status (`pending | active | suspended`), role metadata
- stores effective access state for both humans and agents
- many-to-many: one principal can belong to multiple companies
3. `invites`
- `company_id`, `invite_type` (`company_join | bootstrap_ceo`), token hash, expires_at, invited_by, revoked_at, accepted_at
- one-time share link (no pre-bound invite email)
- `allowed_join_types` (`human | agent | both`) for `company_join` links
- optional defaults payload keyed by join type:
- human defaults: initial permissions/membership role
- agent defaults: proposed role/title/adapter defaults
4. `principal_permission_grants`
- `company_id`, `principal_type` (`user | agent`), `principal_id`, `permission_key`
- explicit grants such as `agents:create`
- includes scope payload for chain-of-command limits
- normalized table (not JSON blob) for auditable grant/revoke history
5. `join_requests`
- `invite_id`, `company_id`, `request_type` (`human | agent`)
- `status` (`pending_approval | approved | rejected`)
- common review metadata:
- `request_ip`
- `approved_by_user_id`, `approved_at`, `rejected_by_user_id`, `rejected_at`
- human request fields:
- `requesting_user_id`, `request_email_snapshot`
- agent request fields:
- `agent_name`, `adapter_type`, `capabilities`, `created_agent_id` nullable until approved
- each consumed invite creates exactly one join request record after join type is selected
6. `issues` extension
- add `assignee_user_id` nullable
- preserve single-assignee invariant with XOR check:
- exactly zero or one of `assignee_agent_id` / `assignee_user_id`
## Compatibility
- existing `created_by_user_id` / `author_user_id` fields remain and become fully active
- agent API keys remain auth credentials; membership + grants remain authorization source
## Permission model (initial set)
Principle:
- humans and agents use the same membership + grant evaluation engine
- permission checks resolve against `(company_id, principal_type, principal_id)` for both actor types
- this avoids separate authz codepaths and keeps behavior consistent
Role layers:
- `instance_admin`: deployment-wide admin, can access/manage all companies and user-company access mapping
- `company_member`: company-scoped permissions only
Core grants:
1. `agents:create`
2. `users:invite`
3. `users:manage_permissions`
4. `tasks:assign`
5. `tasks:assign_scope` (org-constrained delegation)
6. `joins:approve` (approve/reject human and agent join requests)
Additional behavioral rules:
- instance admins can promote/demote instance admins and manage user access across companies
- board-level users can manage company grants inside companies they control
- non-admin principals can only act within explicit grants
- assignment checks apply to both agent and human assignees
## Chain-of-command scope design
Initial approach:
- represent assignment scope as an allow rule over org hierarchy
- examples:
- `subtree:<agentId>` (can assign into that manager subtree)
- `exclude:<agentId>` (cannot assign to protected roles, e.g., CEO)
Enforcement:
- resolve target assignee org position
- evaluate allow/deny scope rules before assignment mutation
- return `403` for out-of-scope assignments
## Invite and signup flow
1. Authorized user creates one `company_join` invite share link with optional defaults + expiry.
2. System sends invite URL containing one-time token.
3. Invite landing page presents two paths: `Join as human` or `Join as agent` (subject to `allowed_join_types`).
4. Requester selects join path and submits required data.
5. Submission consumes token and creates a `pending_approval` join request (no access yet).
6. Join request captures review metadata:
- human: authenticated email
- both: source IP
- agent: proposed agent metadata
7. Company admin/instance admin reviews request and approves or rejects.
8. On approval:
- human: activate `company_membership` and apply permission grants
- agent: create agent record and enable API-key claim flow
9. Link is one-time and cannot be reused.
10. Inviter/admin can revoke invite before acceptance.
Security rules:
- store invite token hashed at rest
- one-time use token with short expiry
- all invite lifecycle events logged in `activity_log`
- pending users cannot read or mutate any company data until approved
## Join approval inbox
- join requests generate inbox alerts for eligible approvers (`joins:approve` or admin role)
- alerts appear in both:
- global/company inbox feed
- dedicated pending-approvals UI
- each alert includes approve/reject actions inline (no context switch required)
- alert payload must include:
- requester email when `request_type=human`
- source IP
- request type (`human | agent`)
## Human inbox and agent-to-human delegation
Behavior:
- agents can assign tasks to humans when policy permits
- humans see assigned tasks in inbox view (including in local trusted mode)
- comment and status transitions follow same issue lifecycle guards
## Agent join path (via unified invite link)
1. Authorized user shares one `company_join` invite link (with `allowed_join_types` including `agent`).
2. Agent operator opens link, chooses `Join as agent`, and submits join payload (name/role/adapter metadata).
3. System creates `pending_approval` agent join request and captures source IP.
4. Approver sees alert in inbox and approves or rejects.
5. On approval, server creates the agent record and mints a long-lived API key.
6. API key is shown exactly once via secure claim flow with explicit "save now" instruction.
Long-lived token policy:
- default to long-lived revocable API keys (hash stored at rest)
- show plaintext key once only
- support immediate revoke/regenerate from admin UI
- optionally add expirations/rotation policy later without changing join flow
API additions (proposed):
- `GET /companies/:companyId/inbox` (human actor scoped to self; includes task items + pending join approval alerts when authorized)
- `POST /companies/:companyId/issues/:issueId/assign-user`
- `POST /companies/:companyId/invites`
- `GET /invites/:token` (invite landing payload with `allowed_join_types`)
- `POST /invites/:token/accept` (body includes `requestType=human|agent` and request metadata)
- `POST /invites/:inviteId/revoke`
- `GET /companies/:companyId/join-requests?status=pending_approval&requestType=human|agent`
- `POST /companies/:companyId/join-requests/:requestId/approve`
- `POST /companies/:companyId/join-requests/:requestId/reject`
- `POST /join-requests/:requestId/claim-api-key` (approved agent requests only)
- `GET /companies/:companyId/members` (returns both human and agent principals)
- `PATCH /companies/:companyId/members/:memberId/permissions`
- `POST /admin/users/:userId/promote-instance-admin`
- `POST /admin/users/:userId/demote-instance-admin`
- `PUT /admin/users/:userId/company-access` (set accessible companies for a user)
- `GET /admin/users/:userId/company-access`
## Local mode UX policy
- no login prompt or account setup required
- local implicit board user is auto-provisioned for audit attribution
- local operator can still use instance settings and company settings as effective instance admin
- invite, join approval, and permission-management UI is available in local mode
- agent onboarding is expected in local mode, including creating invite links and approving join requests
- public/untrusted network ingress is out of scope for V1 local mode
## Cloud agents in this model
- cloud agents continue authenticating through `agent_api_keys`
- same-company boundary checks remain mandatory
- agent ability to assign human tasks is permission-gated, not implicit
## Instance settings surface
This plan introduces instance-level concerns (for example bootstrap state, instance admins, invite defaults, and token policy). There is no dedicated UI surface today.
V1 approach:
- add a minimal `Instance Settings` page for instance admins
- expose key instance settings in API + CLI (`paperclipai configure` / `paperclipai onboard`)
- show read-only instance status indicators in the main UI until full settings UX exists
## Implementation phases
## Phase 1: Mode and guardrails
- add explicit deployment mode config (`local_trusted | cloud_hosted`)
- enforce startup safety checks and health visibility
- implement actor resolution for local implicit board
- map local implicit board actor to instance-admin authorization context
- add bootstrap status signal in health/config surface (`ready | bootstrap_pending`)
- add minimal instance settings API/CLI surface and read-only UI indicators
## Phase 2: Human identity and memberships
- add schema + migrations for users/memberships/invites
- wire auth middleware for cloud mode
- add membership lookup and company access checks
- implement Better Auth email/password flow (no email verification)
- implement first-admin bootstrap invite command and onboard integration
- implement one-time share-link invite acceptance flow with `pending_approval` join requests
## Phase 3: Permissions and assignment scope
- add shared principal grant model and enforcement helpers
- add chain-of-command scope checks for assignment APIs
- add tests for forbidden assignment (for example, cannot assign to CEO)
- add instance-admin promotion/demotion and global company-access management APIs
- add `joins:approve` permission checks for human and agent join approvals
## Phase 4: Invite workflow
- unified `company_join` invite create/landing/accept/revoke endpoints
- join request approve/reject endpoints with review metadata (email when applicable, IP)
- one-time token security and revocation semantics
- UI for invite management, pending join approvals, and membership permissions
- inbox alert generation for pending join requests
- ensure invite and approval UX is enabled in both `cloud_hosted` and `local_trusted`
## Phase 5: Human inbox + task assignment updates
- extend issue assignee model for human users
- inbox API and UI for:
- task assignments
- pending join approval alerts with inline approve/reject actions
- agent-to-human assignment flow with policy checks
## Phase 6: Agent self-join and token claim
- add agent join path on unified invite landing page
- capture agent join requests and admin approval flow
- create one-time API-key claim flow after approval (display once)
## Acceptance criteria
1. `local_trusted` starts with no login and shows board UI immediately.
2. `local_trusted` does not expose optional human login UX in V1.
3. `local_trusted` local implicit actor can manage instance settings, invite links, join approvals, and permission grants.
4. `cloud_hosted` cannot start without auth configured.
5. No request in `cloud_hosted` can mutate data without authenticated actor.
6. If no initial admin exists, app shows bootstrap instructions with CLI command.
7. `pnpm paperclipai onboard` outputs a CEO onboarding invite URL when bootstrap is pending.
8. One `company_join` link supports both human and agent onboarding via join-type selection on the invite landing page.
9. Invite delivery in V1 is copy-link only (no built-in email delivery).
10. Share-link acceptance creates a pending join request; it does not grant immediate access.
11. Pending join requests appear as inbox alerts with inline approve/reject actions.
12. Admin review view includes join metadata before decision (human email when applicable, source IP, and agent metadata for agent requests).
13. Only approved join requests unlock access:
- human: active company membership + permission grants
- agent: agent creation + API-key claim eligibility
14. Agent enrollment follows the same link -> pending approval -> approve flow.
15. Approved agents can claim a long-lived API key exactly once, with plaintext display-once semantics.
16. Agent API keys are indefinite by default in V1 and revocable/regenerable by admins.
17. Public/untrusted ingress for `local_trusted` is not supported in V1 (loopback-only local server).
18. One user can hold memberships in multiple companies.
19. Instance admins can promote another user to instance admin.
20. Instance admins can manage which companies each user can access.
21. Permissions can be granted/revoked per member principal (human or agent) through one shared grant system.
22. Assignment scope prevents out-of-hierarchy or protected-role assignments.
23. Agents can assign tasks to humans only when allowed.
24. Humans can view assigned tasks in inbox and act on them per permissions.
25. All new mutations are company-scoped and logged in `activity_log`.
## V1 decisions (locked)
1. `local_trusted` will not support login UX in V1; implicit local board actor only.
2. Permissions use a normalized shared table: `principal_permission_grants` with scoped grants.
3. Invite delivery is copy-link only in V1 (no built-in email sending).
4. Bootstrap invite creation should require local shell access only (CLI path only, no HTTP bootstrap endpoint).
5. Approval review shows source IP only; no GeoIP/country lookup in V1.
6. Agent API-key lifetime is indefinite by default in V1, with explicit revoke/regenerate controls.
7. Local mode keeps full admin/settings/invite capabilities through the implicit local instance-admin actor.
8. Public/untrusted ingress for local mode is out of scope for V1; no `--dangerous-agent-ingress` in V1.

View File

@@ -0,0 +1,397 @@
# Token Optimization Plan
Date: 2026-03-13
Related discussion: https://github.com/paperclipai/paperclip/discussions/449
## Goal
Reduce token consumption materially without reducing agent capability, control-plane visibility, or task completion quality.
This plan is based on:
- the current V1 control-plane design
- the current adapter and heartbeat implementation
- the linked user discussion
- local runtime data from the default Paperclip instance on 2026-03-13
## Executive Summary
The discussion is directionally right about two things:
1. We should preserve session and prompt-cache locality more aggressively.
2. We should separate stable startup instructions from per-heartbeat dynamic context.
But that is not enough on its own.
After reviewing the code and local run data, the token problem appears to have four distinct causes:
1. **Measurement inflation on sessioned adapters.** Some token counters, especially for `codex_local`, appear to be recorded as cumulative session totals instead of per-heartbeat deltas.
2. **Avoidable session resets.** Task sessions are intentionally reset on timer wakes and manual wakes, which destroys cache locality for common heartbeat paths.
3. **Repeated context reacquisition.** The `paperclip` skill tells agents to re-fetch assignments, issue details, ancestors, and full comment threads on every heartbeat. The API does not currently offer efficient delta-oriented alternatives.
4. **Large static instruction surfaces.** Agent instruction files and globally injected skills are reintroduced at startup even when most of that content is unchanged and not needed for the current task.
The correct approach is:
1. fix telemetry so we can trust the numbers
2. preserve reuse where it is safe
3. make context retrieval incremental
4. add session compaction/rotation so long-lived sessions do not become progressively more expensive
## Validated Findings
### 1. Token telemetry is at least partly overstated today
Observed from the local default instance:
- `heartbeat_runs`: 11,360 runs between 2026-02-18 and 2026-03-13
- summed `usage_json.inputTokens`: `2,272,142,368,952`
- summed `usage_json.cachedInputTokens`: `2,217,501,559,420`
Those totals are not credible as true per-heartbeat usage for the observed prompt sizes.
Supporting evidence:
- `adapter.invoke.payload.prompt` averages were small:
- `codex_local`: ~193 chars average, 6,067 chars max
- `claude_local`: ~160 chars average, 1,160 chars max
- despite that, many `codex_local` runs report millions of input tokens
- one reused Codex session in local data spans 3,607 runs and recorded `inputTokens` growing up to `1,155,283,166`
Interpretation:
- for sessioned adapters, especially Codex, we are likely storing usage reported by the runtime as a **session total**, not a **per-run delta**
- this makes trend reporting, optimization work, and customer trust worse
This does **not** mean there is no real token problem. It means we need a trustworthy baseline before we can judge optimization impact.
### 2. Timer wakes currently throw away reusable task sessions
In `server/src/services/heartbeat.ts`, `shouldResetTaskSessionForWake(...)` returns `true` for:
- `wakeReason === "issue_assigned"`
- `wakeSource === "timer"`
- manual on-demand wakes
That means many normal heartbeats skip saved task-session resume even when the workspace is stable.
Local data supports the impact:
- `timer/system` runs: 6,587 total
- only 976 had a previous session
- only 963 ended with the same session
So timer wakes are the largest heartbeat path and are mostly not resuming prior task state.
### 3. We repeatedly ask agents to reload the same task context
The `paperclip` skill currently tells agents to do this on essentially every heartbeat:
- fetch assignments
- fetch issue details
- fetch ancestor chain
- fetch full issue comments
Current API shape reinforces that pattern:
- `GET /api/issues/:id/comments` returns the full thread
- there is no `since`, cursor, digest, or summary endpoint for heartbeat consumption
- `GET /api/issues/:id` returns full enriched issue context, not a minimal delta payload
This is safe but expensive. It forces the model to repeatedly consume unchanged information.
### 4. Static instruction payloads are not separated cleanly from dynamic heartbeat prompts
The user discussion suggested a bootstrap prompt. That is the right direction.
Current state:
- the UI exposes `bootstrapPromptTemplate`
- adapter execution paths do not currently use it
- several adapters prepend `instructionsFilePath` content directly into the per-run prompt or system prompt
Result:
- stable instructions are re-sent or re-applied in the same path as dynamic heartbeat content
- we are not deliberately optimizing for provider prompt caching
### 5. We inject more skill surface than most agents need
Local adapters inject repo skills into runtime skill directories.
Important `codex_local` nuance:
- Codex does not read skills directly from the active worktree.
- Paperclip discovers repo skills from the current checkout, then symlinks them into `$CODEX_HOME/skills` or `~/.codex/skills`.
- If an existing Paperclip skill symlink already points at another live checkout, the current implementation skips it instead of repointing it.
- This can leave Codex using stale skill content from a different worktree even after Paperclip-side skill changes land.
- That is both a correctness risk and a token-analysis risk, because runtime behavior may not reflect the instructions in the checkout being tested.
Current repo skill sizes:
- `skills/paperclip/SKILL.md`: 17,441 bytes
- `.agents/skills/create-agent-adapter/SKILL.md`: 31,832 bytes
- `skills/paperclip-create-agent/SKILL.md`: 4,718 bytes
- `skills/para-memory-files/SKILL.md`: 3,978 bytes
That is nearly 58 KB of skill markdown before any company-specific instructions.
Not all of that is necessarily loaded into model context every run, but it increases startup surface area and should be treated as a token budget concern.
## Principles
We should optimize tokens under these rules:
1. **Do not lose functionality.** Agents must still be able to resume work safely, understand why tasks exist, and act within governance rules.
2. **Prefer stable context over repeated context.** Unchanged instructions should not be resent through the most expensive path.
3. **Prefer deltas over full reloads.** Heartbeats should consume only what changed since the last useful run.
4. **Measure normalized deltas, not raw adapter claims.** Especially for sessioned CLIs.
5. **Keep escape hatches.** Board/manual runs may still want a forced fresh session.
## Plan
## Phase 1: Make token telemetry trustworthy
This should happen first.
### Changes
- Store both:
- raw adapter-reported usage
- Paperclip-normalized per-run usage
- For sessioned adapters, compute normalized deltas against prior usage for the same persisted session.
- Add explicit fields for:
- `sessionReused`
- `taskSessionReused`
- `promptChars`
- `instructionsChars`
- `hasInstructionsFile`
- `skillSetHash` or skill count
- `contextFetchMode` (`full`, `delta`, `summary`)
- Add per-adapter parser tests that distinguish cumulative-session counters from per-run counters.
### Why
Without this, we cannot tell whether a reduction came from a real optimization or a reporting artifact.
### Success criteria
- per-run token totals stop exploding on long-lived sessions
- a resumed sessions usage curve is believable and monotonic at the session level, but not double-counted at the run level
- cost pages can show both raw and normalized numbers while we migrate
## Phase 2: Preserve safe session reuse by default
This is the highest-leverage behavior change.
### Changes
- Stop resetting task sessions on ordinary timer wakes.
- Keep resetting on:
- explicit manual “fresh run” invocations
- assignment changes
- workspace mismatch
- model mismatch / invalid resume errors
- Add an explicit wake flag like `forceFreshSession: true` when the board wants a reset.
- Record why a session was reused or reset in run metadata.
### Why
Timer wakes are the dominant heartbeat path. Resetting them destroys both session continuity and prompt cache reuse.
### Success criteria
- timer wakes resume the prior task session in the large majority of stable-workspace cases
- no increase in stale-session failures
- lower normalized input tokens per timer heartbeat
## Phase 3: Separate static bootstrap context from per-heartbeat context
This is the right version of the discussions bootstrap idea.
### Changes
- Implement `bootstrapPromptTemplate` in adapter execution paths.
- Use it only when starting a fresh session, not on resumed sessions.
- Keep `promptTemplate` intentionally small and stable:
- who I am
- what triggered this wake
- which task/comment/approval to prioritize
- Move long-lived setup text out of recurring per-run prompts where possible.
- Add UI guidance and warnings when `promptTemplate` contains high-churn or large inline content.
### Why
Static instructions and dynamic wake context have different cache behavior and should be modeled separately.
For `codex_local`, this also requires isolating the Codex skill home per worktree or teaching Paperclip to repoint its own skill symlinks when the source checkout changes. Otherwise prompt and skill improvements in the active worktree may not reach the running agent.
### Success criteria
- fresh-session prompts can remain richer without inflating every resumed heartbeat
- resumed prompts become short and structurally stable
- cache hit rates improve for session-preserving adapters
## Phase 4: Make issue/task context incremental
This is the biggest product change and likely the biggest real token saver after session reuse.
### Changes
Add heartbeat-oriented endpoints and skill behavior:
- `GET /api/agents/me/inbox-lite`
- minimal assignment list
- issue id, identifier, status, priority, updatedAt, lastExternalCommentAt
- `GET /api/issues/:id/heartbeat-context`
- compact issue state
- parent-chain summary
- latest execution summary
- change markers
- `GET /api/issues/:id/comments?after=<cursor>` or `?since=<timestamp>`
- return only new comments
- optional `GET /api/issues/:id/context-digest`
- server-generated compact summary for heartbeat use
Update the `paperclip` skill so the default pattern becomes:
1. fetch compact inbox
2. fetch compact task context
3. fetch only new comments unless this is the first read, a mention-triggered wake, or a cache miss
4. fetch full thread only on demand
### Why
Today we are using full-fidelity board APIs as heartbeat APIs. That is convenient but token-inefficient.
### Success criteria
- after first task acquisition, most heartbeats consume only deltas
- repeated blocked-task or long-thread work no longer replays the whole comment history
- mention-triggered wakes still have enough context to respond correctly
## Phase 5: Add session compaction and controlled rotation
This protects against long-lived session bloat.
### Changes
- Add rotation thresholds per adapter/session:
- turns
- normalized input tokens
- age
- cache hit degradation
- Before rotating, produce a structured carry-forward summary:
- current objective
- work completed
- open decisions
- blockers
- files/artifacts touched
- next recommended action
- Persist that summary in task session state or runtime state.
- Start the next session with:
- bootstrap prompt
- compact carry-forward summary
- current wake trigger
### Why
Even when reuse is desirable, some sessions become too expensive to keep alive indefinitely.
### Success criteria
- very long sessions stop growing without bound
- rotating a session does not cause loss of task continuity
- successful task completion rate stays flat or improves
## Phase 6: Reduce unnecessary skill surface
### Changes
- Move from “inject all repo skills” to an allowlist per agent or per adapter.
- Default local runtime skill set should likely be:
- `paperclip`
- Add opt-in skills for specialized agents:
- `paperclip-create-agent`
- `para-memory-files`
- `create-agent-adapter`
- Expose active skill set in agent config and run metadata.
- For `codex_local`, either:
- run with a worktree-specific `CODEX_HOME`, or
- treat Paperclip-owned Codex skill symlinks as repairable when they point at a different checkout
### Why
Most agents do not need adapter-authoring or memory-system skills on every run.
### Success criteria
- smaller startup instruction surface
- no loss of capability for specialist agents that explicitly need extra skills
## Rollout Order
Recommended order:
1. telemetry normalization
2. timer-wake session reuse
3. bootstrap prompt implementation
4. heartbeat delta APIs + `paperclip` skill rewrite
5. session compaction/rotation
6. skill allowlists
## Acceptance Metrics
We should treat this plan as successful only if we improve both efficiency and task outcomes.
Primary metrics:
- normalized input tokens per successful heartbeat
- normalized input tokens per completed issue
- cache-hit ratio for sessioned adapters
- session reuse rate by invocation source
- fraction of heartbeats that fetch full comment threads
Guardrail metrics:
- task completion rate
- blocked-task rate
- stale-session failure rate
- manual intervention rate
- issue reopen rate after agent completion
Initial targets:
- 30% to 50% reduction in normalized input tokens per successful resumed heartbeat
- 80%+ session reuse on stable timer wakes
- 80%+ reduction in full-thread comment reloads after first task read
- no statistically meaningful regression in completion rate or failure rate
## Concrete Engineering Tasks
1. Add normalized usage fields and migration support for run analytics.
2. Patch sessioned adapter accounting to compute deltas from prior session totals.
3. Change `shouldResetTaskSessionForWake(...)` so timer wakes do not reset by default.
4. Implement `bootstrapPromptTemplate` end-to-end in adapter execution.
5. Add compact heartbeat context and incremental comment APIs.
6. Rewrite `skills/paperclip/SKILL.md` around delta-fetch behavior.
7. Add session rotation with carry-forward summaries.
8. Replace global skill injection with explicit allowlists.
9. Fix `codex_local` skill resolution so worktree-local skill changes reliably reach the runtime.
## Recommendation
Treat this as a two-track effort:
- **Track A: correctness and no-regret wins**
- telemetry normalization
- timer-wake session reuse
- bootstrap prompt implementation
- **Track B: structural token reduction**
- delta APIs
- skill rewrite
- session compaction
- skill allowlists
If we only do Track A, we will improve things, but agents will still re-read too much unchanged task context.
If we only do Track B without fixing telemetry first, we will not be able to prove the gains cleanly.

View File

@@ -0,0 +1,775 @@
# Agent Evals Framework Plan
Date: 2026-03-13
## Context
We need evals for the thing Paperclip actually ships:
- agent behavior produced by adapter config
- prompt templates and bootstrap prompts
- skill sets and skill instructions
- model choice
- runtime policy choices that affect outcomes and cost
We do **not** primarily need a fine-tuning pipeline.
We need a regression framework that can answer:
- if we change prompts or skills, do agents still do the right thing?
- if we switch models, what got better, worse, or more expensive?
- if we optimize tokens, did we preserve task outcomes?
- can we grow the suite over time from real Paperclip usage?
This plan is based on:
- `doc/GOAL.md`
- `doc/PRODUCT.md`
- `doc/SPEC-implementation.md`
- `docs/agents-runtime.md`
- `doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`
- Discussion #449: <https://github.com/paperclipai/paperclip/discussions/449>
- OpenAI eval best practices: <https://developers.openai.com/api/docs/guides/evaluation-best-practices>
- Promptfoo docs: <https://www.promptfoo.dev/docs/configuration/test-cases/> and <https://www.promptfoo.dev/docs/providers/custom-api/>
- LangSmith complex agent eval docs: <https://docs.langchain.com/langsmith/evaluate-complex-agent>
- Braintrust dataset/scorer docs: <https://www.braintrust.dev/docs/annotate/datasets> and <https://www.braintrust.dev/docs/evaluate/write-scorers>
## Recommendation
Paperclip should take a **two-stage approach**:
1. **Start with Promptfoo now** for narrow, prompt-and-skill behavior evals across models.
2. **Grow toward a first-party, repo-local eval harness in TypeScript** for full Paperclip scenario evals.
So the recommendation is no longer “skip Promptfoo.” It is:
- use Promptfoo as the fastest bootstrap layer
- keep eval cases and fixtures in this repo
- avoid making Promptfoo config the deepest long-term abstraction
More specifically:
1. The canonical eval definitions should live in this repo under a top-level `evals/` directory.
2. `v0` should use Promptfoo to run focused test cases across models and providers.
3. The longer-term harness should run **real Paperclip scenarios** against seeded companies/issues/agents, not just raw prompt completions.
4. The scoring model should combine:
- deterministic checks
- structured rubric scoring
- pairwise candidate-vs-baseline judging
- efficiency metrics from normalized usage/cost telemetry
5. The framework should compare **bundles**, not just models.
A bundle is:
- adapter type
- model id
- prompt template(s)
- bootstrap prompt template
- skill allowlist / skill content version
- relevant runtime flags
That is the right unit because that is what actually changes behavior in Paperclip.
## Why This Is The Right Shape
### 1. We need to evaluate system behavior, not only prompt output
Prompt-only tools are useful, but Paperclips real failure modes are often:
- wrong issue chosen
- wrong API call sequence
- bad delegation
- failure to respect approval boundaries
- stale session behavior
- over-reading context
- claiming completion without producing artifacts or comments
Those are control-plane behaviors. They require scenario setup, execution, and trace inspection.
### 2. The repo is already TypeScript-first
The existing monorepo already uses:
- `pnpm`
- `tsx`
- `vitest`
- TypeScript across server, UI, shared contracts, and adapters
A TypeScript-first harness will fit the repo and CI better than introducing a Python-first test subsystem as the default path.
Python can stay optional later for specialty scorers or research experiments.
### 3. We need provider/model comparison without vendor lock-in
OpenAIs guidance is directionally right:
- eval early and often
- use task-specific evals
- log everything
- prefer pairwise/comparison-style judging over open-ended scoring
But OpenAIs Evals API is not the right control plane for Paperclip as the primary system because our target is explicitly multi-model and multi-provider.
### 4. Hosted eval products are useful, and Promptfoo is the right bootstrap tool
The current tradeoff:
- Promptfoo is very good for local, repo-based prompt/provider matrices and CI integration.
- LangSmith is strong on trajectory-style agent evals.
- Braintrust has a clean dataset + scorer + experiment model and strong TypeScript support.
The community suggestion is directionally right:
- Promptfoo lets us start small
- it supports simple assertions like contains / not-contains / regex / custom JS
- it can run the same cases across multiple models
- it supports OpenRouter
- it can move into CI later
That makes it the best `v0` tool for “did this prompt/skill/model change obviously regress?”
But Paperclip should still avoid making a hosted platform or a third-party config format the core abstraction before we have our own stable eval model.
The right move is:
- start with Promptfoo for quick wins
- keep the data portable and repo-owned
- build a thin first-party harness around Paperclip concepts as the system grows
- optionally export to or integrate with other tools later if useful
## What We Should Evaluate
We should split evals into four layers.
### Layer 1: Deterministic contract evals
These should require no judge model.
Examples:
- agent comments on the assigned issue
- no mutation outside the agents company
- approval-required actions do not bypass approval flow
- task transitions are legal
- output contains required structured fields
- artifact links exist when the task required an artifact
- no full-thread refetch on delta-only cases once the API supports it
These are cheap, reliable, and should be the first line of defense.
### Layer 2: Single-step behavior evals
These test narrow behaviors in isolation.
Examples:
- chooses the correct issue from inbox
- writes a reasonable first status comment
- decides to ask for approval instead of acting directly
- delegates to the correct report
- recognizes blocked state and reports it clearly
These are the closest thing to prompt evals, but still framed in Paperclip terms.
### Layer 3: End-to-end scenario evals
These run a full heartbeat or short sequence of heartbeats against a seeded scenario.
Examples:
- new assignment pickup
- long-thread continuation
- mention-triggered clarification
- approval-gated hire request
- manager escalation
- workspace coding task that must leave a meaningful issue update
These should evaluate both final state and trace quality.
### Layer 4: Efficiency and regression evals
These are not “did the answer look good?” evals. They are “did we preserve quality while improving cost/latency?” evals.
Examples:
- normalized input tokens per successful heartbeat
- normalized tokens per completed issue
- session reuse rate
- full-thread reload rate
- wall-clock duration
- cost per successful scenario
This layer is especially important for token optimization work.
## Core Design
## 1. Canonical object: `EvalCase`
Each eval case should define:
- scenario setup
- target bundle(s)
- execution mode
- expected invariants
- scoring rubric
- tags/metadata
Suggested shape:
```ts
type EvalCase = {
id: string;
description: string;
tags: string[];
setup: {
fixture: string;
agentId: string;
trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
};
inputs?: Record<string, unknown>;
checks: {
hard: HardCheck[];
rubric?: RubricCheck[];
pairwise?: PairwiseCheck[];
};
metrics: MetricSpec[];
};
```
The important part is that the case is about a Paperclip scenario, not a standalone prompt string.
## 2. Canonical object: `EvalBundle`
Suggested shape:
```ts
type EvalBundle = {
id: string;
adapter: string;
model: string;
promptTemplate: string;
bootstrapPromptTemplate?: string;
skills: string[];
flags?: Record<string, string | number | boolean>;
};
```
Every comparison run should say which bundle was tested.
This avoids the common mistake of saying “model X is better” when the real change was model + prompt + skills + runtime behavior.
## 3. Canonical output: `EvalTrace`
We should capture a normalized trace for scoring:
- run ids
- prompts actually sent
- session reuse metadata
- issue mutations
- comments created
- approvals requested
- artifacts created
- token/cost telemetry
- timing
- raw outputs
The scorer layer should never need to scrape ad hoc logs.
## Scoring Framework
## 1. Hard checks first
Every eval should start with pass/fail checks that can invalidate the run immediately.
Examples:
- touched wrong company
- skipped required approval
- no issue update produced
- returned malformed structured output
- marked task done without required artifact
If a hard check fails, the scenario fails regardless of style or judge score.
## 2. Rubric scoring second
Rubric scoring should use narrow criteria, not vague “how good was this?” prompts.
Good rubric dimensions:
- task understanding
- governance compliance
- useful progress communication
- correct delegation
- evidence of completion
- concision / unnecessary verbosity
Each rubric should be a small 0-1 or 0-2 decision, not a mushy 1-10 scale.
## 3. Pairwise judging for candidate vs baseline
OpenAIs eval guidance is right that LLMs are better at discrimination than open-ended generation.
So for non-deterministic quality checks, the default pattern should be:
- run baseline bundle on the case
- run candidate bundle on the same case
- ask a judge model which is better on explicit criteria
- allow `baseline`, `candidate`, or `tie`
This is better than asking a judge for an absolute quality score with no anchor.
## 4. Efficiency scoring is separate
Do not bury efficiency inside a single blended quality score.
Record it separately:
- quality score
- cost score
- latency score
Then compute a summary decision such as:
- candidate is acceptable only if quality is non-inferior and efficiency is improved
That is much easier to reason about than one magic number.
## Suggested Decision Rule
For PR gating:
1. No hard-check regressions.
2. No significant regression on required scenario pass rate.
3. No significant regression on key rubric dimensions.
4. If the change is token-optimization-oriented, require efficiency improvement on target scenarios.
For deeper comparison reports, show:
- pass rate
- pairwise wins/losses/ties
- median normalized tokens
- median wall-clock time
- cost deltas
## Dataset Strategy
We should explicitly build the dataset from three sources.
### 1. Hand-authored seed cases
Start here.
These should cover core product invariants:
- assignment pickup
- status update
- blocked reporting
- delegation
- approval request
- cross-company access denial
- issue comment follow-up
These are small, clear, and stable.
### 2. Production-derived cases
Per OpenAIs guidance, we should log everything and mine real usage for eval cases.
Paperclip should grow eval coverage by promoting real runs into cases when we see:
- regressions
- interesting failures
- edge cases
- high-value success patterns worth preserving
The initial version can be manual:
- take a real run
- redact/normalize it
- convert it into an `EvalCase`
Later we can automate trace-to-case generation.
### 3. Adversarial and guardrail cases
These should intentionally probe failure modes:
- approval bypass attempts
- wrong-company references
- stale context traps
- irrelevant long threads
- misleading instructions in comments
- verbosity traps
This is where promptfoo-style red-team ideas can become useful later, but it is not the first slice.
## Repo Layout
Recommended initial layout:
```text
evals/
README.md
promptfoo/
promptfooconfig.yaml
prompts/
cases/
cases/
core/
approvals/
delegation/
efficiency/
fixtures/
companies/
issues/
bundles/
baseline/
experiments/
runners/
scenario-runner.ts
compare-runner.ts
scorers/
hard/
rubric/
pairwise/
judges/
rubric-judge.ts
pairwise-judge.ts
lib/
types.ts
traces.ts
metrics.ts
reports/
.gitignore
```
Why top-level `evals/`:
- it makes evals feel first-class
- it avoids hiding them inside `server/` even though they span adapters and runtime behavior
- it leaves room for both TS and optional Python helpers later
- it gives us a clean place for Promptfoo `v0` config plus the later first-party runner
## Execution Model
The harness should support three modes.
### Mode A: Cheap local smoke
Purpose:
- run on PRs
- keep cost low
- catch obvious regressions
Characteristics:
- 5 to 20 cases
- 1 or 2 bundles
- mostly hard checks and narrow rubrics
### Mode B: Candidate vs baseline compare
Purpose:
- evaluate a prompt/skill/model change before merge
Characteristics:
- paired runs
- pairwise judging enabled
- quality + efficiency diff report
### Mode C: Nightly broader matrix
Purpose:
- compare multiple models and bundles
- grow historical benchmark data
Characteristics:
- larger case set
- multiple models
- more expensive rubric/pairwise judging
## CI and Developer Workflow
Suggested commands:
```sh
pnpm evals:smoke
pnpm evals:compare --baseline baseline/codex-default --candidate experiments/codex-lean-skillset
pnpm evals:nightly
```
PR behavior:
- run `evals:smoke` on prompt/skill/adapter/runtime changes
- optionally trigger `evals:compare` for labeled PRs or manual runs
Nightly behavior:
- run larger matrix
- save report artifact
- surface trend lines on pass rate, pairwise wins, and efficiency
## Framework Comparison
## Promptfoo
Best use for Paperclip:
- prompt-level micro-evals
- provider/model comparison
- quick local CI integration
- custom JS assertions and custom providers
- bootstrap-layer evals for one skill or one agent workflow
What changed in this recommendation:
- Promptfoo is now the recommended **starting point**
- especially for “one skill, a handful of cases, compare across models”
Why it still should not be the only long-term system:
- its primary abstraction is still prompt/provider/test-case oriented
- Paperclip needs scenario setup, control-plane state inspection, and multi-step traces as first-class concepts
Recommendation:
- use Promptfoo first
- store Promptfoo config and cases in-repo under `evals/promptfoo/`
- use custom JS/TS assertions and, if needed later, a custom provider that calls Paperclip scenario runners
- do not make Promptfoo YAML the only canonical Paperclip eval format once we outgrow prompt-level evals
## LangSmith
What it gets right:
- final response evals
- trajectory evals
- single-step evals
Why not the primary system today:
- stronger fit for teams already centered on LangChain/LangGraph
- introduces hosted/external workflow gravity before our own eval model is stable
Recommendation:
- copy the trajectory/final/single-step taxonomy
- do not adopt the platform as the default requirement
## Braintrust
What it gets right:
- TypeScript support
- clean dataset/task/scorer model
- production logging to datasets
- experiment comparison over time
Why not the primary system today:
- still externalizes the canonical dataset and review workflow
- we are not yet at the maturity where hosted experiment management should define the shape of the system
Recommendation:
- borrow its dataset/scorer/experiment mental model
- revisit once we want hosted review and experiment history at scale
## OpenAI Evals / Evals API
What it gets right:
- strong eval principles
- emphasis on task-specific evals
- continuous evaluation mindset
Why not the primary system:
- Paperclip must compare across models/providers
- we do not want our primary eval runner coupled to one model vendor
Recommendation:
- use the guidance
- do not use it as the core Paperclip eval runtime
## First Implementation Slice
The first version should be intentionally small.
## Phase 0: Promptfoo bootstrap
Build:
- `evals/promptfoo/promptfooconfig.yaml`
- 5 to 10 focused cases for one skill or one agent workflow
- model matrix using the providers we care about most
- mostly deterministic assertions:
- contains
- not-contains
- regex
- custom JS assertions
Target scope:
- one skill, or one narrow workflow such as assignment pickup / first status update
- compare a small set of bundles across several models
Success criteria:
- we can run one command and compare outputs across models
- prompt/skill regressions become visible quickly
- the team gets signal before building heavier infrastructure
## Phase 1: Skeleton and core cases
Build:
- `evals/` scaffold
- `EvalCase`, `EvalBundle`, `EvalTrace` types
- scenario runner for seeded local cases
- 10 hand-authored core cases
- hard checks only
Target cases:
- assigned issue pickup
- write progress comment
- ask for approval when required
- respect company boundary
- report blocked state
- avoid marking done without artifact/comment evidence
Success criteria:
- a developer can run a local smoke suite
- prompt/skill changes can fail the suite deterministically
- Promptfoo `v0` cases either migrate into or coexist with this layer cleanly
## Phase 2: Pairwise and rubric layer
Build:
- rubric scorer interface
- pairwise judge runner
- candidate vs baseline compare command
- markdown/html report output
Success criteria:
- model/prompt bundle changes produce a readable diff report
- we can tell “better”, “worse”, or “same” on curated scenarios
## Phase 3: Efficiency integration
Build:
- normalized token/cost metrics into eval traces
- cost and latency comparisons
- efficiency gates for token optimization work
Dependency:
- this should align with the telemetry normalization work in `2026-03-13-TOKEN-OPTIMIZATION-PLAN.md`
Success criteria:
- quality and efficiency can be judged together
- token-reduction work no longer relies on anecdotal improvements
## Phase 4: Production-case ingestion
Build:
- tooling to promote real runs into new eval cases
- metadata tagging
- failure corpus growth process
Success criteria:
- the eval suite grows from real product behavior instead of staying synthetic
## Initial Case Categories
We should start with these categories:
1. `core.assignment_pickup`
2. `core.progress_update`
3. `core.blocked_reporting`
4. `governance.approval_required`
5. `governance.company_boundary`
6. `delegation.correct_report`
7. `threads.long_context_followup`
8. `efficiency.no_unnecessary_reloads`
That is enough to start catching the classes of regressions we actually care about.
## Important Guardrails
### 1. Do not rely on judge models alone
Every important scenario needs deterministic checks first.
### 2. Do not gate PRs on a single noisy score
Use pass/fail invariants plus a small number of stable rubric or pairwise checks.
### 3. Do not confuse benchmark score with product quality
The suite must keep growing from real runs, otherwise it will become a toy benchmark.
### 4. Do not evaluate only final output
Trajectory matters for agents:
- did they call the right Paperclip APIs?
- did they ask for approval?
- did they communicate progress?
- did they choose the right issue?
### 5. Do not make the framework vendor-shaped
Our eval model should survive changes in:
- judge provider
- candidate provider
- adapter implementation
- hosted tooling choices
## Open Questions
1. Should the first scenario runner invoke the real server over HTTP, or call services directly in-process?
My recommendation: start in-process for speed, then add HTTP-mode coverage once the model stabilizes.
2. Should we support Python scorers in v1?
My recommendation: no. Keep v1 all-TypeScript.
3. Should we commit baseline outputs?
My recommendation: commit case definitions and bundle definitions, but keep run artifacts out of git.
4. Should we add hosted experiment tracking immediately?
My recommendation: no. Revisit after the local harness proves useful.
## Final Recommendation
Start with Promptfoo for immediate, narrow model-and-prompt comparisons, then grow into a first-party `evals/` framework in TypeScript that evaluates **Paperclip scenarios and bundles**, not just prompts.
Use this structure:
- Promptfoo for `v0` bootstrap
- deterministic hard checks as the foundation
- rubric and pairwise judging for non-deterministic quality
- normalized efficiency metrics as a separate axis
- repo-local datasets that grow from real runs
Use external tools selectively:
- Promptfoo as the initial path for narrow prompt/provider tests
- Braintrust or LangSmith later if we want hosted experiment management
But keep the canonical eval model inside the Paperclip repo and aligned to Paperclips actual control-plane behaviors.

View File

@@ -0,0 +1,780 @@
# Feature specs
## 1) Guided onboarding + first-job magic
The repo already has `onboard`, `doctor`, `run`, deployment modes, and even agent-oriented onboarding text/skills endpoints, but there are also current onboarding/auth validation issues and an open “onboard failed” report. That means this is not just polish; it is product-critical. ([GitHub][1])
### Product decision
Replace “configuration-first onboarding” with **interview-first onboarding**.
### What we want
- Ask 34 questions up front, not 20 settings.
- Generate the right path automatically: local solo, shared private, or public cloud.
- Detect what agent/runtime environment already exists.
- Make it normal to have Claude/OpenClaw/Codex help complete setup.
- End onboarding with a **real first task**, not a blank dashboard.
### What we do not want
- Provider jargon before value.
- “Go find an API key” as the default first instruction.
- A successful install that still leaves users unsure what to do next.
### Proposed UX
On first run, show an interview:
```ts
type OnboardingProfile = {
useCase: "startup" | "agency" | "internal_team";
companySource: "new" | "existing";
deployMode: "local_solo" | "shared_private" | "shared_public";
autonomyMode: "hands_on" | "hybrid" | "full_auto";
primaryRuntime: "claude_code" | "codex" | "openclaw" | "other";
};
```
Questions:
1. What are you building?
2. Is this a new company, an existing company, or a service/agency team?
3. Are you working solo on one machine, sharing privately with a team, or deploying publicly?
4. Do you want full auto, hybrid, or tight manual control?
Then Paperclip should:
- detect installed CLIs/providers/subscriptions
- recommend the matching deployment/auth mode
- generate a local `onboarding.txt` / LLM handoff prompt
- offer a button: **“Open this in Claude / copy setup prompt”**
- create starter objects:
- company
- company goal
- CEO
- founding engineer or equivalent first report
- first suggested task
### Backend / API
- Add `GET /api/onboarding/recommendation`
- Add `GET /api/onboarding/llm-handoff.txt`
- Reuse existing invite/onboarding/skills patterns for local-first bootstrap
- Persist onboarding answers into instance config for later defaults
### Acceptance criteria
- Fresh install with a supported local runtime completes without manual JSON/env editing.
- User sees first live agent action before leaving onboarding.
- A blank dashboard is no longer the default post-install state.
- If a required dependency is missing, the error is prescriptive and fixable from the UI/CLI.
### Non-goals
- Account creation
- enterprise SSO
- perfect provider auto-detection for every runtime
---
## 2) Board command surface, not generic chat
There is a real tension here: the transcript says users want “chat with my CEO,” while the public product definition says Paperclip is **not a chatbot** and V1 communication is **tasks + comments only**. At the same time, the repo is already exploring plugin infrastructure and even a chat plugin via plugin SSE streaming. The clean resolution is: **make the core surface conversational, but keep the data model task/thread-centric; reserve full chat as an optional plugin**. ([GitHub][2])
### Product decision
Build a **Command Composer** backed by issues/comments/approvals, not a separate chat subsystem.
### What we want
- “Talk to the CEO” feeling for the user.
- Every conversation ends up attached to a real company object.
- Strategy discussion can produce issues, artifacts, and approvals.
### What we do not want
- A blank “chat with AI” home screen disconnected from the org.
- Yet another agent-chat product.
### Proposed UX
Add a global composer with modes:
```ts
type ComposerMode = "ask" | "task" | "decision";
type ThreadScope = "company" | "project" | "issue" | "agent";
```
Examples:
- On dashboard: “Ask the CEO for a hiring plan” → creates a `strategy` issue/thread scoped to the company.
- On agent page: “Tell the designer to make this cleaner” → appends an instruction comment to an issue or spawns a new delegated task.
- On approval page: “Why are you asking to hire?” → appends a board comment to the approval context.
Add issue kinds:
```ts
type IssueKind = "task" | "strategy" | "question" | "decision";
```
### Backend / data model
Prefer extending existing `issues` rather than creating `chats`:
- `issues.kind`
- `issues.scope`
- optional `issues.target_agent_id`
- comment metadata: `comment.intent = hint | correction | board_question | board_decision`
### Acceptance criteria
- A user can “ask CEO” from the dashboard and receive a response in a company-scoped thread.
- From that thread, the user can create/approve tasks with one click.
- No separate chat database is required for v1 of this feature.
### Non-goals
- consumer chat UX
- model marketplace
- general-purpose assistant unrelated to company context
---
## 3) Live org visibility + explainability layer
The core product promise is already visibility and governance, but right now the transcript makes clear that the UI is still too close to raw agent execution. The repo already has org charts, activity, heartbeat runs, costs, and agent detail surfaces; the missing piece is the explanatory layer above them. ([GitHub][1])
### Product decision
Default the UI to **human-readable operational summaries**, with raw logs one layer down.
### What we want
- At company level: “who is active, what are they doing, what is moving between teams”
- At agent level: “what is the plan, what step is complete, what outputs were produced”
- At run level: “summary first, transcript second”
### Proposed UX
Company page:
- org chart with live active-state indicators
- delegation animation between nodes when work moves
- current open priorities
- pending approvals
- burn / budget warning strip
Agent page:
- status card
- current issue
- plan checklist
- latest artifact(s)
- summary of last run
- expandable raw trace/logs
Run page:
- **Summary**
- **Steps**
- **Raw transcript / tool calls**
### Backend / API
Generate a run view model from current run/activity data:
```ts
type RunSummary = {
runId: string;
headline: string;
objective: string | null;
currentStep: string | null;
completedSteps: string[];
delegatedTo: { agentId: string; issueId?: string }[];
artifactIds: string[];
warnings: string[];
};
```
Phase 1 can derive this server-side from existing run logs/comments. Persist only if needed later.
### Acceptance criteria
- Board can tell what is happening without reading shell commands.
- Raw logs are still accessible, but not the default surface.
- First task / first hire / first completion moments are visibly celebrated.
### Non-goals
- overdesigned animation system
- perfect semantic summarization before core data quality exists
---
## 4) Artifact system: attachments, file browser, previews
This gap is already showing up in the repo. Storage is present, attachment endpoints exist, but current issues show that attachments are still effectively image-centric and comment attachment rendering is incomplete. At the same time, your transcript wants plans, docs, files, and generated web pages surfaced cleanly. ([GitHub][4])
### Product decision
Introduce a first-class **Artifact** model that unifies:
- uploaded/generated files
- workspace files of interest
- preview URLs
- generated docs/reports
### What we want
- Plans, specs, CSVs, markdown, PDFs, logs, JSON, HTML outputs
- easy discoverability from the issue/run/company pages
- a lightweight file browser for project workspaces
- preview links for generated websites/apps
### What we do not want
- forcing agents to paste everything inline into comments
- HTML stuffed into comment bodies as a workaround
- a full web IDE
### Phase 1: fix the obvious gaps
- Accept non-image MIME types for issue attachments
- Attach files to comments correctly
- Show file metadata + download/open on issue page
### Phase 2: introduce artifacts
```ts
type ArtifactKind = "attachment" | "workspace_file" | "preview" | "report_link";
interface Artifact {
id: string;
companyId: string;
issueId?: string;
runId?: string;
agentId?: string;
kind: ArtifactKind;
title: string;
mimeType?: string;
filename?: string;
sizeBytes?: number;
storageKind: "local_disk" | "s3" | "external_url";
contentPath?: string;
previewUrl?: string;
metadata: Record<string, unknown>;
}
```
### UX
Issue page gets a **Deliverables** section:
- Files
- Reports
- Preview links
- Latest generated artifact highlighted at top
Project page gets a **Files** tab:
- folder tree
- recent changes
- “Open produced files” shortcut
### Preview handling
For HTML/static outputs:
- local deploy → open local preview URL
- shared/public deploy → host via configured preview service or static storage
- preview URL is registered back onto the issue as an artifact
### Acceptance criteria
- Agents can attach `.md`, `.txt`, `.json`, `.csv`, `.pdf`, and `.html`.
- Users can open/download them from the issue page.
- A generated static site can be opened from an issue without hunting through the filesystem.
### Non-goals
- browser IDE
- collaborative docs editor
- full object-storage admin UI
---
## 5) Shared/cloud deployment + cloud runtimes
The repo already has a clear deployment story in docs: `local_trusted`, `authenticated/private`, and `authenticated/public`, plus Tailscale guidance. The roadmap explicitly calls out cloud agents like Cursor / e2b. That means the next step is not inventing a deployment model; it is making the shared/cloud path canonical and production-usable. ([GitHub][5])
### Product decision
Make **shared/private deploy** and **public/cloud deploy** first-class supported modes, and add **remote runtime drivers** for cloud-executed agents.
### What we want
- one instance a team can actually share
- local-first path that upgrades to private/public without a mental model change
- remote agent execution for non-local runtimes
### Proposed architecture
Separate **control plane** from **execution runtime** more explicitly:
```ts
type RuntimeDriver = "local_process" | "remote_sandbox" | "webhook";
interface ExecutionHandle {
externalRunId: string;
status: "queued" | "running" | "completed" | "failed" | "cancelled";
previewUrl?: string;
logsUrl?: string;
}
```
First remote driver: `remote_sandbox` for e2b-style execution.
### Deliverables
- canonical deploy recipes:
- local solo
- shared private (Tailscale/private auth)
- public cloud (managed Postgres + object storage + public URL)
- runtime health page
- adapter/runtime capability matrix
- one official reference deployment path
### UX
New “Deployment” settings page:
- instance mode
- auth/exposure
- storage/database status
- runtime drivers configured
- health and reachability checks
### Acceptance criteria
- Two humans can log into one authenticated/private instance and use it concurrently.
- A public deployment can run agents via at least one remote runtime.
- `doctor` catches missing public/private config and gives concrete fixes.
### Non-goals
- fully managed Paperclip SaaS
- every possible cloud provider in v1
---
## 6) Multi-human collaboration (minimal, not enterprise RBAC)
This is the biggest deliberate departure from the current V1 spec. Publicly, V1 still says “single human board operator” and puts role-based human granularity out of scope. But the transcript is right that shared use is necessary if Paperclip is going to be real for teams. The key is to do a **minimal collaboration model**, not a giant permission system. ([GitHub][2])
### Product decision
Ship **coarse multi-user company memberships**, not fine-grained enterprise RBAC.
### Proposed roles
```ts
type CompanyRole = "owner" | "admin" | "operator" | "viewer";
```
- **owner**: instance/company ownership, user invites, config
- **admin**: manage org, agents, budgets, approvals
- **operator**: create/update issues, interact with agents, view artifacts
- **viewer**: read-only
### Data model
```ts
interface CompanyMembership {
userId: string;
companyId: string;
role: CompanyRole;
invitedByUserId: string;
createdAt: string;
}
```
Stretch goal later:
- optional project/team scoping
### What we want
- shared dashboard for real teams
- user attribution in activity log
- simple invite flow
- company-level isolation preserved
### What we do not want
- per-field ACLs
- SCIM/SSO/enterprise admin consoles
- ten permission toggles per page
### Acceptance criteria
- Team of 3 can use one shared Paperclip instance.
- Every user action is attributed correctly in activity.
- Company membership boundaries are enforced.
- Viewer cannot mutate; operator/admin can.
### Non-goals
- enterprise RBAC
- cross-company matrix permissions
- multi-board governance logic in first cut
---
## 7) Auto mode + interrupt/resume
This is a product behavior issue, not a UI nicety. If agents cannot keep working or accept course correction without restarting, the autonomy model feels fake.
### Product decision
Make auto mode and mid-run interruption first-class runtime semantics.
### What we want
- Auto mode that continues until blocked by approvals, budgets, or explicit pause.
- Mid-run “you missed this” correction without losing session continuity.
- Clear state when an agent is waiting, blocked, or paused.
### Proposed state model
```ts
type RunState =
| "queued"
| "running"
| "waiting_approval"
| "waiting_input"
| "paused"
| "completed"
| "failed"
| "cancelled";
```
Add board interjections as resumable input events:
```ts
interface RunMessage {
runId: string;
authorUserId: string;
mode: "hint" | "correction" | "hard_override";
body: string;
resumeCurrentSession: boolean;
}
```
### UX
Buttons on active run:
- Pause
- Resume
- Interrupt
- Abort
- Restart from scratch
Interrupt opens a small composer that explicitly says:
- continue current session
- or restart run
### Acceptance criteria
- A board comment can resume an active session instead of spawning a fresh one.
- Session ID remains stable for “continue” path.
- UI clearly distinguishes blocked vs. waiting vs. paused.
### Non-goals
- simultaneous multi-user live editing of the same run transcript
- perfect conversational UX before runtime semantics are fixed
---
## 8) Cost safety + heartbeat/runtime hardening
This is probably the most important immediate workstream. The transcript says token burn is the highest pain, and the repo currently has active issues around budget enforcement evidence, onboarding/auth validation, and circuit-breaker style waste prevention. Public docs already promise hard budgets, and the issue tracker is pointing at the missing operational protections. ([GitHub][6])
### Product decision
Treat this as a **P0 runtime contract**, not a nice-to-have.
### Part A: deterministic wake gating
Do cheap, explicit work detection before invoking an LLM.
```ts
type WakeReason =
| "new_assignment"
| "new_comment"
| "mention"
| "approval_resolved"
| "scheduled_scan"
| "manual";
```
Rules:
- if no new actionable input exists, do not call the model
- scheduled scan should be a cheap policy check first, not a full reasoning pass
### Part B: budget contract
Keep the existing public promise, but make it undeniable:
- warning at 80%
- auto-pause at 100%
- visible audit trail
- explicit board override to continue
### Part C: circuit breaker
Add per-agent runtime guards:
```ts
interface CircuitBreakerConfig {
enabled: boolean;
maxConsecutiveNoProgress: number;
maxConsecutiveFailures: number;
tokenVelocityMultiplier: number;
}
```
Trip when:
- no issue/status/comment progress for N runs
- N failures in a row
- token spike vs rolling average
### Part D: refactor heartbeat service
Split current orchestration into modules:
- wake detector
- checkout/lock manager
- adapter runner
- session manager
- cost recorder
- breaker evaluator
- event streamer
### Part E: regression suite
Mandatory automated proofs for:
- onboarding/auth matrix
- 80/100 budget behavior
- no cross-company auth leakage
- no-spurious-wake idle behavior
- active-run resume/interruption
- remote runtime smoke
### Acceptance criteria
- Idle org with no new work does not generate model calls from heartbeat scans.
- 80% shows warning only.
- 100% pauses the agent and blocks continued execution until override.
- Circuit breaker pause is visible in audit/activity.
- Runtime modules have explicit contracts and are testable independently.
### Non-goals
- perfect autonomous optimization
- eliminating all wasted calls in every adapter/provider
---
## 9) Project workspaces, previews, and PR handoff — without becoming GitHub
This is the right way to resolve the code-workflow debate. The repo already has worktree-local instances, project `workspaceStrategy.provisionCommand`, and an RFC for adapter-level git worktree isolation. That is the correct architectural direction: **project execution policies and workspace isolation**, not built-in PR review. ([GitHub][7])
### Product decision
Paperclip should manage the **issue → workspace → preview/PR → review handoff** lifecycle, but leave diffs/review/merge to external tools.
### Proposed config
Prefer repo-local project config:
```yaml
# .paperclip/project.yml
execution:
workspaceStrategy: shared | worktree | ephemeral_container
deliveryMode: artifact | preview | pull_request
provisionCommand: "pnpm install"
teardownCommand: "pnpm clean"
preview:
command: "pnpm dev --port $PAPERCLIP_PREVIEW_PORT"
healthPath: "/"
ttlMinutes: 120
vcs:
provider: github
repo: owner/repo
prPerIssue: true
baseBranch: main
```
### Rules
- For non-code projects: `deliveryMode=artifact`
- For UI/app work: `deliveryMode=preview`
- For git-backed engineering projects: `deliveryMode=pull_request`
- For git-backed projects with `prPerIssue=true`, one issue maps to one isolated branch/worktree
### UX
Issue page shows:
- workspace link/status
- preview URL if available
- PR URL if created
- “Reopen preview” button with TTL
- lifecycle:
- `todo`
- `in_progress`
- `in_review`
- `done`
### What we want
- safe parallel agent work on one repo
- previewable output
- external PR review
- project-defined hooks, not hardcoded assumptions
### What we do not want
- built-in diff viewer
- merge queue
- Jira clone
- mandatory PRs for non-code work
### Acceptance criteria
- Multiple engineer agents can work concurrently without workspace contamination.
- When a project is in PR mode, the issue contains branch/worktree/preview/PR metadata.
- Preview can be reopened on demand until TTL expires.
### Non-goals
- replacing GitHub/GitLab
- universal preview hosting for every framework on day one
---
## 10) Plugin system as the escape hatch
The roadmap already includes plugins, GitHub discussions are active around it, and there is an open issue proposing an SSE bridge specifically to enable streaming plugin UIs such as chat, logs, and monitors. This is exactly the right place for optional surfaces. ([GitHub][1])
### Product decision
Keep the control-plane core thin; put optional high-variance experiences into plugins.
### First-party plugin targets
- Chat
- Knowledge base / RAG
- Log tail / live build output
- Custom tracing or queues
- Doc editor / proposal builder
### Plugin manifest
```ts
interface PluginManifest {
id: string;
version: string;
requestedPermissions: (
| "read_company"
| "read_issue"
| "write_issue_comment"
| "create_issue"
| "stream_ui"
)[];
surfaces: ("company_home" | "issue_panel" | "agent_panel" | "sidebar")[];
workerEntry: string;
uiEntry: string;
}
```
### Platform requirements
- host ↔ worker action bridge
- SSE/UI streaming
- company-scoped auth
- permission declaration
- surface slots in UI
### Acceptance criteria
- A plugin can stream events to UI in real time.
- A chat plugin can converse without requiring chat to become the core Paperclip product.
- Plugin permissions are company-scoped and auditable.
### Non-goals
- plugins mutating core schema directly
- arbitrary privileged code execution without explicit permissions
---
## Priority order I would use
Given the repo state and the transcript, I would sequence it like this:
**P0**
1. Cost safety + heartbeat hardening
2. Guided onboarding + first-job magic
3. Shared/cloud deployment foundation
4. Artifact phase 1: non-image attachments + deliverables surfacing
**P1** 5. Board command surface 6. Visibility/explainability layer 7. Auto mode + interrupt/resume 8. Minimal multi-user collaboration
**P2** 9. Project workspace / preview / PR lifecycle 10. Plugin system + optional chat plugin 11. Template/preset expansion for startup vs agency vs internal-team onboarding
Why this order: the current repo is already getting pressure on onboarding failures, auth/onboarding validation, budget enforcement, and wasted token burn. If those are shaky, everything else feels impressive but unsafe. ([GitHub][3])
## Bottom line
The best synthesis is:
- **Keep** Paperclip as the board-level control plane.
- **Do not** make chat, code review, or workflow-building the core identity.
- **Do** make the product feel conversational, visible, output-oriented, and shared.
- **Do** make coding workflows an integration surface via workspaces/previews/PR links.
- **Use plugins** for richer edges like chat and knowledge.
That keeps the repos current product direction intact while solving almost every pain surfaced in the transcript.
### Key references
- README / positioning / roadmap / product boundaries. ([GitHub][1])
- Product definition. ([GitHub][8])
- V1 implementation spec and explicit non-goals. ([GitHub][2])
- Core concepts and architecture. ([GitHub][9])
- Deployment modes / Tailscale / local-to-cloud path. ([GitHub][5])
- Developing guide: worktree-local instances, provision hooks, onboarding endpoints. ([GitHub][7])
- Current issue pressure: onboarding failure, auth/onboarding validation, budget enforcement, circuit breaker, attachment gaps, plugin chat. ([GitHub][3])
[1]: https://github.com/paperclipai/paperclip "https://github.com/paperclipai/paperclip"
[2]: https://github.com/paperclipai/paperclip/blob/master/doc/SPEC-implementation.md "https://github.com/paperclipai/paperclip/blob/master/doc/SPEC-implementation.md"
[3]: https://github.com/paperclipai/paperclip/issues/704 "https://github.com/paperclipai/paperclip/issues/704"
[4]: https://github.com/paperclipai/paperclip/blob/master/docs/deploy/tailscale-private-access.md "https://github.com/paperclipai/paperclip/blob/master/docs/deploy/tailscale-private-access.md"
[5]: https://github.com/paperclipai/paperclip/blob/master/docs/deploy/deployment-modes.md "https://github.com/paperclipai/paperclip/blob/master/docs/deploy/deployment-modes.md"
[6]: https://github.com/paperclipai/paperclip/issues/692 "https://github.com/paperclipai/paperclip/issues/692"
[7]: https://github.com/paperclipai/paperclip/blob/master/doc/DEVELOPING.md "https://github.com/paperclipai/paperclip/blob/master/doc/DEVELOPING.md"
[8]: https://github.com/paperclipai/paperclip/blob/master/doc/PRODUCT.md "https://github.com/paperclipai/paperclip/blob/master/doc/PRODUCT.md"
[9]: https://github.com/paperclipai/paperclip/blob/master/docs/start/core-concepts.md "https://github.com/paperclipai/paperclip/blob/master/docs/start/core-concepts.md"

View File

@@ -0,0 +1,186 @@
# Paperclip Skill Tightening Plan
## Status
Deferred follow-up. Do not include in the current token-optimization PR beyond documenting the plan.
## Why This Is Deferred
The `paperclip` skill is part of the critical control-plane safety surface. Tightening it may reduce fresh-session token use, but it also carries prompt-regression risk. We do not yet have evals that would let us safely prove behavior preservation across assignment handling, checkout rules, comment etiquette, approval workflows, and escalation paths.
The current PR should ship the lower-risk infrastructure wins first:
- telemetry normalization
- safe session reuse
- incremental issue/comment context
- bootstrap versus heartbeat prompt separation
- Codex worktree isolation
## Current Problem
Fresh runs still spend substantial input tokens even after the context-path fixes. The remaining large startup cost appears to come from loading the full `paperclip` skill and related instruction surface into context at run start.
The skill currently mixes three kinds of content in one file:
- hot-path heartbeat procedure used on nearly every run
- critical policy and safety invariants
- rare workflow/reference material that most runs do not need
That structure is safe but expensive.
## Goals
- reduce first-run instruction tokens without weakening agent safety
- preserve all current Paperclip control-plane capabilities
- keep common heartbeat behavior explicit and easy for agents to follow
- move rare workflows and reference material out of the hot path
- create a structure that can later be evaluated systematically
## Non-Goals
- changing Paperclip API semantics
- removing required governance rules
- deleting rare workflows
- changing agent defaults in the current PR
## Recommended Direction
### 1. Split Hot Path From Lookup Material
Restructure the skill into:
- an always-loaded core section for the common heartbeat loop
- on-demand material for infrequent workflows and deep reference
The core should cover only what is needed on nearly every wake:
- auth and required headers
- inbox-first assignment retrieval
- mandatory checkout behavior
- `heartbeat-context` first
- incremental comment retrieval rules
- mention/self-assign exception
- blocked-task dedup
- status/comment/release expectations before exit
### 2. Normalize The Skill Around One Canonical Procedure
The same rules are currently expressed multiple times across:
- heartbeat steps
- critical rules
- endpoint reference
- workflow examples
Refactor so each operational fact has one primary home:
- procedure
- invariant list
- appendix/reference
This reduces prompt weight and lowers the chance of internal instruction drift.
### 3. Compress Prose Into High-Signal Instruction Forms
Rewrite the hot path using compact operational forms:
- short ordered checklist
- flat invariant list
- minimal examples only where ambiguity would be risky
Reduce:
- narrative explanation
- repeated warnings already covered elsewhere
- large example payloads for common operations
- long endpoint matrices in the main body
### 4. Move Rare Workflows Behind Explicit Triggers
These workflows should remain available but should not dominate fresh-run context:
- OpenClaw invite flow
- project setup flow
- planning `<plan/>` writeback flow
- instructions-path update flow
- detailed link-formatting examples
Recommended approach:
- keep a short pointer in the main skill
- move detailed procedures into sibling skills or referenced docs that agents read only when needed
### 5. Separate Policy From Reference
The skill should distinguish:
- mandatory operating rules
- endpoint lookup/reference
- business-process playbooks
That separation makes it easier to evaluate prompt changes later and lets adapters or orchestration choose what must always be loaded.
## Proposed Target Structure
1. Purpose and authentication
2. Compact heartbeat procedure
3. Hard invariants
4. Required comment/update style
5. Triggered workflow index
6. Appendix/reference
## Rollout Plan
### Phase 1. Inventory And Measure
- annotate the current skill by section and estimate token weight
- identify which sections are truly hot-path versus rare
- capture representative runs to compare before/after prompt size and behavior
### Phase 2. Structural Refactor Without Semantic Changes
- rewrite the main skill into the target structure
- preserve all existing rules and capabilities
- move rare workflow details into referenced companion material
- keep wording changes conservative
### Phase 3. Validate Against Real Scenarios
Run scenario checks for:
- normal assigned heartbeat
- comment-triggered wake
- blocked-task dedup behavior
- approval-resolution wake
- delegation/subtask creation
- board handoff back to user
- plan-request handling
### Phase 4. Decide Default Loading Strategy
After validation, decide whether:
- the entire main skill still loads by default, or
- only the compact core loads by default and rare sections are fetched on demand
Do not change this loading policy without validation.
## Risks
- prompt degradation on control-plane safety rules
- agents forgetting rare but important workflows
- accidental removal of repeated wording that was carrying useful behavior
- introducing ambiguous instruction precedence between the core skill and companion materials
## Preconditions Before Implementation
- define acceptance scenarios for control-plane correctness
- add at least lightweight eval or scripted scenario coverage for key Paperclip flows
- confirm how adapter/bootstrap layering should load skill content versus references
## Success Criteria
- materially lower first-run input tokens for Paperclip-coordinated agents
- no regression in checkout discipline, issue updates, blocked handling, or delegation
- no increase in malformed API usage or ownership mistakes
- agents still complete rare workflows correctly when explicitly asked

File diff suppressed because it is too large Load Diff