Add issue run orchestration plan
Document the design for per-issue execution locks that prevent cascaded wakeups from spawning concurrent runs on the same issue. Covers same-name coalescing, cross-agent deferral, and deferred promotion on run completion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
141
doc/plans/issue-run-orchestration-plan.md
Normal file
141
doc/plans/issue-run-orchestration-plan.md
Normal file
@@ -0,0 +1,141 @@
|
||||
# Issue Run Orchestration Plan
|
||||
|
||||
## Context
|
||||
|
||||
We observed cascaded wakeups on a single issue (for example PAP-39) that produced multiple runs at once:
|
||||
|
||||
- assignee self-wake from `issue_commented`
|
||||
- mention wake to manager/CTO from `issue_comment_mentioned`
|
||||
- overlapping runs on the same issue
|
||||
|
||||
Current behavior is run-centric and agent-centric. It coalesces per-agent+task in `heartbeat.wakeup`, but does not enforce a single active execution slot per issue across all agents.
|
||||
|
||||
## What We Know Today
|
||||
|
||||
- The only reliable issue/run linkage today is derived from `heartbeat_runs.context_snapshot.issueId` with run status `queued` or `running`.
|
||||
- `checkoutRunId` on issues is a work-ownership lock, not an orchestration lock.
|
||||
- Wakeups are created from multiple routes (`issues`, `approvals`, `agents`) and all funnel through `heartbeat.wakeup`.
|
||||
|
||||
## Goals
|
||||
|
||||
1. Prevent self-wake cascades for the same issue when the target agent has the same normalized name as the currently active issue runner.
|
||||
2. Allow cross-agent wake requests, but do not run them until the current issue runner exits.
|
||||
3. Guarantee at most one active (queued or running) execution owner per issue at a time.
|
||||
4. Keep this enforcement centralized in orchestration (not prompt/skill rules).
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Replacing checkout semantics for code-change ownership.
|
||||
- Changing manager escalation policy itself.
|
||||
- Enforcing uniqueness of agent names globally (handled as a separate governance decision).
|
||||
|
||||
## Proposed Model
|
||||
|
||||
Use an explicit issue-level orchestration lock on `issues`.
|
||||
|
||||
### New Issue Properties
|
||||
|
||||
- `executionRunId: uuid | null` (FK to `heartbeat_runs.id`, `ON DELETE SET NULL`)
|
||||
- `executionAgentNameKey: text | null` (normalized lowercase/trimmed agent name)
|
||||
- `executionLockedAt: timestamptz | null`
|
||||
|
||||
`executionRunId` is the canonical “who currently owns orchestration for this issue” field.
|
||||
|
||||
## Orchestration Rules
|
||||
|
||||
### Rule A: No Self-Wake by Same Agent Name
|
||||
|
||||
If a wakeup is issue-scoped and `issues.executionRunId` points to an active run whose `executionAgentNameKey` matches the waking agent name key:
|
||||
|
||||
- do not create a new heartbeat run
|
||||
- write wakeup request as `coalesced` with reason `issue_execution_same_name`
|
||||
- return existing run reference
|
||||
|
||||
### Rule B: Different Name May Wake, But Waits
|
||||
|
||||
If an issue has an active execution lock held by a different agent-name key:
|
||||
|
||||
- accept the wake request
|
||||
- persist request as deferred (new wakeup status `deferred_issue_execution`)
|
||||
- do not create a run yet
|
||||
|
||||
When the active issue run finishes, promote the oldest deferred request for that issue into a queued run and transfer `executionRunId`.
|
||||
|
||||
### Rule C: One Active Execution Owner Per Issue
|
||||
|
||||
For issue-scoped wakeups, run creation is done only while holding a transaction lock on the issue row. This ensures only one queued/running run can become owner at a time.
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
## Phase 1: Schema + Shared Contracts
|
||||
|
||||
1. Add issue columns: `execution_run_id`, `execution_agent_name_key`, `execution_locked_at`.
|
||||
2. Extend shared `Issue` type in `packages/shared/src/types/issue.ts`.
|
||||
3. Add migration and export updates.
|
||||
|
||||
## Phase 2: Centralize Issue Execution Gate in `heartbeat.wakeup`
|
||||
|
||||
1. In `enqueueWakeup`, derive `issueId` from context/payload as today.
|
||||
2. If no `issueId`, keep existing behavior.
|
||||
3. If `issueId` exists:
|
||||
- transaction + `SELECT ... FOR UPDATE` on issue row
|
||||
- resolve/repair stale `executionRunId` (if referenced run is not `queued|running`, clear lock)
|
||||
- apply Rule A/Rule B/Rule C
|
||||
4. Name normalization helper:
|
||||
- `agentNameKey = agent.name.trim().toLowerCase()`
|
||||
|
||||
## Phase 3: Deferred Queue Promotion on Run Finalization
|
||||
|
||||
1. On run terminal states (`succeeded`, `failed`, `cancelled`, orphan reaped):
|
||||
- if run owns `issues.executionRunId`, clear issue lock
|
||||
- promote oldest deferred issue wakeup to queued run
|
||||
- set issue lock to the promoted run
|
||||
- trigger `startNextQueuedRunForAgent(promotedAgentId)`
|
||||
|
||||
## Phase 4: Route Hygiene (“Apply Everywhere”)
|
||||
|
||||
1. Keep route-side wakeup dedupe by agent id, but rely on heartbeat gate as source of truth.
|
||||
2. Ensure all issue-related wakeup calls include `issueId` in payload/context snapshot.
|
||||
3. Add explicit reason codes so logs make suppression/deferral obvious.
|
||||
|
||||
## Phase 5: Tests
|
||||
|
||||
1. Unit tests for `heartbeat.wakeup`:
|
||||
- same-name self-wake suppressed
|
||||
- different-name wake deferred
|
||||
- lock released and deferred wake promoted on owner completion
|
||||
- stale lock recovery
|
||||
2. Integration tests:
|
||||
- comment with `@CTO` during active assignee run does not create concurrent active run
|
||||
- only one active owner per issue at any time
|
||||
3. Regression tests:
|
||||
- non-issue wakeups unchanged
|
||||
- existing assignment/timer behavior unchanged for tasks without issue context
|
||||
|
||||
## Telemetry + Debuggability
|
||||
|
||||
- Add structured reasons in `agent_wakeup_requests.reason`:
|
||||
- `issue_execution_same_name`
|
||||
- `issue_execution_deferred`
|
||||
- `issue_execution_promoted`
|
||||
- Add activity log details for lock transfer events:
|
||||
- from run id / to run id / issue id / agent name key
|
||||
|
||||
## Rollout Strategy
|
||||
|
||||
1. Ship schema + feature flag (`ISSUE_EXECUTION_LOCK_ENABLED`) default off.
|
||||
2. Enable in dev and verify PAP-39 style scenarios.
|
||||
3. Enable in staging with high log verbosity.
|
||||
4. Enable by default after stable run.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
1. A single issue never has more than one active execution owner run (`queued|running`) at once.
|
||||
2. Same-name self-wakes for the same issue are suppressed, not spawned.
|
||||
3. Different-name wakeups are accepted but deferred until issue execution lock is released.
|
||||
4. Mentioning CTO during an active issue run does not start CTO concurrently on that issue.
|
||||
5. Parallelism remains available via separate issues/subissues.
|
||||
|
||||
## Follow-Up (Separate but Related)
|
||||
|
||||
Checkout conflict logic should be corrected independently so assignees with `checkoutRunId = null` can acquire checkout by current run id without false 409 loops.
|
||||
Reference in New Issue
Block a user