fix(heartbeat): prevent false process_lost failures on queued and non-child-process runs

- reapOrphanedRuns() now only scans running runs; queued runs are
  legitimately absent from runningProcesses (waiting on concurrency
  limits or issue locks) so including them caused false process_lost
  failures (closes #90)
- Add module-level activeRunExecutions set so non-child-process adapters
  (http, openclaw) are protected from the reaper during execution
- Add resumeQueuedRuns() to restart persisted queued runs after a server
  restart, called at startup and each periodic tick
- Add outer catch in executeRun() so setup failures (ensureRuntimeState,
  resolveWorkspaceForRun, etc.) are recorded as failed runs instead of
  leaving them stuck in running state
- Guard resumeQueuedRuns() against paused/terminated/pending_approval agents
- Increase opencode models discovery timeout from 20s to 45s
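
A minimal sketch of the reaper selection rule described in the first two bullets, not the actual implementation. The `RunRecord` shape, the `findOrphanedRuns` helper, and the way the staleness threshold is compared against `updatedAt` are assumptions made for illustration. Only the names `runningProcesses` and `activeRunExecutions` come from this commit.

```typescript
// Hypothetical run shape; the real heartbeat service types are not shown here.
type RunStatus = "queued" | "running" | "succeeded" | "failed";

interface RunRecord {
  id: string;
  status: RunStatus;
  updatedAt: Date;
}

// Runs backed by a child process, keyed by run id.
const runningProcesses = new Map<string, unknown>();
// Runs executing in-process (e.g. an http adapter) with no child process to track.
const activeRunExecutions = new Set<string>();

// Returns the runs the reaper would treat as orphaned (assumed logic).
function findOrphanedRuns(
  runs: RunRecord[],
  now: Date,
  staleThresholdMs = 0,
): RunRecord[] {
  return runs.filter((run) => {
    // Queued runs are never in runningProcesses, so they must not be scanned.
    if (run.status !== "running") return false;
    // A tracked child process means the run is still alive.
    if (runningProcesses.has(run.id)) return false;
    // Non-child-process adapters are protected while they execute.
    if (activeRunExecutions.has(run.id)) return false;
    // Assumed staleness check: skip runs updated more recently than the threshold.
    return now.getTime() - run.updatedAt.getTime() >= staleThresholdMs;
  });
}
```

With this rule, a queued run or a run held in `activeRunExecutions` is never selected, so neither can be marked process_lost by the reaper.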
Dotta
2026-03-07 12:37:15 -05:00
committed by Michael Averto
parent d14e656ec1
commit f81d37fbf7
4 changed files with 52 additions and 14 deletions

@@ -513,11 +513,14 @@ export async function startServer(): Promise<StartedServer> {
   if (config.heartbeatSchedulerEnabled) {
     const heartbeat = heartbeatService(db as any);
-    // Reap orphaned runs at startup (no threshold -- runningProcesses is empty)
-    void heartbeat.reapOrphanedRuns().catch((err) => {
-      logger.error({ err }, "startup reap of orphaned heartbeat runs failed");
-    });
+    // Reap orphaned running runs at startup while in-memory execution state is empty,
+    // then resume any persisted queued runs that were waiting on the previous process.
+    void heartbeat
+      .reapOrphanedRuns()
+      .then(() => heartbeat.resumeQueuedRuns())
+      .catch((err) => {
+        logger.error({ err }, "startup heartbeat recovery failed");
+      });
     setInterval(() => {
       void heartbeat
         .tickTimers(new Date())
@@ -530,11 +533,13 @@ export async function startServer(): Promise<StartedServer> {
           logger.error({ err }, "heartbeat timer tick failed");
         });
-      // Periodically reap orphaned runs (5-min staleness threshold)
+      // Periodically reap orphaned runs (5-min staleness threshold) and make sure
+      // persisted queued work is still being driven forward.
       void heartbeat
         .reapOrphanedRuns({ staleThresholdMs: 5 * 60 * 1000 })
+        .then(() => heartbeat.resumeQueuedRuns())
         .catch((err) => {
-          logger.error({ err }, "periodic reap of orphaned heartbeat runs failed");
+          logger.error({ err }, "periodic heartbeat recovery failed");
         });
     }, config.heartbeatSchedulerIntervalMs);
   }
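
The reap-then-resume chaining in both hunks can be sketched in isolation as below. This is an illustration only: `HeartbeatLike` and `runRecovery` are hypothetical names, and the stubbed service stands in for the real `heartbeatService`. It shows that a single trailing `.catch` covers both steps, and that a failed reap skips the resume step for that pass.

```typescript
// Hypothetical minimal interface covering only the two methods used in the diff.
interface HeartbeatLike {
  reapOrphanedRuns(opts?: { staleThresholdMs?: number }): Promise<unknown>;
  resumeQueuedRuns(): Promise<unknown>;
}

// Runs one recovery pass: reap first, then resume queued runs.
// Any rejection from either step is routed to onError exactly once.
async function runRecovery(
  heartbeat: HeartbeatLike,
  onError: (err: unknown) => void,
  staleThresholdMs?: number,
): Promise<void> {
  await heartbeat
    .reapOrphanedRuns(
      staleThresholdMs === undefined ? undefined : { staleThresholdMs },
    )
    .then(() => heartbeat.resumeQueuedRuns())
    .catch(onError);
}
```

One consequence of this shape is that a reap failure delays queued-run resumption until the next tick. Whether that trade-off is intended is not stated in the commit.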