fix(heartbeat): prevent false process_lost failures on queued and non-child-process runs
- reapOrphanedRuns() now only scans running runs; queued runs are legitimately
  absent from runningProcesses (waiting on concurrency limits or issue locks),
  so including them caused false process_lost failures (closes #90)
- Add a module-level activeRunExecutions set so non-child-process adapters
  (http, openclaw) are protected from the reaper during execution
- Add resumeQueuedRuns() to restart persisted queued runs after a server
  restart, called at startup and on each periodic tick
- Add an outer catch in executeRun() so setup failures (ensureRuntimeState,
  resolveWorkspaceForRun, etc.) are recorded as failed runs instead of being
  left stuck in the running state
- Guard resumeQueuedRuns() against paused/terminated/pending_approval agents
- Increase the opencode models discovery timeout from 20s to 45s
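The queued-vs-running distinction in the first bullet can be sketched roughly as follows. Only reapOrphanedRuns, runningProcesses, and activeRunExecutions are names taken from this commit; the Run shape, findOrphanedRuns, and the in-memory maps are illustrative assumptions, not the real service code.

```typescript
// Minimal sketch of the reaper's orphan test, under assumed data shapes.
type Run = { id: string; status: "queued" | "running" | "failed" };

// Module-level set (assumed shape): adapters that spawn no child process
// (http, openclaw) register the run id here for the duration of execution.
const activeRunExecutions = new Set<string>();

// Map of run id -> child process handle for child-process adapters (assumed).
const runningProcesses = new Map<string, unknown>();

function findOrphanedRuns(persistedRuns: Run[]): Run[] {
  return (
    persistedRuns
      // Only "running" runs can be orphans; queued runs are legitimately
      // absent from runningProcesses (concurrency limits, issue locks),
      // so scanning them produced the false process_lost failures.
      .filter((run) => run.status === "running")
      // A running run is orphaned only if no in-memory execution state
      // (child process OR non-child-process execution) claims it.
      .filter(
        (run) =>
          !runningProcesses.has(run.id) && !activeRunExecutions.has(run.id),
      )
  );
}
```

With this shape, a queued run never reaches the orphan check at all, and a non-child-process run is protected as long as its id sits in activeRunExecutions.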
@@ -513,11 +513,14 @@ export async function startServer(): Promise<StartedServer> {
   if (config.heartbeatSchedulerEnabled) {
     const heartbeat = heartbeatService(db as any);
 
-    // Reap orphaned runs at startup (no threshold -- runningProcesses is empty)
-    void heartbeat.reapOrphanedRuns().catch((err) => {
-      logger.error({ err }, "startup reap of orphaned heartbeat runs failed");
-    });
+    // Reap orphaned running runs at startup while in-memory execution state is empty,
+    // then resume any persisted queued runs that were waiting on the previous process.
+    void heartbeat
+      .reapOrphanedRuns()
+      .then(() => heartbeat.resumeQueuedRuns())
+      .catch((err) => {
+        logger.error({ err }, "startup heartbeat recovery failed");
+      });
     setInterval(() => {
       void heartbeat
         .tickTimers(new Date())
@@ -530,11 +533,13 @@ export async function startServer(): Promise<StartedServer> {
           logger.error({ err }, "heartbeat timer tick failed");
         });
 
-      // Periodically reap orphaned runs (5-min staleness threshold)
+      // Periodically reap orphaned runs (5-min staleness threshold) and make sure
+      // persisted queued work is still being driven forward.
       void heartbeat
         .reapOrphanedRuns({ staleThresholdMs: 5 * 60 * 1000 })
+        .then(() => heartbeat.resumeQueuedRuns())
         .catch((err) => {
-          logger.error({ err }, "periodic reap of orphaned heartbeat runs failed");
+          logger.error({ err }, "periodic heartbeat recovery failed");
         });
     }, config.heartbeatSchedulerIntervalMs);
   }
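The outer catch in executeRun() described in the commit message might look roughly like the sketch below. Only executeRun and activeRunExecutions are names from this commit; runAdapter, markRunFailed, and runStatus are hypothetical stand-ins for the real setup/persistence helpers (ensureRuntimeState, resolveWorkspaceForRun, etc.).

```typescript
// Sketch only: runStatus stands in for the persisted run state, and
// runAdapter simulates a setup step that throws before any inner catch.
const activeRunExecutions = new Set<string>();
const runStatus = new Map<string, string>();

async function runAdapter(_runId: string): Promise<void> {
  // Simulate a setup failure (e.g. workspace resolution throwing).
  throw new Error("resolveWorkspaceForRun failed");
}

async function markRunFailed(runId: string, _err: unknown): Promise<void> {
  runStatus.set(runId, "failed");
}

async function executeRun(runId: string): Promise<void> {
  runStatus.set(runId, "running");
  // While the id is in this set, the reaper will not treat the run as
  // orphaned, even though no child process is registered for it.
  activeRunExecutions.add(runId);
  try {
    await runAdapter(runId);
  } catch (err) {
    // Previously a setup failure escaped here and the run stayed
    // "running" forever; now it is recorded as failed.
    await markRunFailed(runId, err);
  } finally {
    activeRunExecutions.delete(runId);
  }
}
```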