niclydon.dev
OperatorForge

Stabilizing the Job Scheduler: Closing the 1.5x Drift Gap

Diagnosing why recurring workers are running on a 36-hour cycle instead of 24, and implementing a self-chaining fix.

Today's build log centers on a critical infrastructure bug in our background job scheduler. For several weeks, scheduled maintenance tasks such as metadata inference and health digests had been running significantly later than intended. On investigation, we found that workers registered as guaranteed recurring jobs with a 24-hour cadence were actually executing every 36 hours. The drift was not a bug in the jobs themselves but a consequence of the fallback mechanism in the job worker.

The root cause lies in how the system handles jobs that lack explicit self-chaining logic. When a worker finishes without scheduling its own next run, the system falls back to a guaranteed-recurrence backstop. This backstop deliberately applies a 1.5x multiplier to the cadence to prevent race conditions between healthy self-chains and the fallback runner. Because a set of project-manager workers was not calling the chain helper, those workers were inadvertently subject to this grace period, which turned a 24-hour task into a 36-hour one (24 hours x 1.5).
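The arithmetic behind the drift can be sketched in a few lines. This is an illustration only: the function and constant names are hypothetical, and the only values taken from the investigation are the 1.5x multiplier and the 24-hour cadence.

```typescript
// Hypothetical names; only the 1.5x multiplier and 24-hour cadence come from the log above.
const BACKSTOP_MULTIPLIER = 1.5;
const HOUR_MS = 60 * 60 * 1000;

/**
 * Returns the interval a job effectively runs at.
 * A job that self-chains runs on its cadence; a job that relies on
 * the backstop waits cadence * multiplier before the fallback fires.
 */
function effectiveIntervalMs(cadenceMs: number, selfChains: boolean): number {
  return selfChains ? cadenceMs : cadenceMs * BACKSTOP_MULTIPLIER;
}

const dailyMs = 24 * HOUR_MS;
console.log(effectiveIntervalMs(dailyMs, true) / HOUR_MS);  // 24
console.log(effectiveIntervalMs(dailyMs, false) / HOUR_MS); // 36
```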

The Fix: Self-Chaining Logic

To resolve this, we are rolling out a chainNextRun helper function across the affected handlers. The helper explicitly inserts the successor job into the jobs table, with a uniqueness guard to prevent duplicate chains. Because each job now schedules its own successor, the backstop never has to fire for these workers, and the job runs on its defined cadence.
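A minimal sketch of the idea follows, using an in-memory stand-in for the jobs table. Only the name chainNextRun comes from this log; the row shape, the InMemoryJobsTable class, the choice of (job type, run time) as the uniqueness key, and the "metadata-inference" job type string are all assumptions made for illustration. A real implementation would more likely rely on a unique constraint in the database.

```typescript
// Sketch only: an in-memory stand-in for the jobs table. All field names are assumptions.
interface JobRow {
  jobType: string;
  runAt: number; // epoch milliseconds
  payload: { recurring: boolean };
}

class InMemoryJobsTable {
  private rows = new Map<string, JobRow>();

  /** Inserts the row unless one with the same (jobType, runAt) exists. Returns true if inserted. */
  insertIfAbsent(row: JobRow): boolean {
    const key = `${row.jobType}:${row.runAt}`;
    if (this.rows.has(key)) return false; // uniqueness guard: duplicate chain rejected
    this.rows.set(key, row);
    return true;
  }

  count(): number {
    return this.rows.size;
  }
}

/**
 * Queues the successor of a job exactly one cadence after the time the
 * current run was scheduled for. Returns false if the successor already exists.
 */
function chainNextRun(
  table: InMemoryJobsTable,
  jobType: string,
  scheduledAt: number,
  cadenceMs: number
): boolean {
  return table.insertIfAbsent({
    jobType,
    runAt: scheduledAt + cadenceMs,
    payload: { recurring: true },
  });
}

const DAY_MS = 24 * 60 * 60 * 1000;
const table = new InMemoryJobsTable();
console.log(chainNextRun(table, "metadata-inference", 0, DAY_MS)); // true
console.log(chainNextRun(table, "metadata-inference", 0, DAY_MS)); // false (guard rejects the duplicate)
console.log(table.count()); // 1
```

Computing the successor time from the scheduled time of the current run, rather than from the moment the handler happens to execute, is a design choice in this sketch: it keeps small execution delays from accumulating across runs.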

The implementation plan is to add a recurring flag to the payload interface and to invoke the new chain function near the top of the main handler logic. This pattern ensures that the next instance is queued by the running job itself, early in each run, rather than being left to the backstop, which keeps the timing on the defined cadence.
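The handler shape described here might look roughly like the sketch below. The recurring flag is from the plan above; the interface name, the handler signature, and the injected chain callback (standing in for the real chain helper) are hypothetical.

```typescript
// Hypothetical payload and handler shapes; only the `recurring` flag comes from the plan above.
interface RecurringPayload {
  recurring?: boolean;
}

// Stand-in for the real chain helper, injected so the sketch stays self-contained.
type ChainFn = (jobType: string) => void;

function handleJob(
  jobType: string,
  payload: RecurringPayload,
  chain: ChainFn,
  work: () => void
): void {
  // Chain near the top of the handler: in this sketch the successor is
  // queued before the main work starts, so a failure in the work below
  // does not prevent the next run from being scheduled.
  if (payload.recurring) {
    chain(jobType);
  }
  work();
}

const queued: string[] = [];
handleJob("queue-health-digest", { recurring: true }, (t) => { queued.push(t); }, () => {});
handleJob("one-off-task", {}, (t) => { queued.push(t); }, () => {});
console.log(queued.length); // 1
```

One-off jobs simply omit the flag and are never chained, which is why the flag is optional in this sketch.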

Affected Workers

We are rolling out this fix to ten workers that were exhibiting this drift, covering both 24-hour daily tasks and longer weekly cycles: metadata inference, incident index building, queue health digests, PMO scorecards, stale-lane scans, dependency and project-track leverage reports, the routing learner, decision-quality audits, deferred-decision review, and close-evidence extraction.

By closing this gap, we restore confidence in our automated maintenance schedules. The change is also a foundational improvement: any new recurring job we add can follow the same self-chaining pattern and avoid the drift.