
periodically re-evaluate all jobs

Open tgross opened this issue 2 years ago • 0 comments

Large clusters regularly have allocations and deployments in various failure states. Because evaluations are idempotent over the state, re-running an evaluation for a service or system job should be safe at any time. If many allocations or deployments are left unresolved in failed states, then an allocation or client update that unblocks evaluations waiting on queued-allocs can cause a large number of evaluations to be processed concurrently, and the reconciler then resolves all of the failed allocations at once. This can be surprising for operators.

We should consider having a periodic process that re-evaluates service and system jobs on the cluster so that these failed states can be resolved gradually but frequently. One possibility is to tune the rate of re-evaluations to the number of jobs on the cluster so that we're re-evaluating the cluster state at least once a day.
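To make the rate-tuning idea concrete, here is a minimal sketch of how the spacing between re-evaluations could be derived from the job count so that every job is visited once per day. This is not Nomad code: `reeval_interval`, `schedule`, and the one-day default `period` are hypothetical names and values chosen for illustration only.

```python
from datetime import timedelta


def reeval_interval(num_jobs: int, period: timedelta = timedelta(days=1)) -> timedelta:
    """Return the gap between consecutive re-evaluations.

    Hypothetical helper: dividing the period by the job count means a full
    pass over all jobs takes exactly one period.
    """
    if num_jobs <= 0:
        # Nothing to re-evaluate; fall back to checking again after one period.
        return period
    return period / num_jobs


def schedule(job_ids, period: timedelta = timedelta(days=1)):
    """Yield (offset, job_id) pairs that spread jobs evenly across the period.

    Jobs are sorted so the order is stable from one pass to the next.
    """
    job_ids = sorted(job_ids)
    interval = reeval_interval(len(job_ids), period)
    for i, job_id in enumerate(job_ids):
        yield i * interval, job_id


if __name__ == "__main__":
    # Example job names are made up for this sketch.
    for offset, job_id in schedule(["web", "cache", "batch-ingest"]):
        print(f"{offset} -> re-evaluate {job_id}")
```

With this approach the re-evaluation rate scales with cluster size: a cluster with 24 jobs would re-evaluate one job per hour, while a cluster with thousands of jobs would re-evaluate one every few seconds. A real implementation would likely also want an upper bound on the rate, which this sketch leaves out.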

tgross · Dec 16 '22 16:12