nomad disable scheduling until initial snapshot is restored

Open tgross opened this issue 2 years ago • 0 comments

When new servers join the cluster, they stream a Raft snapshot from the existing servers to catch up on replication. But many other operations spin up concurrently, including scheduling.

Nomad scheduler workers start immediately on server start. When a scheduler dequeues an evaluation, the leader provides a minimum snapshot index so that the scheduler waits until its in-memory state is at least as current as that index. But the plan applier does not check the index again on plan submit, so if there were a bug in the scheduler's waiting logic, the scheduler could submit stale plans that stop all allocs, and the plan applier would accept them because they "fit" on the current cluster. Even without bugs, this creates a window where evaluations are dequeued but can't be planned, so those evaluations are delayed.

This especially impacts organizations with large clusters, where the snapshot takes on the order of minutes to completely restore. In https://github.com/hashicorp/nomad/pull/15523 we're backing off scheduling if we determine we're behind, and in https://github.com/hashicorp/nomad/pull/15522 we provide tunables that can help cluster administrators ensure the snapshots go smoothly. But we could potentially tighten this behavior by disabling scheduling entirely on the new server until it's ready to successfully do work. This is slightly complicated by bootstrapping and may need https://github.com/hashicorp/nomad/issues/13219 to be completed first. I'm opening this issue for further discussion among the team (and community!).
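One possible shape for "disable scheduling until ready" is a one-shot gate that scheduler workers block on. The Go sketch below is only an illustration under assumed names (`schedulerGate`, `MarkRestored`, `runWorker` are all hypothetical and not part of Nomad), and it deliberately leaves open how the gate should be opened on a freshly bootstrapped cluster.

```go
package main

import (
	"fmt"
	"sync"
)

// schedulerGate is a hypothetical one-shot gate that stays closed until the
// server reports that its initial snapshot restore has finished.
type schedulerGate struct {
	once  sync.Once
	ready chan struct{}
}

func newSchedulerGate() *schedulerGate {
	return &schedulerGate{ready: make(chan struct{})}
}

// MarkRestored opens the gate. It is safe to call more than once.
func (g *schedulerGate) MarkRestored() {
	g.once.Do(func() { close(g.ready) })
}

// Ready returns a channel that is closed once the gate is open.
func (g *schedulerGate) Ready() <-chan struct{} { return g.ready }

// IsReady reports whether the gate is open, without blocking.
func (g *schedulerGate) IsReady() bool {
	select {
	case <-g.ready:
		return true
	default:
		return false
	}
}

// runWorker models a scheduler worker: it dequeues nothing until the gate
// opens, then processes every evaluation ID and closes out when done.
func runWorker(g *schedulerGate, evals <-chan string, out chan<- string) {
	<-g.Ready()
	for e := range evals {
		out <- fmt.Sprintf("processed %s", e)
	}
	close(out)
}

func main() {
	gate := newSchedulerGate()
	evals := make(chan string, 2)
	out := make(chan string, 2)
	evals <- "eval-1"
	evals <- "eval-2"
	close(evals)

	go runWorker(gate, evals, out)
	fmt.Println("ready before restore:", gate.IsReady())

	gate.MarkRestored() // the snapshot restore would call this when it completes
	for line := range out {
		fmt.Println(line)
	}
}
```

Because the worker never dequeues while the gate is closed, evaluations stay available to servers that are already caught up instead of stalling on the new one.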

Dec 16 '22 15:12 tgross
