
Add option to limit the number of concurrently starting allocations per client

kemko opened this issue 9 months ago • 3 comments

Proposal

Add a configuration option to limit the number of allocations that can be started concurrently on a Nomad client. "Starting" in this context includes pulling container images and launching the container, but not waiting for health checks to pass. Once a container is running (even if health checks are still pending), the next allocation should be allowed to start.

Use-cases

When a node is undrained or restarted, Nomad may attempt to start a large number of allocations at once. This can overwhelm the node's disk and network subsystems, especially if many container images need to be pulled simultaneously. Limiting the number of concurrently starting allocations would help to:

  • Prevent resource exhaustion (disk, network, CPU) during mass allocation startups.
  • Smooth out the load on the node and the container registry.
  • Reduce the risk of failed allocations due to resource contention.
  • Provide more predictable and stable node recovery after undrain or restart events.

Attempted Solutions

  • Resource limits (CPU, memory, disk) per allocation do not prevent Nomad from starting many allocations at once, as long as the total requested resources fit within the node's capacity.
  • There is no existing configuration in Nomad or the Docker driver to limit the number of allocations being started concurrently.
  • Docker itself limits concurrent layer downloads within a single pull (the `max-concurrent-downloads` daemon setting, default 3), but not the number of images being pulled at once.
  • Workarounds such as scripting undrain events or using external controllers are fragile and not integrated with Nomad's scheduling logic.

A built-in, configurable limit would provide a robust and user-friendly solution to this problem.

kemko · May 19 '25 08:05

Thanks for the detailed proposal @kemko! Definitely sounds like something worth solving.

The solutions I can think of in rough order of complexity:

Client side limiting

Adding a (configurable) limit on how many allocations may start at once is the simplest solution and the easiest to reason about. It's a bit odd that we carefully limit parallel allocation garbage collection but don't have a similar limit for allocation creation.

Scheduler limiting

While it's possible to adjust the scheduler's scoring algorithm to prefer nodes with fewer pending/starting allocations, I think this solution would run into some difficulties:

  1. Scheduling is optimistically concurrent, so either we would have to validate the starting limit in the plan applier or risk the behavior not actually working when you need it most: during periods of heavy concurrent scheduling! Implementing the check in the plan applier is absolutely feasible, but it would increase the likelihood of rejecting plans, again, at the worst possible time: periods of heavy concurrent scheduling! Not a great performance story.
  2. Should it be a hard limit on starting allocs or a new score? A hard limit risks dramatically slowing down scheduling at the worst possible time: recovering from an outage and trying to reschedule everything ASAP. Therefore a new score is preferable: Nomad will prefer nodes with fewer starting allocations, but in the outage recovery scenario it will never slow down scheduling. However, exactly how do we calculate the new score? Not an insurmountable problem, but getting scoring even slightly wrong can cause bugs that can easily lurk for years.

Artifact scheduling

Making Nomad's scheduler image- and/or artifact-aware would be an interesting approach, as you call out. We've always wanted something like this, but as your "fragile" comment alludes to, it's a hard problem.

schmichael · May 21 '25 23:05

One of the nice things about doing it client-side is that the decision-making is local to the problem. If a node is just having trouble getting allocs started in a timely fashion (e.g. increased network latency to a remote registry), other nodes aren't impacted.

But that has two other tradeoffs:

  • Client-side behavior is less observable because it won't show up in the state store. Maybe we could improve on this by providing metrics for placements delayed by this rate limit?
  • Because of bin-packing, the scheduler is more likely to place more work on a node that's already rate-limiting placements. So maybe we want the client to be able to let the scheduler know that it's in a rate-limited state if that persists for more than a certain period of time? That could be used to temporarily down-rank the node's score without running into the problem of coordinating a hard limit across concurrently scheduled jobs.

tgross · May 22 '25 12:05

@tgross @schmichael Hi! I just wanted to point out that the problem described has surfaced in our environments three more times since the issue was created.

kemko · Oct 06 '25 12:10