extend quotas to include device plugins such as GPUs
The existing resource quotas cover entities such as CPU (MHz) and memory (MB), but device plugins such as GPUs are currently not supported.
This is a real problem: a single job can essentially consume all available GPU resources, leading to cluster-wide starvation of a critical resource.
GPU resources are already being fingerprinted:
$ nomad node status -verbose 429c30c7
...
Device Group Attributes
Device Group     = nvidia/gpu/Tesla T4
bar1             = 256 MiB
cores_clock      = 1590 MHz
display_state    = Enabled
driver_version   = 460.32.03
memory_clock     = 5001 MHz
memory           = 15109 MiB
pci_bandwidth    = 15760 MB/s
persistence_mode = Disabled
power            = 70 W
...
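For context, these fingerprinted attributes can already be consumed at the job level via device requests and constraints - so the building blocks exist, just not on the quota side. A sketch (the job/task names, count, and memory threshold are purely illustrative):

```hcl
job "train" {
  group "gpu" {
    task "model" {
      driver = "docker"

      resources {
        # Request one NVIDIA GPU, constrained on one of the
        # fingerprinted device attributes shown above.
        device "nvidia/gpu" {
          count = 1

          constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "10 GiB"
          }
        }
      }
    }
  }
}
```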
The ideal solution would be to extend the fingerprinting to include the number of GPU cores and expose those for use in resource quotas. This would bring GPU quotas on par with the existing CPU and memory quotas (GPU resources could also be exposed as MHz instead of core count, making them identical to CPU resources).
For example:

limit {
  region = "global"
  region_limit {
    cpu    = 2500
    memory = 1000
    nvidia {
      cores  = 2560  # number of CUDA cores
      memory = 15000 # MB of GPU memory
    }
  }
}
But simply being able to restrict the number of devices using a resource quota would also be acceptable for now. For example:

limit {
  region = "global"
  region_limit {
    cpu    = 2500
    memory = 1000
    nvidia {
      devices = 2 # number of GPU devices
    }
  }
}
Hi @henrikjohansen, and thanks for this suggestion! I've slightly edited the title just so that we don't confuse the ENT quotas feature for the resource block. There's some interesting trickiness to this because devices are plugins, so the resources they expose are fairly arbitrary from the perspective of the scheduler. Should be interesting to figure out!
@tgross :+1: Well, your comment is precisely why I included the last example above :point_up:. A simple count of device instances is probably more realistic than exposing fine-grained resources for all relevant device types? :thinking:
@tgross This is becoming an increasingly prevalent issue for us, since we have no way to control the utilization of GPU resources among our tenants, leading to all sorts of problems.
I would :heart: to see the resource quotas feature of Nomad Enterprise enhanced so that the number of available GPUs could be limited per namespace :point_down:
limit {
  region = "global"
  region_limit {
    cpu    = 2500
    memory = 1000
    nvidia {
      devices = 2 # number of GPU devices
    }
  }
}
@tgross Just a friendly reminder that this has now grown into a major problem for us. As a Nomad Enterprise customer I am somewhat disappointed that we cannot guard our rarest and most expensive resource using quotas.
Hey @henrikjohansen it's great that you have this issue open for us engineers to track and discuss feasibility. You may want to escalate with your account rep if you want to put some fire under it in terms of prioritization.
This issue is very relevant for us. Is there a way we could contribute, since it's an Enterprise feature?
... just my yearly reminder that we are still patiently waiting for this.
@henrikjohansen again, as an Enterprise customer the best way to nudge on this is through your account manager, so that there's a formal internal feature request.
Hi @henrikjohansen, I've marked the issue as resolved since the changes have been merged into main. The feature will land in Nomad Enterprise 1.9.0, due to be released Oct 14th.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.