extend quotas to include device plugins such as GPUs
The existing resource quotas cover entities such as CPU (MHz) and memory (MB), but device plugins such as GPUs are currently not supported.
This is a real problem: a single job can essentially consume all available GPU resources, leading to cluster-wide starvation of a critical resource.
GPU resources are already being fingerprinted:
$ nomad node status -verbose 429c30c7
...
Device Group Attributes
Device Group     = nvidia/gpu/Tesla T4
bar1             = 256 MiB
cores_clock      = 1590 MHz
display_state    = Enabled
driver_version   = 460.32.03
memory_clock     = 5001 MHz
memory           = 15109 MiB
pci_bandwidth    = 15760 MB/s
persistence_mode = Disabled
power            = 70 W
...
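For context, these fingerprinted attributes can already be consumed at the job level via device requests and constraints - so the building blocks exist, just not on the quota side. A sketch (the job/task names, count, and memory threshold are purely illustrative):

```hcl
job "train" {
  group "gpu" {
    task "model" {
      driver = "docker"

      resources {
        # Request one NVIDIA GPU, constrained on one of the
        # fingerprinted device attributes shown above.
        device "nvidia/gpu" {
          count = 1

          constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "10 GiB"
          }
        }
      }
    }
  }
}
```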
The ideal solution would be to extend the fingerprinting to include the number of GPU cores and expose those for use in resource quotas. This would bring GPU quotas on par with the existing CPU and memory quotas (GPU resources could also be exposed as MHz instead of core count, making them identical to CPU resources).
For example:

limit {
  region = "global"
  region_limit {
    cpu    = 2500
    memory = 1000
    nvidia {
      cores  = 2560  # number of CUDA cores
      memory = 15000 # MB of GPU memory
    }
  }
}
But simply being able to restrict the number of devices using a resource quota would also be acceptable for now. For example:

limit {
  region = "global"
  region_limit {
    cpu    = 2500
    memory = 1000
    nvidia {
      devices = 2 # number of GPU devices
    }
  }
}
Hi @henrikjohansen, and thanks for this suggestion! I've slightly edited the title just so that we don't confuse the ENT quotas feature for the resource block. There's some interesting trickiness to this because devices are plugins, so the resources they expose are fairly arbitrary from the perspective of the scheduler. Should be interesting to figure out!
@tgross :+1: Well, your comment is precisely why I included the last example above :point_up:. A simple count of device instances is probably more realistic than exposing fine-grained resources for all relevant device types? :thinking:
@tgross This is becoming an increasingly prevalent issue for us, since we have no way to control the utilization of GPU resources among our tenants, leading to all sorts of problems.
I would :heart: to see the resource quotas feature of Nomad Enterprise enhanced so that the number of available GPUs could be limited per namespace :point_down:
limit {
  region = "global"
  region_limit {
    cpu    = 2500
    memory = 1000
    nvidia {
      devices = 2 # number of GPU devices
    }
  }
}
@tgross Just a friendly reminder that this has now grown into a major problem for us. As a Nomad Enterprise customer I am somewhat disappointed that we cannot guard our rarest and most expensive resource using quotas.
Hey @henrikjohansen it's great that you have this issue open for us engineers to track and discuss feasibility. You may want to escalate with your account rep if you want to put some fire under it in terms of prioritization.
This issue is very relevant for us. Is there a way we could contribute, since it's an Enterprise feature?
... just my yearly reminder that we are still patiently waiting for this.
@henrikjohansen again, as an Enterprise customer the best way to nudge on this is through your account manager, so that there's a formal internal feature request.
Hi @henrikjohansen, I've marked the issue as resolved since the changes have been merged into main. The feature will land in Nomad Enterprise 1.9.0, due to be released Oct 14th.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.