devices
Enable resource naming in config
Motivation
Volcano v1.9.0 introduces capacity scheduling, which makes it possible to give a queue separate quotas for different GPU types (important in production environments). For example:
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  reclaimable: true
  deserved: # set the deserved field.
    cpu: 2
    memory: 8Gi
    nvidia.com/t4: 40
    nvidia.com/a100: 20
However, the default NVIDIA device plugin reports every GPU as nvidia.com/gpu, so it cannot report different GPU models as separate resources, as the example above requires.
To address this, we need to customize the device plugin.
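For illustration, once a customized plugin advertises per-model resource names, a workload submitted to queue1 could request a specific model roughly as follows. This is only a sketch: the job name, task name, image, and command are placeholders that do not come from this PR, and it assumes the nodes actually advertise nvidia.com/a100.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: a100-demo            # placeholder name
spec:
  schedulerName: volcano
  queue: queue1              # the queue from the example above
  minAvailable: 1
  tasks:
    - name: worker           # placeholder task name
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cuda
              image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
              command: ["nvidia-smi"]
              resources:
                limits:
                  nvidia.com/a100: 1   # counted against the queue's deserved nvidia.com/a100 quota
```

With the default plugin this request would stay pending, since no node would expose a resource named nvidia.com/a100.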
Change Details
The NVIDIA community has already discussed this issue:
- Issue: Advertising specific GPU types as separate extended resource · Issue #424 · NVIDIA/k8s-device-plugin
- Docs: [External]Custom Resource Naming and Supporting Multiple GPU SKUs on a Single Node in Kubernetes
- Code: k8s-device-plugin/cmd/nvidia-device-plugin/main.go at eb8fd565c3df0caca59bf0ff2ae918e647f46af3 · NVIDIA/k8s-device-plugin
The changes in this PR are based on the discussion above.
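As a rough idea of what resource naming in the plugin config could look like, the sketch below follows the `resources` section described in the upstream design document. It is an assumption that this build honours the section in exactly this form; the patterns are hypothetical and would need to match the product names that the GPUs on your nodes actually report.

```yaml
version: v1
flags:
  migStrategy: none
resources:
  gpus:
    # Each entry maps GPUs whose product name matches `pattern`
    # to an extended resource name (assumed to be exposed under nvidia.com/).
    - pattern: "Tesla T4"    # hypothetical product-name pattern
      name: t4               # intended to surface as nvidia.com/t4
    - pattern: "*A100*"      # hypothetical wildcard pattern
      name: a100             # intended to surface as nvidia.com/a100
```

After deploying such a config, checking the node's allocatable resources (for example with kubectl describe node) should show whether the renamed resources are actually advertised.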
Further Impact
Renaming GPU resources prevents the DCGM Exporter from reporting pod-level GPU usage, because the DCGM Exporter only recognizes resources named exactly nvidia.com/gpu or prefixed with nvidia.com/mig-.