Enable resource naming in config

Open MondayCha opened this issue 6 months ago • 5 comments

Motivation

Volcano v1.9.0 introduces capacity scheduling, which lets a queue declare a separate deserved quota for each GPU model (important in production environments). For example:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  reclaimable: true
  deserved: # set the deserved field.
    cpu: 2
    memory: 8Gi
    nvidia.com/t4: 40
    nvidia.com/a100: 20

However, the default NVIDIA device plugin advertises every GPU as nvidia.com/gpu, so it cannot report different GPU models under distinct resource names as the example requires.

To address this, we need to customize the device plugin.
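
For illustration, here is a minimal sketch of how a workload might consume such a quota once a customized plugin advertises model-specific resource names. The job name, task name, and image are hypothetical placeholders, and the sketch assumes the nodes actually expose nvidia.com/t4 as an extended resource:

```yaml
# Sketch only: assumes a customized device plugin advertises nvidia.com/t4.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: t4-example            # hypothetical name
spec:
  schedulerName: volcano
  queue: queue1               # the queue defined in the example above
  minAvailable: 1
  tasks:
    - name: worker            # hypothetical task name
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: example.com/cuda-app:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/t4: 1   # counted against the queue's nvidia.com/t4 quota
```

With the stock plugin the same pod would have to request nvidia.com/gpu, and the scheduler could not tell T4 usage from A100 usage.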

Change Details

The NVIDIA community has already had discussions about this issue:

The changes in this PR follow the above discussion.

Further Impact

Renaming GPU resources prevents the DCGM Exporter from collecting pod-level GPU usage metrics, because the exporter only matches resources named exactly nvidia.com/gpu or prefixed with nvidia.com/mig-.

MondayCha, Aug 01 '24 09:08