nvidia.com/gpu.product and nvidia.com/gpu.replicas do not reflect heterogeneous device setups

Open Suckzoo opened this issue 3 years ago • 8 comments

Hello,

We're testing gpu-feature-discovery on our DGX machine.

The DGX machine has two types of GPU: one is "NVIDIA-DGX-Display" and the other is "NVIDIA A100-SXM4-80GB". Currently, the gpu.product and gpu.replicas node labels can hold information for only one GPU. We see the values of those two labels changing periodically, alternating between the two devices:

nvidia.com/gpu.product: NVIDIA-DGX-Display, nvidia.com/gpu.replicas: 1
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB, nvidia.com/gpu.replicas: 4

It looks like we need to introduce another label that can hold information about multiple GPU devices.
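
One possible shape for such a label set is sketched below. It is only an illustration: the key prefix `nvidia.com/gpu.count.` and the function `per_product_labels` are invented here and are not labels or code from gpu-feature-discovery, and the device list assumes one display device and four A100s.

```python
from collections import Counter


def per_product_labels(products):
    """Build one hypothetical count label per distinct GPU product."""
    # Spaces are replaced with "-" to mirror the label values quoted above.
    counts = Counter(name.replace(" ", "-") for name in products)
    # "nvidia.com/gpu.count.<product>" is an invented key, not a real GFD label.
    return {
        f"nvidia.com/gpu.count.{product}": str(count)
        for product, count in sorted(counts.items())
    }


# Assumed inventory: one display device and four A100s.
devices = ["NVIDIA-DGX-Display"] + ["NVIDIA A100-SXM4-80GB"] * 4
labels = per_product_labels(devices)
```

With one key per product, neither device type overwrites the other, which is the property the two existing labels lack.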

Suckzoo avatar Nov 15 '22 11:11 Suckzoo

We are aware of a similar, yet slightly different issue with GFD support on DGX-Station machines. Our plan for the next release is to completely filter out all DISPLAY devices, and only support COMPUTE devices in our enumeration of GPUs for both the device plugin and GFD. In the future, we may decide to support DISPLAY devices, but at that point they would show up as a different type of allocatable device (e.g. nvidia.com/display instead of nvidia.com/gpu), and the labels applied by GFD would reflect this similarly (i.e. nvidia.com/display.product and nvidia.com/display.replicas, etc.).
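
As a rough sketch of the split described above, assuming each device carries a `kind` of either "compute" or "display" (a made-up field; how the real enumeration tells the two apart is not stated here):

```python
def split_devices(devices):
    """Group (product, kind) pairs under the resource name each would use.

    `kind` is an assumed field with the values "compute" or "display".
    """
    compute = [product for product, kind in devices if kind == "compute"]
    display = [product for product, kind in devices if kind == "display"]
    # Only nvidia.com/gpu would be advertised in the next release;
    # nvidia.com/display is the possible future resource name.
    return {"nvidia.com/gpu": compute, "nvidia.com/display": display}


station = [("NVIDIA-DGX-Display", "display")] + [("NVIDIA A100-SXM4-80GB", "compute")] * 4
resources = split_devices(station)
```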

klueska avatar Nov 15 '22 12:11 klueska

@klueska Thanks for your quick response. One quick question: suppose a node has 2 RTX 2080s and 2 RTX 3090s (or any other models; the point is a machine equipped with two different GPU models, and I don't know whether that's a common setup). How would GFD work in such a situation?

Suckzoo avatar Nov 15 '22 12:11 Suckzoo

It only reports one of them at present: whichever one happens to show up as index 0 when calling into NVIDIA's NVML library.
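
A minimal sketch of that behavior, using a plain list in place of real NVML enumeration (the function name and the short product names are placeholders, not GFD code):

```python
def product_label_from_first(products):
    """Derive the product label from the device at index 0 only."""
    if not products:
        return {}
    # Every device after index 0 is ignored.
    return {"nvidia.com/gpu.product": products[0].replace(" ", "-")}


mixed = ["RTX 2080", "RTX 2080", "RTX 3090", "RTX 3090"]
label = product_label_from_first(mixed)
reordered = product_label_from_first(list(reversed(mixed)))
```

If the enumeration order differed between runs, the label would differ too. Whether that is what causes the alternating values reported above is not established in this thread.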

klueska avatar Nov 15 '22 12:11 klueska

I meant, GFD in the future. Sorry for the confusion.

Suckzoo avatar Nov 15 '22 12:11 Suckzoo

We added support about 6 months ago to allow such setups to be detected and to let users assign a different resource name to each of them (i.e. nvidia.com/rtx-2080 vs nvidia.com/rtx-3090), but it got reverted because our product team wasn't happy putting arbitrary resource naming in the hands of users.

klueska avatar Nov 15 '22 12:11 klueska

This is how it would have worked: https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit

klueska avatar Nov 15 '22 12:11 klueska

There is a KEP for dynamic resource allocation. That architecture allows a Pod to find a node where some suitable GPU exists, even where the node has multiple GPUs. Those GPUs can be fixed (even soldered in!); it doesn't have to be a hotplug scenario.

To me, that'd be the way forward for clusters where nodes have a mix of GPUs.

sftim avatar Mar 10 '23 12:03 sftim

Yes, that is the plan forward. The POC of our DRA resource driver for GPUs can be found here: https://gitlab.com/nvidia/cloud-native/k8s-dra-driver

It will soon include the notion of a deviceSelector in the GpuClaimParameters object so you can do things like:

apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: gpu-test
  name: a100
spec:
  count: 1
  selector:
    andExpression:
      - productName: "*A100*"
      - driverVersion:
          value: "460"
          operator: GreaterThan

or

apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: gpu-test
  name: t4
spec:
  count: 1
  selector:
    andExpression:
      - productName: "*T4*"
      - driverVersion:
          value: "460"
          operator: GreaterThan

etc.
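
To illustrate how a selector like the ones above might be evaluated, here is a hedged sketch. The matching semantics are assumptions, not the driver's implementation: `productName` is treated as a case-sensitive glob, and `driverVersion` with `GreaterThan` compares only the major version number. The `matches` function and the device dictionaries are made up for the example.

```python
from fnmatch import fnmatchcase


def matches(device, selector):
    """Return True if the device satisfies every clause in andExpression."""
    for clause in selector["andExpression"]:
        if "productName" in clause:
            # Assumed semantics: glob match on the product name.
            if not fnmatchcase(device["productName"], clause["productName"]):
                return False
        if "driverVersion" in clause:
            cond = clause["driverVersion"]
            if cond["operator"] != "GreaterThan":
                raise ValueError(f"unsupported operator: {cond['operator']}")
            # Assumed semantics: compare the major version only.
            major = int(device["driverVersion"].split(".")[0])
            if not major > int(cond["value"]):
                return False
    return True


a100_selector = {
    "andExpression": [
        {"productName": "*A100*"},
        {"driverVersion": {"value": "460", "operator": "GreaterThan"}},
    ]
}
# Hypothetical devices for illustration.
a100 = {"productName": "NVIDIA A100-SXM4-80GB", "driverVersion": "525.85.12"}
t4 = {"productName": "Tesla T4", "driverVersion": "525.85.12"}
old_a100 = {"productName": "NVIDIA A100-SXM4-80GB", "driverVersion": "460.32.03"}

results = [matches(d, a100_selector) for d in (a100, t4, old_a100)]
```

Under these assumptions only the first device matches: the T4 fails the product glob, and the A100 on driver 460 fails the strict greater-than comparison.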

klueska avatar Mar 10 '23 12:03 klueska

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Aug 26 '24 04:08 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Sep 25 '24 04:09 github-actions[bot]