List of possible values for `nvidia.com/gpu.product`?

Open romilbhardwaj opened this issue 2 years ago • 2 comments

Thank you for this useful utility. Is there some place I can find the list of possible values for the nvidia.com/gpu.product label?

For example, I'm looking for a list like ["T4", "V100", "A100-SXM4-80GB", "A100-SXM4-40GB", ...] that contains all possible values.

Alternatively, it would be useful to know how these label values are generated (e.g., how are they parsed from lspci/nvidia-smi?)

romilbhardwaj avatar Aug 25 '23 01:08 romilbhardwaj

Is there a reason to know the complete set? Looking at the linked issue, it seems as if only certain GPUs are supported. Are there other properties -- such as compute capability -- that can be checked instead?

The construction of the product label is defined here: https://github.com/NVIDIA/gpu-feature-discovery/blob/152fa93619e973043d936f19bf20bb465c1ab289/internal/lm/resource.go#L159-L176

Here, `parts` is generally just the device name (https://github.com/NVIDIA/gpu-feature-discovery/blob/152fa93619e973043d936f19bf20bb465c1ab289/internal/lm/resource.go#L41). The name is returned by NVML or CUDA, depending on the type of device we're targeting.

In the nvidia-smi output, this corresponds to the Product Name field:

$ nvidia-smi -q -i 1 | head

==============NVSMI LOG==============

Timestamp                                 : Fri Aug 25 07:36:37 2023
Driver Version                            : 525.85.12
CUDA Version                              : 12.0

Attached GPUs                             : 8
GPU 00000000:0F:00.0
    Product Name                          : NVIDIA A100-SXM4-40GB

(without the NVIDIA prefix).
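
As a rough illustration of the mapping described above, here is a minimal Python sketch that turns an nvidia-smi Product Name into a candidate label value. It is not the GFD implementation (the linked Go code is authoritative). The function name `product_label_value` and the `strip_prefix` flag are hypothetical, and replacing whitespace with hyphens is an assumption based on the fact that Kubernetes label values cannot contain spaces.

```python
def product_label_value(product_name: str, strip_prefix: bool = True) -> str:
    """Approximate a gpu.product label value from an nvidia-smi Product Name.

    Hypothetical helper: the real construction lives in gpu-feature-discovery.
    """
    name = product_name.strip()
    # Drop the vendor prefix, as described in the comment above.
    if strip_prefix and name.startswith("NVIDIA "):
        name = name[len("NVIDIA "):]
    # Assumption: runs of whitespace become single hyphens, because
    # Kubernetes label values may not contain spaces.
    return "-".join(name.split())


if __name__ == "__main__":
    print(product_label_value("NVIDIA A100-SXM4-40GB"))
```

The result is only a guess at the label; the value on a real node should be checked before relying on it.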

elezar avatar Aug 25 '23 07:08 elezar

Thanks for the detailed response @elezar! This is very useful to know.

You're right, we do not need to know the exhaustive set of possible labels. In particular, having the labels for these GPUs would be a good start: ['A100', 'A10G', 'K80', 'M60', 'T4', 'T4g', 'V100', 'A10', 'A100-80GB', 'P100', 'P40', 'P4'].

Short of sourcing these GPUs and running nvidia-smi on them, is there any other method I can use to find the nvidia.com/gpu.product label value GFD will assign them?


To add more context, SkyPilot is a tool for running ML workloads on clouds and, more recently, Kubernetes. Our users select a specific GPU type (e.g., A100, L4, V100), and SkyPilot then runs the task on the requested GPUs. For us, these GPU types are non-fungible (e.g., users choose their hyperparameters based on the GPU type their job will run on, costs differ, etc.), so we cannot use compute capability or memory as a filter and must select based on GPU type.
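
To make the selection problem concrete, here is a minimal Python sketch of matching a user-requested GPU type against `nvidia.com/gpu.product` values seen on nodes. It is not SkyPilot's actual logic. The function `matching_products` is hypothetical, and it assumes label values are hyphen-separated tokens like the examples earlier in this thread.

```python
def matching_products(requested: str, label_values: list[str]) -> list[str]:
    """Return label values whose hyphen-separated tokens contain every
    token of the requested GPU type (case-insensitive).

    Token matching, rather than substring matching, keeps a request for
    "A10" from also selecting "A100-..." labels.
    """
    wanted = set(requested.lower().split("-"))
    return [
        value
        for value in label_values
        if wanted <= set(value.lower().split("-"))
    ]


if __name__ == "__main__":
    # Illustrative label values taken from the examples in this thread.
    seen = ["A100-SXM4-40GB", "A100-SXM4-80GB", "T4"]
    print(matching_products("A100-80GB", seen))
```

This only works if the requested short name appears as a token in the label, which is exactly why knowing the real label values for each GPU matters.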

romilbhardwaj avatar Aug 25 '23 12:08 romilbhardwaj

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Aug 26 '24 04:08 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Sep 25 '24 04:09 github-actions[bot]

@romilbhardwaj Did you ever get that list?

Jonathan-Eid avatar Jan 28 '25 23:01 Jonathan-Eid