k8s-device-plugin

Multiple device types detected:

Open sipvoip opened this issue 2 years ago • 6 comments

[root@gpu-feature-discovery-sjzg4 /]# gpu-feature-discovery mixed
I1022 16:36:51.576911 137 main.go:122] Starting OS watcher.
I1022 16:36:51.577238 137 main.go:127] Loading configuration.
I1022 16:36:51.577541 137 main.go:139] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1022 16:36:51.577912 137 factory.go:48] Detected NVML platform: found NVML library
I1022 16:36:51.577942 137 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1022 16:36:51.577952 137 factory.go:64] Using NVML manager
I1022 16:36:51.577959 137 main.go:144] Start running
W1022 16:36:51.602083 137 mig-strategy.go:151] Multiple device types detected: [NVIDIA GeForce RTX 3080 NVIDIA GeForce RTX 3090 NVIDIA GeForce RTX 4090]
I1022 16:36:51.606246 137 main.go:187] Creating Labels
2023/10/22 16:36:51 Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I1022 16:36:51.606418 137 main.go:197] Sleeping for 60000000000

Only the last GPU is showing up, the 4090.

root@kubernetes0: more /etc/kubernetes/node-feature-discovery/features.d/gfd
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.count=1
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=Standard-PC-(i440FX-+-PIIX,-1996)
nvidia.com/cuda.driver.minor=113
nvidia.com/gfd.timestamp=1697987627
nvidia.com/gpu.replicas=1
nvidia.com/gpu.memory=24564
nvidia.com/cuda.runtime.minor=2
nvidia.com/cuda.driver.rev=01
nvidia.com/mig.capable=false
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-4090
nvidia.com/gpu.compute.minor=9
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.runtime.major=12
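The mismatch between the warning in the log and the label file can be checked mechanically. The following is a minimal sketch, not part of GFD: it assumes the label file is plain `key=value` text as shown above, and the helper name `parse_gfd_labels`, the `SAMPLE` excerpt, and the `detected` list (copied by hand from the warning line) are all illustrative.

```python
def parse_gfd_labels(text):
    """Parse key=value lines into a dict; each label key holds one value."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")
        labels[key] = value
    return labels


# Excerpt of the label file quoted in this issue (not the full file).
SAMPLE = """\
nvidia.com/gpu.count=1
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-4090
nvidia.com/gpu.memory=24564
nvidia.com/mig.capable=false
"""

# Device names copied by hand from the "Multiple device types detected" warning.
detected = [
    "NVIDIA GeForce RTX 3080",
    "NVIDIA GeForce RTX 3090",
    "NVIDIA GeForce RTX 4090",
]

labels = parse_gfd_labels(SAMPLE)
mismatch = int(labels["nvidia.com/gpu.count"]) != len(detected)
print("label count:", labels["nvidia.com/gpu.count"], "detected:", len(detected))
print("mismatch:", mismatch)
```

On a real node, the text would come from reading `/etc/kubernetes/node-feature-discovery/features.d/gfd` instead of the hard-coded excerpt.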

How do I get all 3 GPUs to be discovered?

sipvoip, Oct 22 '23 16:10

I'm getting the same issue. I have a 3060 and a 3090 in one of my nodes and only the 3090 is showing even though lspci shows both. Have you figured out a workaround yet?

deanpeterson, Feb 13 '24 03:02

The labels exposed by GFD are node-level labels and we don't have the granularity to map these to common nvidia.com/gpu labels at present.
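To illustrate the granularity point: Kubernetes labels are a flat key-to-value map on the node, so a single key such as `nvidia.com/gpu.product` can carry only one value. The sketch below is a toy model under that assumption, not GFD's actual code; the function name `node_labels_from_devices` and the "last write wins" behavior are purely illustrative, and the thread does not establish which device GFD picks.

```python
def node_labels_from_devices(device_names):
    """Toy model: every device writes the same node-level label key."""
    labels = {}
    for name in device_names:
        # Label values cannot contain spaces, so names are dash-separated.
        labels["nvidia.com/gpu.product"] = name.replace(" ", "-")
    return labels


devices = [
    "NVIDIA GeForce RTX 3080",
    "NVIDIA GeForce RTX 3090",
    "NVIDIA GeForce RTX 4090",
]

node_labels = node_labels_from_devices(devices)
print(node_labels)
```

Three devices go in, but only one product value can survive under the shared key, which is consistent with a node reporting a single `nvidia.com/gpu.product` label on mixed-GPU hardware.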

@deanpeterson and @sipvoip what is it that you're trying to do with the labels?

elezar, Feb 13 '24 10:02

@elezar I'm using OpenShift AI with the Ray.io components to create distributed workloads. I have one dual-4090 machine that sees both video cards because they are the same model. But I have another node that has a 3060 and a 3090. When I spin up Ray.io workers, they have to match. So if I ask for workers with 2 GPUs, then to get 2 workers I need both the 3060 and the 3090 to be recognized by the NVIDIA GPU Operator. This was working on my EPYC machine, but that node was unstable, so I replaced it with a dual-Xeon machine. For some reason, the dual-Xeon machine sees both video cards, but the NVIDIA GPU Operator is not creating labels for both the 3060 and the 3090 and shows nvidia.com/gpu.count=1, even though the GPU discovery pod logs this:

W0213 05:32:27.592511 1 mig-strategy.go:151] Multiple device types detected: [NVIDIA GeForce RTX 3060 NVIDIA GeForce RTX 3090]
I0213 05:32:27.600682 1 main.go:187] Creating Labels

deanpeterson, Feb 13 '24 17:02

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot], Aug 26 '24 04:08

Hi, any update on that? I'm having the exact same issue. Thanks!

ThomasDravigney, Sep 09 '24 19:09

Hello. Is there any solution? Same problem. I have an NVIDIA GeForce GTX 1080 Ti 11 GB and an NVIDIA GeForce RTX 3080 10 GB, but only NVIDIA-GeForce-GTX-1080-Ti is displayed.

bambarambambum, Sep 20 '24 14:09

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot], Dec 20 '24 04:12

This issue was automatically closed due to inactivity.

github-actions[bot], Jan 20 '25 04:01