gpushare-device-plugin
No Devices found. Waiting indefinitely.
Plugin cannot find my A100 80G
I used Rancher 2.5.9 to build my cluster. I believe the installation steps are correct, since the same steps worked on another cluster that has an A100 40G. However, the plugin fails on this cluster, which has an A100 80G.
nvidia-smi gives the correct result.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   39C    P0    60W / 300W |      0MiB / 80994MiB |     14%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
However, no GPU shows up in the cluster:
kubectl describe node
Allocatable:
  cpu:                2
  ephemeral-storage:  48294789041
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3777904Ki
  pods:               110
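To make the check above repeatable, here is a minimal sketch of a helper that scans `kubectl describe node` output for a GPU resource entry. It assumes the gpushare plugin advertises `aliyun.com/gpu-mem` and `aliyun.com/gpu-count` (the names used in the plugin's documentation) and also accepts `nvidia.com/gpu` in case the stock NVIDIA plugin is installed instead. The function name `has_gpu_resource` is made up for illustration.

```shell
# has_gpu_resource: read `kubectl describe node` output on stdin and
# succeed if any of the assumed GPU resource names appears in it.
# Assumption: the resource names below match what the installed plugin registers.
has_gpu_resource() {
    grep -Eq 'aliyun\.com/gpu-(mem|count)|nvidia\.com/gpu'
}

# Typical use on a machine with cluster access (<node-name> is a placeholder):
#   kubectl describe node <node-name> | has_gpu_resource \
#       && echo "GPU resource advertised" \
#       || echo "no GPU resource on this node"
```

With the Allocatable list shown above, the helper would report that no GPU resource is advertised, which matches the plugin log.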
I tried to find the reason. This is the log of the plugin Pod:
[root@data1 ~]# docker logs 6e8823f03d54
I0114 15:37:53.669065 1 main.go:18] Start gpushare device plugin
I0114 15:37:53.669146 1 gpumanager.go:28] Loading NVML
I0114 15:37:53.743358 1 gpumanager.go:37] Fetching devices.
I0114 15:37:53.743407 1 gpumanager.go:39] No devices found. Waiting indefinitely.
[root@data1 ~]#
Any idea how this happens? Is it possible that the plugin does not support the A100 80G?
You can try upgrading nvidia-docker and then try again.
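Before upgrading, it may help to record what the node is currently running. The sketch below is only one possible sanity check, not a confirmed fix: it assumes the node uses Docker with the NVIDIA runtime set as the default, as the gpushare installation guide asks for. The helper `check_runtime` is a hypothetical name. The commented commands are standard Docker and libnvidia-container CLI calls and need to be run on the GPU node itself.

```shell
# check_runtime NAME: succeed only when the given Docker default runtime is "nvidia".
check_runtime() {
    if [ "$1" = "nvidia" ]; then
        echo "default runtime is nvidia"
    else
        echo "default runtime is '$1', not nvidia"
        return 1
    fi
}

# On the GPU node (requires access to the Docker daemon):
#   check_runtime "$(docker info --format '{{.DefaultRuntime}}')"
#   nvidia-container-cli --version   # installed libnvidia-container version
#   nvidia-container-cli info        # lists the devices libnvidia-container can see
```

If `nvidia-container-cli info` does not list the A100 80G either, that would point at the container toolkit rather than at the plugin, which is consistent with the upgrade suggestion.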