DCGM initialization error
1. Quick Debug Checklist
- [x] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [x] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)? A quick way to verify the last two items is sketched below.
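Assuming shell access to one of the worker nodes, something like this covers both checks:

```bash
# On a worker node: confirm the required kernel modules are loaded
# (built-in modules will not appear in lsmod output)
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# From kubectl: confirm the ClusterPolicy CRD was applied and inspect its state
kubectl get clusterpolicies.nvidia.com
kubectl describe clusterpolicies --all-namespaces
```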
1. Issue or feature description
We installed NVIDIA GPU Operator version 1.7.1 on our Kubernetes cluster using Helm, but the DCGM exporter fails with a DCGM initialization error and GPU resources are not discovered by the nodes. Please check the following logs.
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status:
kubectl get pods --all-namespaces
$ k get pod
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-5jjwl 1/1 Running 3 20h
gpu-feature-discovery-jfxq8 1/1 Running 0 20h
gpu-feature-discovery-kcr2p 1/1 Running 3 20h
nvidia-container-toolkit-daemonset-8r4df 1/1 Running 0 20h
nvidia-container-toolkit-daemonset-c2lw8 1/1 Running 0 20h
nvidia-container-toolkit-daemonset-mmvzk 1/1 Running 0 20h
nvidia-cuda-validator-fcffx 0/1 Completed 0 20h
nvidia-cuda-validator-j8x8w 0/1 Completed 0 20h
nvidia-cuda-validator-q79nf 0/1 Completed 0 20h
nvidia-dcgm-exporter-5kc4x 0/1 CrashLoopBackOff 242 20h
nvidia-dcgm-exporter-98kbb 0/1 CrashLoopBackOff 242 20h
nvidia-dcgm-exporter-fdqgd 0/1 CrashLoopBackOff 242 20h
nvidia-device-plugin-daemonset-jwsm4 1/1 Running 0 20h
nvidia-device-plugin-daemonset-rsjs8 1/1 Running 3 20h
nvidia-device-plugin-daemonset-tz4z9 1/1 Running 3 20h
nvidia-driver-daemonset-rx22m 1/1 Running 0 20h
nvidia-driver-daemonset-t8tkj 1/1 Running 0 20h
nvidia-driver-daemonset-vb6hh 1/1 Running 0 20h
nvidia-operator-validator-rkpqf 0/1 Init:3/4 163 20h
nvidia-operator-validator-tft4t 0/1 Init:3/4 165 20h
nvidia-operator-validator-xdjk8 0/1 Init:3/4 165 20h
- [x] kubernetes daemonset status:
kubectl get ds -n gpu-operator-resources
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 3 3 3 3 3 nvidia.com/gpu.deploy.gpu-feature-discovery=true 5d2h
nvidia-container-toolkit-daemonset 3 3 3 3 3 nvidia.com/gpu.deploy.container-toolkit=true 5d2h
nvidia-dcgm-exporter 3 3 0 3 0 nvidia.com/gpu.deploy.dcgm-exporter=true 5d2h
nvidia-device-plugin-daemonset 3 3 3 3 3 nvidia.com/gpu.deploy.device-plugin=true 5d2h
nvidia-driver-daemonset 3 3 3 3 3 nvidia.com/gpu.deploy.driver=true 5d2h
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 5d2h
nvidia-operator-validator 3 3 0 3 0 nvidia.com/gpu.deploy.operator-validator=true 5d2h
- [x] kubectl describe daemonsets -n gpu-operator-resources
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 27s daemonset-controller Created pod: nvidia-operator-validator-tft4t
Normal SuccessfulCreate 27s daemonset-controller Created pod: nvidia-operator-validator-rkpqf
Normal SuccessfulCreate 27s daemonset-controller Created pod: nvidia-operator-validator-xdjk8
- [ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
- [x] DCGM error state: k logs nvidia-dcgm-exporter-5kc4x
time="2021-07-06T00:15:36Z" level=info msg="Starting dcgm-exporter"
DCGM Failed to find any GPUs on the node.
time="2021-07-06T00:15:36Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
- [x] nvidia-operator-validator Pods:
- cuda-validation init container
time="2021-07-05T04:00:45Z" level=info msg="pod nvidia-cuda-validator-q79nf is curently in Pending phase"
time="2021-07-05T04:00:50Z" level=info msg="pod nvidia-cuda-validator-q79nf have run successfully"
- driver-validation init container
running command chroot with args [/run/nvidia/driver nvidia-smi]
Mon Jul 5 04:00:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:84:00.0 Off | 0 |
| N/A 39C P0 26W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- nvidia-operator-validator init container
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-xdjk8" is waiting to start: PodInitializing
- plugin-validation init container
time="2021-07-06T01:25:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2021-07-06T01:25:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2021-07-06T01:25:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
...
time="2021-07-06T01:27:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2021-07-06T01:27:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2021-07-06T01:27:52Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"
- toolkit-validation init container
Mon Jul 5 04:00:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:84:00.0 Off | 0 |
| N/A 37C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
@AnkitPurohit01 Can you get logs from the device-plugin-daemonset pods? I see that some of those pods have restarted, so the logs might help. Also, please run this command from any of the worker nodes and attach the output along with the syslog: /run/nvidia/driver/usr/bin/nvidia-bug-report.sh
- Logs from device-plugin-daemonset pods
$ kubectl logs nvidia-device-plugin-daemonset-rsjs8 -n gpu-operator-resources
2021/07/05 04:00:38 Loading NVML
2021/07/05 04:00:38 Starting FS watcher.
2021/07/05 04:00:38 Starting OS watcher.
2021/07/05 04:00:38 Retreiving plugins.
2021/07/05 04:00:38 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:38 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-jwsm4 -n gpu-operator-resources
2021/07/05 04:00:34 Loading NVML
2021/07/05 04:00:34 Starting FS watcher.
2021/07/05 04:00:34 Starting OS watcher.
2021/07/05 04:00:34 Retreiving plugins.
2021/07/05 04:00:34 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:34 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-tz4z9 -n gpu-operator-resources
2021/07/06 01:38:57 Loading NVML
2021/07/06 01:38:57 Starting FS watcher.
2021/07/06 01:38:57 Starting OS watcher.
2021/07/06 01:38:57 Retreiving plugins.
2021/07/06 01:38:57 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/06 01:38:57 No devices found. Waiting indefinitely.
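The "Loading NVML ... No devices found" lines show that the device plugin itself cannot see any GPU, so it never registers nvidia.com/gpu with the kubelet, which in turn explains the plugin-validation retries and the dcgm-exporter failure above. One way to confirm this from inside the plugin container (pod name taken from the listing above; nvidia-smi is only present in the container if the NVIDIA container runtime injected it, so a "command not found" here is itself a useful signal):

```bash
# Try to enumerate GPUs from inside the device-plugin container
kubectl exec -n gpu-operator-resources nvidia-device-plugin-daemonset-rsjs8 -- nvidia-smi -L
```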
The output of /run/nvidia/driver/usr/bin/nvidia-bug-report.sh is available here:
https://drive.google.com/file/d/1CxLqEZNxH3aBCVTwAH622ddkoBnMfYYO/view?usp=sharing
Did you find a solution or the cause of this problem, @AnkitPurohit01?
For anyone still looking, this solution worked for me: https://github.com/NVIDIA/dcgm-exporter/issues/59#issuecomment-1124400272. There is some additional background on it here: https://github.com/NVIDIA/gpu-monitoring-tools/issues/96#issuecomment-778270215