
DCGM initialization error

Open · AnkitPurohit01 opened this issue · 5 comments

1. Quick Debug Checklist

  • [x] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [x] Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check sketched after this list)
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
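For the kernel-module item above, a quick way to confirm on each worker node is something like the following (a suggestion, not taken from this report):

$ lsmod | grep -E 'i2c_core|ipmi_msghandler'
# (on some kernels i2c_core is built in and will not show up in lsmod)
# If a module is missing, load it and re-check:
$ sudo modprobe i2c_core
$ sudo modprobe ipmi_msghandler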

1. Issue or feature description

We installed the NVIDIA GPU Operator version 1.7.1 on our Kubernetes cluster using Helm, but there seems to be a DCGM initialization error and GPU resources are not discovered by the nodes.
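For context, a typical Helm invocation for this chart version looks roughly like the following (repo alias and flags are assumptions, not copied from our environment):

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
$ helm install --wait --generate-name nvidia/gpu-operator --version 1.7.1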

Please check the following logs

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

  • [x] kubernetes pods status: kubectl get pods --all-namespaces
$ k get pod
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-5jjwl                1/1     Running            3          20h
gpu-feature-discovery-jfxq8                1/1     Running            0          20h
gpu-feature-discovery-kcr2p                1/1     Running            3          20h
nvidia-container-toolkit-daemonset-8r4df   1/1     Running            0          20h
nvidia-container-toolkit-daemonset-c2lw8   1/1     Running            0          20h
nvidia-container-toolkit-daemonset-mmvzk   1/1     Running            0          20h
nvidia-cuda-validator-fcffx                0/1     Completed          0          20h
nvidia-cuda-validator-j8x8w                0/1     Completed          0          20h
nvidia-cuda-validator-q79nf                0/1     Completed          0          20h
nvidia-dcgm-exporter-5kc4x                 0/1     CrashLoopBackOff   242        20h
nvidia-dcgm-exporter-98kbb                 0/1     CrashLoopBackOff   242        20h
nvidia-dcgm-exporter-fdqgd                 0/1     CrashLoopBackOff   242        20h
nvidia-device-plugin-daemonset-jwsm4       1/1     Running            0          20h
nvidia-device-plugin-daemonset-rsjs8       1/1     Running            3          20h
nvidia-device-plugin-daemonset-tz4z9       1/1     Running            3          20h
nvidia-driver-daemonset-rx22m              1/1     Running            0          20h
nvidia-driver-daemonset-t8tkj              1/1     Running            0          20h
nvidia-driver-daemonset-vb6hh              1/1     Running            0          20h
nvidia-operator-validator-rkpqf            0/1     Init:3/4           163        20h
nvidia-operator-validator-tft4t            0/1     Init:3/4           165        20h
nvidia-operator-validator-xdjk8            0/1     Init:3/4           165        20h
  • [x] kubernetes daemonset status: kubectl get ds -n gpu-operator-resources
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                3         3         3       3            3           nvidia.com/gpu.deploy.gpu-feature-discovery=true   5d2h
nvidia-container-toolkit-daemonset   3         3         3       3            3           nvidia.com/gpu.deploy.container-toolkit=true       5d2h
nvidia-dcgm-exporter                 3         3         0       3            0           nvidia.com/gpu.deploy.dcgm-exporter=true           5d2h
nvidia-device-plugin-daemonset       3         3         3       3            3           nvidia.com/gpu.deploy.device-plugin=true           5d2h
nvidia-driver-daemonset              3         3         3       3            3           nvidia.com/gpu.deploy.driver=true                  5d2h
nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             5d2h
nvidia-operator-validator            3         3         0       3            0           nvidia.com/gpu.deploy.operator-validator=true      5d2h
  • [x] kubectl describe daemonsets -n gpu-operator-resources
Events:
  Type     Reason            Age                   From                  Message
  ----     ------            ----                  ----                  -------
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-tft4t
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-rkpqf
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-xdjk8
  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
  • [x] DCGM error state: k logs nvidia-dcgm-exporter-5kc4x
time="2021-07-06T00:15:36Z" level=info msg="Starting dcgm-exporter"
DCGM Failed to find any GPUs on the node.
time="2021-07-06T00:15:36Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
  • [x] nvidia-operator-validator pods (logs from each init container; see the log-retrieval sketch after these outputs):
  1. cuda-validation init container
time="2021-07-05T04:00:45Z" level=info msg="pod nvidia-cuda-validator-q79nf is curently in Pending phase"
time="2021-07-05T04:00:50Z" level=info msg="pod nvidia-cuda-validator-q79nf have run successfully"
  2. driver-validation init container
running command chroot with args [/run/nvidia/driver nvidia-smi]
Mon Jul  5 04:00:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:84:00.0 Off |                    0 |
| N/A   39C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  3. nvidia-operator-validator init container
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-xdjk8" is waiting to start: PodInitializing
  4. plugin-validation init container
time="2021-07-06T01:25:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2021-07-06T01:25:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2021-07-06T01:25:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
...
time="2021-07-06T01:27:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2021-07-06T01:27:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2021-07-06T01:27:52Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"
  5. toolkit-validation init container
Mon Jul  5 04:00:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:84:00.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
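The per-init-container logs above can be pulled with kubectl's -c flag, roughly as follows (the pod name is one of the validator pods listed earlier; container names are taken from the items above):

$ kubectl logs -n gpu-operator-resources nvidia-operator-validator-xdjk8 -c driver-validation
$ kubectl logs -n gpu-operator-resources nvidia-operator-validator-xdjk8 -c toolkit-validation
$ kubectl logs -n gpu-operator-resources nvidia-operator-validator-xdjk8 -c cuda-validation
$ kubectl logs -n gpu-operator-resources nvidia-operator-validator-xdjk8 -c plugin-validation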

AnkitPurohit01 commented on Jul 7, 2021

@AnkitPurohit01 Can you get the logs from the device-plugin-daemonset pods? I see that some of those pods have restarted, so the logs might help. Also, please run this command from any of the worker nodes and attach the output along with syslog: /run/nvidia/driver/usr/bin/nvidia-bug-report.sh
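For the bug report, a sketch of running it on a worker node (assuming the driver container root is mounted at /run/nvidia/driver, as the validator output above suggests):

# As given above, directly from the host:
$ sudo /run/nvidia/driver/usr/bin/nvidia-bug-report.sh
# If the script cannot resolve the containerized driver's libraries, chrooting
# into the driver root is one alternative:
$ sudo chroot /run/nvidia/driver nvidia-bug-report.sh
# Either way it produces nvidia-bug-report.log.gz, which can be attached here.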

shivamerla commented on Jul 9, 2021

  1. Logs from device-plugin-daemonset pods
$ kubectl logs nvidia-device-plugin-daemonset-rsjs8 -n gpu-operator-resources
2021/07/05 04:00:38 Loading NVML
2021/07/05 04:00:38 Starting FS watcher.
2021/07/05 04:00:38 Starting OS watcher.
2021/07/05 04:00:38 Retreiving plugins.
2021/07/05 04:00:38 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:38 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-jwsm4 -n gpu-operator-resources
2021/07/05 04:00:34 Loading NVML
2021/07/05 04:00:34 Starting FS watcher.
2021/07/05 04:00:34 Starting OS watcher.
2021/07/05 04:00:34 Retreiving plugins.
2021/07/05 04:00:34 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:34 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-tz4z9 -n gpu-operator-resources
2021/07/06 01:38:57 Loading NVML
2021/07/06 01:38:57 Starting FS watcher.
2021/07/06 01:38:57 Starting OS watcher.
2021/07/06 01:38:57 Retreiving plugins.
2021/07/06 01:38:57 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/06 01:38:57 No devices found. Waiting indefinitely.
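Since every plugin instance logs "No devices found", a follow-up worth capturing is whether the kubelet advertises the GPU resource at all (node name is a placeholder):

$ kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'
# A healthy node lists nvidia.com/gpu under both Capacity and Allocatable.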

AnkitPurohit01 commented on Jul 12, 2021

Please check the result of /run/nvidia/driver/usr/bin/nvidia-bug-report.sh here: https://drive.google.com/file/d/1CxLqEZNxH3aBCVTwAH622ddkoBnMfYYO/view?usp=sharing

AnkitPurohit01 commented on Jul 12, 2021

Did you find a solution or cause for this problem, @AnkitPurohit01?

dvaldivia commented on Jul 24, 2023

For anyone still looking, I found this solution worked for me: https://github.com/NVIDIA/dcgm-exporter/issues/59#issuecomment-1124400272. And there's some additional info about the solution here: https://github.com/NVIDIA/gpu-monitoring-tools/issues/96#issuecomment-778270215
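In case the links go stale: a commonly reported culprit behind "No devices found" in setups like this is that the NVIDIA runtime is not the container runtime's default, so pods such as dcgm-exporter and the device plugin never get the GPU devices injected. This is a general pointer and not necessarily exactly what the linked comments describe; assuming Docker, /etc/docker/daemon.json on each GPU node would contain roughly:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

followed by a Docker restart (sudo systemctl restart docker) and a restart of the affected pods.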

francescov1 commented on Dec 6, 2023