
DCGM initialization error

Open · AnkitPurohit01 opened this issue · 5 comments

1. Quick Debug Checklist

  • [x] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [x] Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check sketched after this list)
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
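For the kernel-module item above, a quick way to confirm on each worker node is something like the following (a suggestion, not taken from this report):

$ lsmod | grep -E 'i2c_core|ipmi_msghandler'
# (on some kernels i2c_core is built in and will not show up in lsmod)
# If a module is missing, load it and re-check:
$ sudo modprobe i2c_core
$ sudo modprobe ipmi_msghandler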

1. Issue or feature description

We installed the NVIDIA GPU Operator version 1.7.1 on our Kubernetes cluster using Helm, but there seems to be a DCGM initialization error and GPU resources are not discovered by the nodes.
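For context, a typical Helm invocation for this chart version looks roughly like the following (repo alias and flags are assumptions, not copied from our environment):

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
$ helm install --wait --generate-name nvidia/gpu-operator --version 1.7.1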

Please check the following logs

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

  • [x] kubernetes pods status: kubectl get pods --all-namespaces
$ k get pod
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-5jjwl                1/1     Running            3          20h
gpu-feature-discovery-jfxq8                1/1     Running            0          20h
gpu-feature-discovery-kcr2p                1/1     Running            3          20h
nvidia-container-toolkit-daemonset-8r4df   1/1     Running            0          20h
nvidia-container-toolkit-daemonset-c2lw8   1/1     Running            0          20h
nvidia-container-toolkit-daemonset-mmvzk   1/1     Running            0          20h
nvidia-cuda-validator-fcffx                0/1     Completed          0          20h
nvidia-cuda-validator-j8x8w                0/1     Completed          0          20h
nvidia-cuda-validator-q79nf                0/1     Completed          0          20h
nvidia-dcgm-exporter-5kc4x                 0/1     CrashLoopBackOff   242        20h
nvidia-dcgm-exporter-98kbb                 0/1     CrashLoopBackOff   242        20h
nvidia-dcgm-exporter-fdqgd                 0/1     CrashLoopBackOff   242        20h
nvidia-device-plugin-daemonset-jwsm4       1/1     Running            0          20h
nvidia-device-plugin-daemonset-rsjs8       1/1     Running            3          20h
nvidia-device-plugin-daemonset-tz4z9       1/1     Running            3          20h
nvidia-driver-daemonset-rx22m              1/1     Running            0          20h
nvidia-driver-daemonset-t8tkj              1/1     Running            0          20h
nvidia-driver-daemonset-vb6hh              1/1     Running            0          20h
nvidia-operator-validator-rkpqf            0/1     Init:3/4           163        20h
nvidia-operator-validator-tft4t            0/1     Init:3/4           165        20h
nvidia-operator-validator-xdjk8            0/1     Init:3/4           165        20h
  • [x] kubernetes daemonset status: kubectl get ds -n gpu-operator-resources
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                3         3         3       3            3           nvidia.com/gpu.deploy.gpu-feature-discovery=true   5d2h
nvidia-container-toolkit-daemonset   3         3         3       3            3           nvidia.com/gpu.deploy.container-toolkit=true       5d2h
nvidia-dcgm-exporter                 3         3         0       3            0           nvidia.com/gpu.deploy.dcgm-exporter=true           5d2h
nvidia-device-plugin-daemonset       3         3         3       3            3           nvidia.com/gpu.deploy.device-plugin=true           5d2h
nvidia-driver-daemonset              3         3         3       3            3           nvidia.com/gpu.deploy.driver=true                  5d2h
nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             5d2h
nvidia-operator-validator            3         3         0       3            0           nvidia.com/gpu.deploy.operator-validator=true      5d2h
  • [x] kubectl describe daemonsets -n gpu-operator-resources
Events:
  Type     Reason            Age                   From                  Message
  ----     ------            ----                  ----                  -------
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-tft4t
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-rkpqf
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-xdjk8
  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
  • [x] DCGM error state: k logs nvidia-dcgm-exporter-5kc4x
time="2021-07-06T00:15:36Z" level=info msg="Starting dcgm-exporter"
DCGM Failed to find any GPUs on the node.
time="2021-07-06T00:15:36Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
  • [x] nvidia-operator-validator pods (logs from each init container; see the log-retrieval sketch after these outputs):
  1. cuda-validation init container
time="2021-07-05T04:00:45Z" level=info msg="pod nvidia-cuda-validator-q79nf is curently in Pending phase"
time="2021-07-05T04:00:50Z" level=info msg="pod nvidia-cuda-validator-q79nf have run successfully"
  2. driver-validation init container
running command chroot with args [/run/nvidia/driver nvidia-smi]
Mon Jul  5 04:00:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:84:00.0 Off |                    0 |
| N/A   39C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  3. nvidia-operator-validator init container
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-xdjk8" is waiting to start: PodInitializing
  4. plugin-validation init container
time="2021-07-06T01:25:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2021-07-06T01:25:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2021-07-06T01:25:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
...
time="2021-07-06T01:27:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2021-07-06T01:27:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2021-07-06T01:27:52Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"
  5. toolkit-validation init container
Mon Jul  5 04:00:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:84:00.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
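The per-init-container logs above can be pulled with kubectl's -c flag, roughly as follows (the pod name is one of the validator pods listed earlier; container names are taken from the items above):

$ kubectl logs -n gpu-operator-resources nvidia-operator-validator-xdjk8 -c driver-validation
$ kubectl logs -n gpu-operator-resources nvidia-operator-validator-xdjk8 -c toolkit-validation
$ kubectl logs -n gpu-operator-resources nvidia-operator-validator-xdjk8 -c cuda-validation
$ kubectl logs -n gpu-operator-resources nvidia-operator-validator-xdjk8 -c plugin-validation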

AnkitPurohit01 commented on Jul 7, 2021

@AnkitPurohit01 Can you get the logs from the device-plugin-daemonset pods? I see that some of those pods have restarted, so the logs might help. Also, please run this command from any of the worker nodes and attach the output along with syslog: /run/nvidia/driver/usr/bin/nvidia-bug-report.sh
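For the bug report, a sketch of running it on a worker node (assuming the driver container root is mounted at /run/nvidia/driver, as the validator output above suggests):

# As given above, directly from the host:
$ sudo /run/nvidia/driver/usr/bin/nvidia-bug-report.sh
# If the script cannot resolve the containerized driver's libraries, chrooting
# into the driver root is one alternative:
$ sudo chroot /run/nvidia/driver nvidia-bug-report.sh
# Either way it produces nvidia-bug-report.log.gz, which can be attached here.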

shivamerla commented on Jul 9, 2021

  1. Logs from device-plugin-daemonset pods
$ kubectl logs nvidia-device-plugin-daemonset-rsjs8 -n gpu-operator-resources
2021/07/05 04:00:38 Loading NVML
2021/07/05 04:00:38 Starting FS watcher.
2021/07/05 04:00:38 Starting OS watcher.
2021/07/05 04:00:38 Retreiving plugins.
2021/07/05 04:00:38 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:38 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-jwsm4 -n gpu-operator-resources
2021/07/05 04:00:34 Loading NVML
2021/07/05 04:00:34 Starting FS watcher.
2021/07/05 04:00:34 Starting OS watcher.
2021/07/05 04:00:34 Retreiving plugins.
2021/07/05 04:00:34 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:34 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-tz4z9 -n gpu-operator-resources
2021/07/06 01:38:57 Loading NVML
2021/07/06 01:38:57 Starting FS watcher.
2021/07/06 01:38:57 Starting OS watcher.
2021/07/06 01:38:57 Retreiving plugins.
2021/07/06 01:38:57 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/06 01:38:57 No devices found. Waiting indefinitely.
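Since every plugin instance logs "No devices found", a follow-up worth capturing is whether the kubelet advertises the GPU resource at all (node name is a placeholder):

$ kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'
# A healthy node lists nvidia.com/gpu under both Capacity and Allocatable.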

AnkitPurohit01 commented on Jul 12, 2021

Please check the result of /run/nvidia/driver/usr/bin/nvidia-bug-report.sh here: https://drive.google.com/file/d/1CxLqEZNxH3aBCVTwAH622ddkoBnMfYYO/view?usp=sharing

AnkitPurohit01 commented on Jul 12, 2021

Did you find a solution or cause for this problem, @AnkitPurohit01?

dvaldivia commented on Jul 24, 2023

For anyone still looking, I found this solution worked for me: https://github.com/NVIDIA/dcgm-exporter/issues/59#issuecomment-1124400272. And there's some additional info about the solution here: https://github.com/NVIDIA/gpu-monitoring-tools/issues/96#issuecomment-778270215
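In case the links go stale: a commonly reported culprit behind "No devices found" in setups like this is that the NVIDIA runtime is not the container runtime's default, so pods such as dcgm-exporter and the device plugin never get the GPU devices injected. This is a general pointer and not necessarily exactly what the linked comments describe; assuming Docker, /etc/docker/daemon.json on each GPU node would contain roughly:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

followed by a Docker restart (sudo systemctl restart docker) and a restart of the affected pods.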

francescov1 commented on Dec 6, 2023