gpu-operator
Some pods are stuck in init on one of our clusters
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node? No, I am running Red Hat Enterprise Linux 8.6 (Ootpa)
- Are you running Kubernetes v1.13+? Yes, I am running OpenShift 4.11.35 with Kubernetes 1.23
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Yes, CRI-O
- GPU Operator version: 22.9.2
1. Issue or feature description
On one of our clusters, many NVIDIA pods are stuck in init. I checked the logs and could not find anything suspicious. Are there other logs that could tell me more?
I suspect the problem appeared after the migration to OpenShift 4.11.
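For context, this is roughly how the init-container logs below were collected (a sketch; the pod name is one of the stuck pods on this cluster, and <init-container> is a placeholder for the name returned by the first command):
# List the init containers of a stuck pod
oc get pod -n nvidia-gpu-operator gpu-feature-discovery-ddw67 -o jsonpath='{.spec.initContainers[*].name}{"\n"}'
# Fetch the logs of that init container (<init-container> is a placeholder)
oc logs -n nvidia-gpu-operator gpu-feature-discovery-ddw67 -c <init-container>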
2. Steps to reproduce the issue
Since this is a production cluster used by hundreds of customers, it is quite hard to reproduce reliably. However, here is a possible way to reproduce:
- Setup OpenShift with version 4.11.20
- Install NFD Operator 4.11.0-202212070335
- Install GPU Operator version 22.9.2
- Use Nvidia A30 GPUs
3. Information to attach
Logs of one of the GPU Feature Discovery pods stuck in init (gpu-feature-discovery-ddw67)
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
...
Logs of one of the MIG Manager pods stuck in init (nvidia-mig-manager-b4675)
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
...
Logs of one of the Toolkit DaemonSet pods stuck in init (nvidia-container-toolkit-daemonset-664n2)
time="2023-02-07T13:08:04Z" level=info msg="Driver is not pre-installed on the host. Checking driver container status."
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
...
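The init container above is polling for a status file under /run/nvidia/validations on the host. A sketch of checking that path directly on a GPU node (<gpu-node> is a placeholder for the node name):
# Open a debug shell on the node and list the validation status files
oc debug node/<gpu-node> -- chroot /host ls -la /run/nvidia/validations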
Logs of one of the Driver DaemonSet pods stuck in init (nvidia-driver-daemonset-411.86.202212072103-0-8k4vr)
Running nv-ctr-run-with-dtk
...
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.
WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.
...
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 4.18.0-372.36.1.el8_6.x86_64
...
Starting NVIDIA persistence daemon...
ls: cannot access '/proc/driver/nvidia-nvswitch/devices/*': No such file or directory
Mounting NVIDIA driver rootfs...
Change device files security context for selinux compatibility
Done, now waiting for signal
oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-2x45b 1/1 Running 0 7d8h
gpu-feature-discovery-ddw67 0/1 Init:0/1 0 42h
gpu-feature-discovery-hxcpm 1/1 Running 0 7d8h
gpu-operator-66c69d4d8b-7ll7f 1/1 Running 0 42h
nvidia-container-toolkit-daemonset-62b62 1/1 Running 0 7d19h
nvidia-container-toolkit-daemonset-664n2 0/1 Init:0/1 0 42h
nvidia-container-toolkit-daemonset-hwbpw 0/1 Init:0/1 0 42h
nvidia-cuda-validator-9j7mq 0/1 Completed 0 7d19h
nvidia-cuda-validator-czqfk 0/1 Completed 0 7d19h
nvidia-dcgm-exporter-4rnxn 1/1 Running 0 7d8h
nvidia-dcgm-exporter-79mqk 0/1 Init:0/2 0 42h
nvidia-dcgm-exporter-cv6rg 0/1 CrashLoopBackOff 495 (32s ago) 42h
nvidia-dcgm-jlznx 0/1 Init:0/1 0 42h
nvidia-dcgm-klpt5 1/1 Running 0 7d8h
nvidia-dcgm-pjsqb 0/1 CrashLoopBackOff 503 (25s ago) 42h
nvidia-device-plugin-daemonset-g42hg 1/1 Running 0 7d19h
nvidia-device-plugin-daemonset-jsg8j 0/1 Init:0/1 0 42h
nvidia-device-plugin-daemonset-rkhgx 0/1 Init:0/1 0 42h
nvidia-device-plugin-validator-f5p4w 0/1 Completed 0 7d19h
nvidia-device-plugin-validator-tmszz 0/1 Completed 0 7d19h
nvidia-driver-daemonset-411.86.202212072103-0-8k4vr 2/2 Running 2 7d17h
nvidia-driver-daemonset-411.86.202212072103-0-9jdnc 2/2 Running 0 7d19h
nvidia-driver-daemonset-411.86.202212072103-0-cvtc4 2/2 Running 0 7d19h
nvidia-mig-manager-6qdfb 1/1 Running 0 7d8h
nvidia-mig-manager-b4675 0/1 Init:0/1 0 42h
nvidia-mig-manager-glb7z 0/1 Init:0/1 0 42h
nvidia-node-status-exporter-dvfdg 1/1 Running 0 7d8h
nvidia-node-status-exporter-jl9x5 1/1 Running 2 7d8h
nvidia-node-status-exporter-jmbvp 1/1 Running 0 7d8h
nvidia-operator-validator-cghvm 1/1 Running 0 7d19h
nvidia-operator-validator-gq5g2 0/1 Init:0/4 0 42h
nvidia-operator-validator-lwd4c 0/1 Init:0/4 0 42h
oc get ds -n nvidia-gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 3 3 2 3 2 nvidia.com/gpu.deploy.gpu-feature-discovery=true 107d
nvidia-container-toolkit-daemonset 3 3 1 2 1 nvidia.com/gpu.deploy.container-toolkit=true 107d
nvidia-dcgm 3 3 1 3 1 nvidia.com/gpu.deploy.dcgm=true 107d
nvidia-dcgm-exporter 3 3 1 3 1 nvidia.com/gpu.deploy.dcgm-exporter=true 107d
nvidia-device-plugin-daemonset 3 3 1 2 1 nvidia.com/gpu.deploy.device-plugin=true 107d
nvidia-driver-daemonset-411.86.202212072103-0 3 3 3 0 3 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202212072103-0,nvidia.com/gpu.deploy.driver=true 7d21h
nvidia-mig-manager 3 3 1 3 1 nvidia.com/gpu.deploy.mig-manager=true 107d
nvidia-node-status-exporter 3 3 3 3 3 nvidia.com/gpu.deploy.node-status-exporter=true 107d
nvidia-operator-validator 3 3 1 2 1 nvidia.com/gpu.deploy.operator-validator=true 107d
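To narrow down which node the not-ready DaemonSet pods land on, a sketch of listing pods with their node assignment (the label comes from the driver DaemonSet node selector above):
# Show node placement for every pod in the namespace
oc get pods -n nvidia-gpu-operator -o wide
# List the GPU nodes targeted by the driver DaemonSet
oc get nodes -l nvidia.com/gpu.deploy.driver=true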
NVIDIA shared directory: `ls -la /run/nvidia`
total 4
drwxr-xr-x. 4 root root 100 Feb 7 13:09 .
drwxr-xr-x. 48 root root 1260 Feb 7 13:07 ..
dr-xr-xr-x. 1 root root 103 Feb 7 13:08 driver
-rw-r--r--. 1 root root 6 Feb 7 13:09 nvidia-driver.pid
drwxr-xr-x. 2 root root 40 Feb 7 13:07 validations
NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
total 12916
drwxr-xr-x. 3 root root 4096 Feb 1 13:45 .
drwxr-xr-x. 3 root root 21 Feb 1 13:45 ..
drwxr-xr-x. 3 root root 38 Feb 1 13:45 .config
lrwxrwxrwx. 1 root root 32 Feb 1 13:45 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
-rwxr-xr-x. 1 root root 2959400 Feb 1 13:45 libnvidia-container-go.so.1.11.0
lrwxrwxrwx. 1 root root 29 Feb 1 13:45 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
-rwxr-xr-x. 1 root root 191784 Feb 1 13:45 libnvidia-container.so.1.11.0
-rwxr-xr-x. 1 root root 154 Feb 1 13:45 nvidia-container-cli
-rwxr-xr-x. 1 root root 48072 Feb 1 13:45 nvidia-container-cli.real
-rwxr-xr-x. 1 root root 342 Feb 1 13:45 nvidia-container-runtime
-rwxr-xr-x. 1 root root 414 Feb 1 13:45 nvidia-container-runtime-experimental
-rwxr-xr-x. 1 root root 203 Feb 1 13:45 nvidia-container-runtime-hook
-rwxr-xr-x. 1 root root 2142816 Feb 1 13:45 nvidia-container-runtime-hook.real
-rwxr-xr-x. 1 root root 3771792 Feb 1 13:45 nvidia-container-runtime.experimental
-rwxr-xr-x. 1 root root 4079768 Feb 1 13:45 nvidia-container-runtime.real
lrwxrwxrwx. 1 root root 29 Feb 1 13:45 nvidia-container-toolkit -> nvidia-container-runtime-hook
NVIDIA driver directory: `ls -la /run/nvidia/driver`
total 0
dr-xr-xr-x. 1 root root 103 Feb 7 13:08 .
drwxr-xr-x. 4 root root 100 Feb 7 13:09 ..
lrwxrwxrwx. 1 root root 7 Jun 21 2021 bin -> usr/bin
dr-xr-xr-x. 2 root root 6 Jun 21 2021 boot
drwxr-xr-x. 16 root root 3100 Feb 7 13:09 dev
drwxr-xr-x. 1 root root 43 Feb 7 13:08 drivers
drwxr-xr-x. 1 root root 68 Feb 7 13:09 etc
drwxr-xr-x. 2 root root 6 Jun 21 2021 home
drwxr-xr-x. 2 root root 24 Feb 7 13:08 host-etc
lrwxrwxrwx. 1 root root 7 Jun 21 2021 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Jun 21 2021 lib64 -> usr/lib64
drwxr-xr-x. 2 root root 38 Dec 6 19:28 licenses
drwx------. 2 root root 6 Oct 19 04:46 lost+found
drwxr-xr-x. 2 root root 6 Jun 21 2021 media
drwxr-xr-x. 1 root root 42 Feb 7 13:08 mnt
drwxr-xr-x. 2 root root 6 Jun 21 2021 opt
dr-xr-xr-x. 2895 root root 0 Feb 7 13:06 proc
dr-xr-x---. 3 root root 213 Oct 19 04:57 root
drwxr-xr-x. 1 root root 136 Feb 7 13:09 run
lrwxrwxrwx. 1 root root 8 Jun 21 2021 sbin -> usr/sbin
drwxr-xr-x. 2 root root 6 Jun 21 2021 srv
dr-xr-xr-x. 13 root root 0 Feb 7 13:07 sys
drwxrwxrwx. 1 root root 18 Feb 7 13:09 tmp
drwxr-xr-x. 1 root root 65 Oct 19 04:47 usr
drwxr-xr-x. 1 root root 30 Oct 19 04:47 var
@Alwinator From the driver pod logs you posted, it looks like the driver install was successful. Can you exec into that container and run "nvidia-smi"?
oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-411.86.202212072103-0-8k4vr -- nvidia-smi
@shivamerla
Thu Feb 9 13:25:22 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A30 On | 00000000:21:00.0 Off | On |
| N/A 31C P0 27W / 165W | 0MiB / 24576MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 On | 00000000:81:00.0 Off | On |
| N/A 31C P0 29W / 165W | 0MiB / 24576MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 On | 00000000:E2:00.0 Off | On |
| N/A 31C P0 29W / 165W | 0MiB / 24576MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
@Alwinator If nvidia-smi succeeds, then the driver DaemonSet will create the file /run/nvidia/validations/.driver-ctr-ready from the startup probe here. Is it possible to double-check whether this status file got created on the worker node but the toolkit pod is not seeing it for some reason? If that is the case, you can try restarting the container-toolkit pods so that the driver readiness checks pass.
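If it comes to that, a sketch of restarting the stuck toolkit pods by name (pod names taken from the oc get pods output above; the DaemonSet recreates them automatically):
oc delete pod -n nvidia-gpu-operator nvidia-container-toolkit-daemonset-664n2 nvidia-container-toolkit-daemonset-hwbpw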
@shivamerla The file does not exist. There is not even an nvidia directory under /run.
# cat /run/nvidia/validations/.driver-ctr-ready
cat: /run/nvidia/validations/.driver-ctr-ready: No such file or directory
# cd /run
# ls
blkid console cryptsetup faillock lock log rhsm secrets sepermit setrans systemd user
From the logs attached to this issue earlier, it looks like the driver directory is mounted. Maybe the driver container restarted for some reason while you were checking and unmounted the /run/nvidia directory?
ls -la /run/nvidia/driver
total 0
dr-xr-xr-x. 1 root root 103 Feb 7 13:08 .
drwxr-xr-x. 4 root root 100 Feb 7 13:09 ..
lrwxrwxrwx. 1 root root 7 Jun 21 2021 bin -> usr/bin
dr-xr-xr-x. 2 root root 6 Jun 21 2021 boot
drwxr-xr-x. 16 root root 3100 Feb 7 13:09 dev
drwxr-xr-x. 1 root root 43 Feb 7 13:08 drivers
drwxr-xr-x. 1 root root 68 Feb 7 13:09 etc
drwxr-xr-x. 2 root root 6 Jun 21 2021 home
drwxr-xr-x. 2 root root 24 Feb 7 13:08 host-etc
lrwxrwxrwx. 1 root root 7 Jun 21 2021 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Jun 21 2021 lib64 -> usr/lib64
drwxr-xr-x. 2 root root 38 Dec 6 19:28 licenses
drwx------. 2 root root 6 Oct 19 04:46 lost+found
drwxr-xr-x. 2 root root 6 Jun 21 2021 media
drwxr-xr-x. 1 root root 42 Feb 7 13:08 mnt
drwxr-xr-x. 2 root root 6 Jun 21 2021 opt
dr-xr-xr-x. 2895 root root 0 Feb 7 13:06 proc
dr-xr-x---. 3 root root 213 Oct 19 04:57 root
drwxr-xr-x. 1 root root 136 Feb 7 13:09 run
lrwxrwxrwx. 1 root root 8 Jun 21 2021 sbin -> usr/sbin
drwxr-xr-x. 2 root root 6 Jun 21 2021 srv
dr-xr-xr-x. 13 root root 0 Feb 7 13:07 sys
drwxrwxrwx. 1 root root 18 Feb 7 13:09 tmp
drwxr-xr-x. 1 root root 65 Oct 19 04:47 usr
drwxr-xr-x. 1 root root 30 Oct 19 04:47 var
Is the driver container constantly restarting?
I have executed ls -la /run/nvidia/driver on all GPU nodes of this cluster, and every node shows that this directory does not exist. Additionally, the driver pod has not restarted for a week. No pods are restarting except the nvidia-dcgm ones, which are in CrashLoopBackOff because many other NVIDIA pods are stuck in the init state.
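For reference, a sketch of how such a per-node check can be scripted (the label comes from the driver DaemonSet node selector shown earlier):
for node in $(oc get nodes -l nvidia.com/gpu.deploy.driver=true -o name); do
  echo "== $node =="
  # /host is the node root filesystem inside the debug pod
  oc debug "$node" -- chroot /host ls -la /run/nvidia/driver
done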
I am seeing this exact same issue, with the exception that this is not on OpenShift. Additionally, I don't see the startupProbe set on the pods.
+1
+1
I have encountered the same problem. Have you solved it?
Did a recent Linux kernel update happen on the node, or was the node restarted? If yes, please check the kernel version and its compatibility with the GPU Operator version you are trying to install.
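For example, a sketch of checking the kernel version per node (using kubectl, since not everyone in this thread is on OpenShift):
# Kernel version reported by Kubernetes for each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
# Or directly on the node itself
uname -r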
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set toolkit.enabled=false
I read on the official website that you can deploy this way without a driver, but my physical machine does not have the NVIDIA driver deployed, and an error is reported. The nvidia-operator-validator logs show:
running command chroot with args [/run/nvidia/driver nvidia-smi]
chroot: failed to run command 'nvidia-smi': No such file or directory
command failed, retrying after 5 seconds
Is my understanding wrong? Must the host have the driver installed?
I believe you are trying to install the latest gpu-operator. Can you please provide the output of this command: uname -sra
Also, you should install with toolkit.enabled=true. I believe your node does not already have the toolkit installed.
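For example, a sketch of the corrected install (toolkit.enabled defaults to true, so the flag can simply be omitted):
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator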
Thanks, I have solved it.
Hi, I have manually installed the NVIDIA GPU driver on the workstation. When I helm installed the latest gpu-operator (v23.9.2), I set driver.enabled to false and left toolkit.enabled at its default of true. The installation succeeded and the pod for testing the GPU also ran successfully, but after restarting the GPU node I encountered a similar problem.