
Some pods are stuck in init on one of our clusters

Alwinator opened this issue 2 years ago · 18 comments

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? No, I am running Red Hat Enterprise Linux 8.6 (Ootpa)
  • Are you running Kubernetes v1.13+? Yes, I am running OpenShift 4.11.35 with Kubernetes 1.23
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Yes, CRI-O
  • GPU Operator version: 22.9.2

2. Issue or feature description

On one of our clusters, many NVIDIA pods are stuck in init. I checked the logs and could not find anything suspicious. Are there other logs that would tell me more?

I suspect the problem appeared after the migration to OpenShift 4.11.

3. Steps to reproduce the issue

Since this is a production cluster used by hundreds of customers, it is quite hard to reproduce reliably; however, the following steps should recreate the setup (a quick version check follows the list):

  1. Set up OpenShift version 4.11.20
  2. Install NFD Operator 4.11.0-202212070335
  3. Install GPU Operator version 22.9.2
  4. Use NVIDIA A30 GPUs
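For completeness, the installed operator versions can be confirmed from the cluster itself; a minimal check, assuming the default nvidia-gpu-operator and openshift-nfd namespaces:

# List the ClusterServiceVersions installed by OLM for both operators
oc get csv -n nvidia-gpu-operator
oc get csv -n openshift-nfd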

4. Information to attach

Logs of one of the GPU Feature Discovery pods stuck in init (gpu-feature-discovery-ddw67)
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
...
Logs of one of the MIG Manager pods stuck in init (nvidia-mig-manager-b4675)
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
...
Logs of one of the Toolkit DaemonSet pods stuck in init (nvidia-container-toolkit-daemonset-664n2)
time="2023-02-07T13:08:04Z" level=info msg="Driver is not pre-installed on the host. Checking driver container status."
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
...
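To see which init container a stuck pod is actually waiting on (pod name taken from the toolkit logs above), the init container statuses can be queried, roughly like this:

# Print each init container of the stuck toolkit pod together with its current state
oc -n nvidia-gpu-operator get pod nvidia-container-toolkit-daemonset-664n2 \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{" -> "}{.state}{"\n"}{end}'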
Logs of one of the Driver DaemonSet pods stuck in init (nvidia-driver-daemonset-411.86.202212072103-0-8k4vr)
Running nv-ctr-run-with-dtk
...
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.
WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.
...
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 4.18.0-372.36.1.el8_6.x86_64
...
Starting NVIDIA persistence daemon...
ls: cannot access '/proc/driver/nvidia-nvswitch/devices/*': No such file or directory
Mounting NVIDIA driver rootfs...
Change device files security context for selinux compatibility
Done, now waiting for signal

oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS             RESTARTS        AGE
gpu-feature-discovery-2x45b                           1/1     Running            0               7d8h
gpu-feature-discovery-ddw67                           0/1     Init:0/1           0               42h
gpu-feature-discovery-hxcpm                           1/1     Running            0               7d8h
gpu-operator-66c69d4d8b-7ll7f                         1/1     Running            0               42h
nvidia-container-toolkit-daemonset-62b62              1/1     Running            0               7d19h
nvidia-container-toolkit-daemonset-664n2              0/1     Init:0/1           0               42h
nvidia-container-toolkit-daemonset-hwbpw              0/1     Init:0/1           0               42h
nvidia-cuda-validator-9j7mq                           0/1     Completed          0               7d19h
nvidia-cuda-validator-czqfk                           0/1     Completed          0               7d19h
nvidia-dcgm-exporter-4rnxn                            1/1     Running            0               7d8h
nvidia-dcgm-exporter-79mqk                            0/1     Init:0/2           0               42h
nvidia-dcgm-exporter-cv6rg                            0/1     CrashLoopBackOff   495 (32s ago)   42h
nvidia-dcgm-jlznx                                     0/1     Init:0/1           0               42h
nvidia-dcgm-klpt5                                     1/1     Running            0               7d8h
nvidia-dcgm-pjsqb                                     0/1     CrashLoopBackOff   503 (25s ago)   42h
nvidia-device-plugin-daemonset-g42hg                  1/1     Running            0               7d19h
nvidia-device-plugin-daemonset-jsg8j                  0/1     Init:0/1           0               42h
nvidia-device-plugin-daemonset-rkhgx                  0/1     Init:0/1           0               42h
nvidia-device-plugin-validator-f5p4w                  0/1     Completed          0               7d19h
nvidia-device-plugin-validator-tmszz                  0/1     Completed          0               7d19h
nvidia-driver-daemonset-411.86.202212072103-0-8k4vr   2/2     Running            2               7d17h
nvidia-driver-daemonset-411.86.202212072103-0-9jdnc   2/2     Running            0               7d19h
nvidia-driver-daemonset-411.86.202212072103-0-cvtc4   2/2     Running            0               7d19h
nvidia-mig-manager-6qdfb                              1/1     Running            0               7d8h
nvidia-mig-manager-b4675                              0/1     Init:0/1           0               42h
nvidia-mig-manager-glb7z                              0/1     Init:0/1           0               42h
nvidia-node-status-exporter-dvfdg                     1/1     Running            0               7d8h
nvidia-node-status-exporter-jl9x5                     1/1     Running            2               7d8h
nvidia-node-status-exporter-jmbvp                     1/1     Running            0               7d8h
nvidia-operator-validator-cghvm                       1/1     Running            0               7d19h
nvidia-operator-validator-gq5g2                       0/1     Init:0/4           0               42h
nvidia-operator-validator-lwd4c                       0/1     Init:0/4           0               42h
oc get ds -n nvidia-gpu-operator
NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
gpu-feature-discovery                           3         3         2       3            2           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      107d
nvidia-container-toolkit-daemonset              3         3         1       2            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                          107d
nvidia-dcgm                                     3         3         1       3            1           nvidia.com/gpu.deploy.dcgm=true                                                                                       107d
nvidia-dcgm-exporter                            3         3         1       3            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                              107d
nvidia-device-plugin-daemonset                  3         3         1       2            1           nvidia.com/gpu.deploy.device-plugin=true                                                                              107d
nvidia-driver-daemonset-411.86.202212072103-0   3         3         3       0            3           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202212072103-0,nvidia.com/gpu.deploy.driver=true   7d21h
nvidia-mig-manager                              3         3         1       3            1           nvidia.com/gpu.deploy.mig-manager=true                                                                                107d
nvidia-node-status-exporter                     3         3         3       3            3           nvidia.com/gpu.deploy.node-status-exporter=true                                                                       107d
nvidia-operator-validator                       3         3         1       2            1           nvidia.com/gpu.deploy.operator-validator=true                                                                         107d
NVIDIA shared directory: `ls -la /run/nvidia`
total 4
drwxr-xr-x.  4 root root  100 Feb  7 13:09 .
drwxr-xr-x. 48 root root 1260 Feb  7 13:07 ..
dr-xr-xr-x.  1 root root  103 Feb  7 13:08 driver
-rw-r--r--.  1 root root    6 Feb  7 13:09 nvidia-driver.pid
drwxr-xr-x.  2 root root   40 Feb  7 13:07 validations
NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
total 12916
drwxr-xr-x. 3 root root    4096 Feb  1 13:45 .
drwxr-xr-x. 3 root root      21 Feb  1 13:45 ..
drwxr-xr-x. 3 root root      38 Feb  1 13:45 .config
lrwxrwxrwx. 1 root root      32 Feb  1 13:45 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
-rwxr-xr-x. 1 root root 2959400 Feb  1 13:45 libnvidia-container-go.so.1.11.0
lrwxrwxrwx. 1 root root      29 Feb  1 13:45 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
-rwxr-xr-x. 1 root root  191784 Feb  1 13:45 libnvidia-container.so.1.11.0
-rwxr-xr-x. 1 root root     154 Feb  1 13:45 nvidia-container-cli
-rwxr-xr-x. 1 root root   48072 Feb  1 13:45 nvidia-container-cli.real
-rwxr-xr-x. 1 root root     342 Feb  1 13:45 nvidia-container-runtime
-rwxr-xr-x. 1 root root     414 Feb  1 13:45 nvidia-container-runtime-experimental
-rwxr-xr-x. 1 root root     203 Feb  1 13:45 nvidia-container-runtime-hook
-rwxr-xr-x. 1 root root 2142816 Feb  1 13:45 nvidia-container-runtime-hook.real
-rwxr-xr-x. 1 root root 3771792 Feb  1 13:45 nvidia-container-runtime.experimental
-rwxr-xr-x. 1 root root 4079768 Feb  1 13:45 nvidia-container-runtime.real
lrwxrwxrwx. 1 root root      29 Feb  1 13:45 nvidia-container-toolkit -> nvidia-container-runtime-hook
NVIDIA driver directory: `ls -la /run/nvidia/driver`
total 0
dr-xr-xr-x.    1 root root  103 Feb  7 13:08 .
drwxr-xr-x.    4 root root  100 Feb  7 13:09 ..
lrwxrwxrwx.    1 root root    7 Jun 21  2021 bin -> usr/bin
dr-xr-xr-x.    2 root root    6 Jun 21  2021 boot
drwxr-xr-x.   16 root root 3100 Feb  7 13:09 dev
drwxr-xr-x.    1 root root   43 Feb  7 13:08 drivers
drwxr-xr-x.    1 root root   68 Feb  7 13:09 etc
drwxr-xr-x.    2 root root    6 Jun 21  2021 home
drwxr-xr-x.    2 root root   24 Feb  7 13:08 host-etc
lrwxrwxrwx.    1 root root    7 Jun 21  2021 lib -> usr/lib
lrwxrwxrwx.    1 root root    9 Jun 21  2021 lib64 -> usr/lib64
drwxr-xr-x.    2 root root   38 Dec  6 19:28 licenses
drwx------.    2 root root    6 Oct 19 04:46 lost+found
drwxr-xr-x.    2 root root    6 Jun 21  2021 media
drwxr-xr-x.    1 root root   42 Feb  7 13:08 mnt
drwxr-xr-x.    2 root root    6 Jun 21  2021 opt
dr-xr-xr-x. 2895 root root    0 Feb  7 13:06 proc
dr-xr-x---.    3 root root  213 Oct 19 04:57 root
drwxr-xr-x.    1 root root  136 Feb  7 13:09 run
lrwxrwxrwx.    1 root root    8 Jun 21  2021 sbin -> usr/sbin
drwxr-xr-x.    2 root root    6 Jun 21  2021 srv
dr-xr-xr-x.   13 root root    0 Feb  7 13:07 sys
drwxrwxrwx.    1 root root   18 Feb  7 13:09 tmp
drwxr-xr-x.    1 root root   65 Oct 19 04:47 usr
drwxr-xr-x.    1 root root   30 Oct 19 04:47 var

Alwinator · Feb 09 '23

@Alwinator From the driver pod logs you posted, it looks like the driver install was successful. Can you exec into that container and run `nvidia-smi`?

oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-411.86.202212072103-0-8k4vr -- nvidia-smi

shivamerla · Feb 09 '23

@shivamerla

Thu Feb  9 13:25:22 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          On   | 00000000:21:00.0 Off |                   On |
| N/A   31C    P0    27W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A30          On   | 00000000:81:00.0 Off |                   On |
| N/A   31C    P0    29W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A30          On   | 00000000:E2:00.0 Off |                   On |
| N/A   31C    P0    29W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Alwinator · Feb 09 '23

@Alwinator If nvidia-smi succeeds, the driver DaemonSet will create the file /run/nvidia/validations/.driver-ctr-ready from its startup probe. Could you double-check whether this status file was created on the worker node but the toolkit pod is not seeing it for some reason? If that is the case, you can try restarting the container-toolkit pods so the driver readiness check passes.
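For example, something along these lines (the node name is a placeholder, and the app label selector is an assumption; adjust it to whatever labels your toolkit DaemonSet carries):

# Check on the worker node whether the startup probe created the status file
oc debug node/<node-name> -- chroot /host ls -la /run/nvidia/validations

# If the file exists but the toolkit pod is still waiting, recreate the toolkit pods
oc delete pod -n nvidia-gpu-operator -l app=nvidia-container-toolkit-daemonset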

shivamerla · Feb 12 '23

@shivamerla The file does not exist. There is not even an nvidia folder under /run.

# cat /run/nvidia/validations/.driver-ctr-ready
cat: /run/nvidia/validations/.driver-ctr-ready: No such file or directory
# cd /run
# ls    
blkid  console  cryptsetup  faillock  lock  log  rhsm  secrets  sepermit  setrans  systemd  user

Alwinator · Feb 13 '23

From the logs attached to this issue earlier, it looks like the driver directory is mounted. Maybe the driver container restarted for some reason while you were checking and unmounted the /run/nvidia directory?

ls -la /run/nvidia/driver

total 0
dr-xr-xr-x.    1 root root  103 Feb  7 13:08 .
drwxr-xr-x.    4 root root  100 Feb  7 13:09 ..
lrwxrwxrwx.    1 root root    7 Jun 21  2021 bin -> usr/bin
dr-xr-xr-x.    2 root root    6 Jun 21  2021 boot
drwxr-xr-x.   16 root root 3100 Feb  7 13:09 dev
drwxr-xr-x.    1 root root   43 Feb  7 13:08 drivers
drwxr-xr-x.    1 root root   68 Feb  7 13:09 etc
drwxr-xr-x.    2 root root    6 Jun 21  2021 home
drwxr-xr-x.    2 root root   24 Feb  7 13:08 host-etc
lrwxrwxrwx.    1 root root    7 Jun 21  2021 lib -> usr/lib
lrwxrwxrwx.    1 root root    9 Jun 21  2021 lib64 -> usr/lib64
drwxr-xr-x.    2 root root   38 Dec  6 19:28 licenses
drwx------.    2 root root    6 Oct 19 04:46 lost+found
drwxr-xr-x.    2 root root    6 Jun 21  2021 media
drwxr-xr-x.    1 root root   42 Feb  7 13:08 mnt
drwxr-xr-x.    2 root root    6 Jun 21  2021 opt
dr-xr-xr-x. 2895 root root    0 Feb  7 13:06 proc
dr-xr-x---.    3 root root  213 Oct 19 04:57 root
drwxr-xr-x.    1 root root  136 Feb  7 13:09 run
lrwxrwxrwx.    1 root root    8 Jun 21  2021 sbin -> usr/sbin
drwxr-xr-x.    2 root root    6 Jun 21  2021 srv
dr-xr-xr-x.   13 root root    0 Feb  7 13:07 sys
drwxrwxrwx.    1 root root   18 Feb  7 13:09 tmp
drwxr-xr-x.    1 root root   65 Oct 19 04:47 usr
drwxr-xr-x.    1 root root   30 Oct 19 04:47 var

Is the driver container constantly restarting?

shivamerla · Feb 13 '23

I have executed `ls -la /run/nvidia/driver` on all GPU nodes in this cluster, and every node shows that the directory does not exist. Additionally, the driver pod has not restarted for a week. No pods are restarting except nvidia-dcgm, which is in CrashLoopBackOff; that in turn is caused by the many other NVIDIA pods being stuck in init.
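Roughly, for every node (the nvidia.com/gpu.present label is an assumption; any selector that matches the GPU nodes works):

# Check the driver rootfs mount on each GPU node from a debug pod
for node in $(oc get nodes -l nvidia.com/gpu.present=true -o name); do
  echo "== ${node} =="
  oc debug "${node}" -- chroot /host ls -la /run/nvidia/driver
done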

Alwinator · Feb 14 '23

I am seeing this exact same issue, except that it is not on OpenShift.

Additionally, I don't see a startupProbe set on the pods.
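A quick way to check (the DaemonSet name and namespace are whatever the operator created in your cluster):

# Print each driver container name together with its startupProbe, if any
kubectl -n gpu-operator get ds nvidia-driver-daemonset \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.startupProbe}{"\n"}{end}'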

warroyo · May 01 '23

+1

likku123 · Jun 15 '23

+1

dgabrysch · Dec 20 '23

I have encountered the same problem. Have you solved it?

FanKang2021 · Apr 15 '24

Was there a recent Linux kernel update on the node, or did the node restart? If so, please check the kernel version and its compatibility with the GPU Operator version you are trying to install.
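Something along these lines on the affected node is usually enough for that check:

# Kernel and OS release currently running on the node
uname -sr
cat /etc/os-release
# Compare these against the supported-platforms list for the GPU Operator release in use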

likku123 · Apr 15 '24

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set toolkit.enabled=false

I read on the official website that you can deploy this way without a driver, but my physical machine does not have the NVIDIA driver installed and an error is reported. The nvidia-operator-validator pod shows "running command chroot with args [/run/nvidia/driver nvidia-smi] chroot: failed to run command 'nvidia-smi': No such file or directory command failed, retrying after 5 seconds". Is my understanding wrong, and must the host have the driver installed?

FanKang2021 · Apr 15 '24

I believe you are trying to install the latest gpu-operator. Can you please provide the output of this command: uname -sra

Also, you should install with toolkit.enabled=true. I believe your node does not already have the toolkit installed.
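A minimal sketch of what I mean, using the chart's documented values (keep driver.enabled=true unless the NVIDIA driver is already installed on the host):

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set toolkit.enabled=true \
  --set driver.enabled=true   # set driver.enabled=false only for a host-installed driver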

likku123 · Apr 15 '24

> I believe you are trying to install the latest gpu-operator. Can you please provide the output of this command: uname -sra
>
> Also, you should install with toolkit.enabled=true. I believe your node does not already have the toolkit installed.

Thanks, I have solved it.

FanKang2021 · Apr 16 '24

> I believe you are trying to install the latest gpu-operator. Can you please provide the output of this command: uname -sra
>
> Also, you should install with toolkit.enabled=true. I believe your node does not already have the toolkit installed.

Hi, I have manually installed the NVIDIA GPU driver on the workstation. When I installed the latest gpu-operator (v23.9.2) with Helm, I set driver.enabled to false and left toolkit.enabled at its default of true. The installation succeeded and the GPU test pod also ran successfully, but after restarting the GPU node I encountered a similar problem.
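A few things worth checking on the node right after the reboot (the namespace and paths assume the chart defaults) would be:

# Confirm the pre-installed driver came back up after the reboot
nvidia-smi
# Operator pods and the validation status files the validator containers recreate
kubectl -n gpu-operator get pods -o wide
ls -la /run/nvidia/validations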

sunwuyan · Apr 25 '24