
Nodes stuck in upgrade

Open LarsAC opened this issue 1 year ago • 7 comments

In a Rancher-provisioned bare-metal cluster I have two GPU nodes that cannot finish the upgrade; their upgrade status is validation-required and pod-restart-required.

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-89-generic
  • Container Runtime: Docker 24.0
  • K8s Flavor: 1.24.17 (RKE)
  • GPU Operator Version: 23.9.0
  • Operator installation: helm with driver.enabled=false and toolkit.enabled=false

2. Issue or feature description

I had one node running the GPU operator. I then added another GPU node, but the operator did not successfully pick up the new node. I then manually upgraded the driver and the container toolkit on the nodes. Both nodes now appear as "Ready", but the operator's upgrade state for the original node is now "validation-required" and for the new node it is "pod-restart-required". I currently have no GPU workloads in the cluster.

Are there any further manual steps needed to complete the upgrade?
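
For reference, the per-node state can also be read back from node labels; a rough sketch, assuming the upgrade controller records it in the nvidia.com/gpu-driver-upgrade-state label (the label name is an assumption on my part):

# List the assumed driver-upgrade state label across all nodes
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state

# Dump all nvidia.com labels on one of the affected nodes
kubectl get node <gpu-node-name> -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep nvidia.com
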

4. Information to attach (optional if deemed irrelevant)

  • [x] Kubernetes pods status: kubectl get pods -n gpu-operator
gpu-feature-discovery-ndl8l                                  0/1     Init:0/1   0             4m2s
gpu-operator-58bdb8567f-n2bkv                                1/1     Running    3 (12h ago)   13h
gpu-operator-node-feature-discovery-gc-766df9cf89-796dd      1/1     Running    2 (12h ago)   13h
gpu-operator-node-feature-discovery-master-dcb5c5d74-5hf4m   1/1     Running    3 (12h ago)   13h
gpu-operator-node-feature-discovery-worker-54qql             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-54zx6             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-8pbzw             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-k9npn             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-swr78             1/1     Running    3 (12h ago)   13h
nvidia-dcgm-exporter-44ln8                                   0/1     Init:0/1   0             13h
nvidia-device-plugin-daemonset-jpn9s                         0/1     Init:0/1   0             13h
nvidia-operator-validator-lc9x6                              0/1     Init:0/4   0             13h

Logs of the toolkit-validation container in e.g. pod nvidia-device-plugin-daemonset-jpn9s contain many messages saying "waiting for nvidia container stack to be setup".

In addition, logs of the driver-validation container in pod nvidia-operator-validator-lc9x6 show suspicious messages like this:

running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
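
Since both the driver and the toolkit are managed outside the operator here (driver.enabled=false, toolkit.enabled=false), it seemed worth checking on the node which validation markers exist and whether Docker actually has the nvidia runtime registered; a rough sketch, with paths taken from the log above and from a default nvidia-container-toolkit install (so treat them as assumptions):

# On the affected node: which validation marker files has the operator created so far?
ls -la /run/nvidia/validations/

# Is the nvidia runtime registered with Docker? (the toolkit was installed manually)
cat /etc/docker/daemon.json
docker info | grep -i runtime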

On the CLI of the nodes themselves, nvidia-smi reports the card and driver fine:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070        On  | 00000000:1C:00.0 Off |                  N/A |
|  0%   33C    P8              12W / 170W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

LarsAC avatar Dec 04 '23 08:12 LarsAC

What is the output of which nvidia-smi when you run it on your node?
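
It might also help to verify, outside of the operator, that Docker can reach the GPU through the nvidia runtime; a rough sketch, where the image tag is only an example and I am assuming the runtime was registered when you installed the toolkit manually:

# On the node: can a plain CUDA container see the GPU?
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi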

tariq1890 avatar Dec 04 '23 18:12 tariq1890

It is at /usr/bin/nvidia-smi (on both nodes).

LarsAC avatar Dec 04 '23 18:12 LarsAC

Sorry for the late response @LarsAC. Can you please run the must-gather.sh script and send the generated artifacts over to [email protected]?
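
In case it helps, something along these lines should work from a checkout of the gpu-operator sources (the exact path of the script inside the repo is an assumption):

# Run the must-gather script to collect operator and node debug information
./hack/must-gather.sh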

tariq1890 avatar Dec 15 '23 22:12 tariq1890

Thanks, archive sent.

LarsAC avatar Dec 22 '23 20:12 LarsAC

I have encountered the same problem. Have you solved it?

FanKang2021 avatar Apr 15 '24 08:04 FanKang2021

Unfortunately not, but I have not tried further as I am out of ideas. I will probably reinstall the OS from scratch and try again.
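
Before going that far, I may first try to clear the upgrade state by hand and let the validator re-run; a rough sketch of what I have in mind, where the label name/value and the pod selector are assumptions on my part:

# Reset the (assumed) per-node upgrade-state label and restart the validator pod
kubectl label node <gpu-node-name> nvidia.com/gpu-driver-upgrade-state=upgrade-done --overwrite
kubectl delete pod -n gpu-operator -l app=nvidia-operator-validator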

LarsAC avatar Apr 16 '24 19:04 LarsAC