Nodes stuck in upgrade
In a Rancher-provisioned bare-metal cluster I have two GPU nodes that cannot finish an upgrade; their upgrade states are validation-required and pod-restart-required, respectively.
1. Quick Debug Information
- OS/Version: Ubuntu 22.04
- Kernel Version: 5.15.0-89-generic
- Container Runtime: Docker 24.0
- K8s Flavor: 1.24.17 (RKE)
- GPU Operator Version: 23.9.0
- Operator installation: Helm, with driver.enabled=false and toolkit.enabled=false (install command sketched below)
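For reference, the install was done roughly like the sketch below; the release name, namespace, and chart version are from memory and may differ in your setup. The relevant part is only the two ...enabled=false values, since driver and toolkit are managed manually on the hosts.

```sh
# Sketch of the Helm install (release name, namespace and chart version are
# placeholders from memory); driver and container toolkit are handled on the
# hosts, so the operator-managed versions are disabled.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.9.0 \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```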
2. Issue or feature description
I had one node running the GPU operator. I then added another GPU node, but the operator did not successfully pick up the new node. I then manually upgraded the driver and the container toolkit on the hosts. The two nodes now appear as "Ready", but the GPU operator state for the original node is now "validation-required" and for the new node it is "pod-restart-required". I currently have no GPU workloads in my cluster.
Are there any more manual steps needed to complete the upgrade?
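As far as I understand, these states are recorded as a node label maintained by the operator's upgrade controller (I believe the key is something like nvidia.com/gpu-driver-upgrade-state, but that is an assumption); this is how I checked it on the stuck nodes:

```sh
# Dump the labels of one of the stuck GPU nodes and grep for the
# operator's upgrade-state label; <gpu-node-name> is a placeholder.
kubectl get node <gpu-node-name> -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep -i upgrade
```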
4. Information to attach (optional if deemed irrelevant)
- [x] Kubernetes pods status:
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-ndl8l 0/1 Init:0/1 0 4m2s
gpu-operator-58bdb8567f-n2bkv 1/1 Running 3 (12h ago) 13h
gpu-operator-node-feature-discovery-gc-766df9cf89-796dd 1/1 Running 2 (12h ago) 13h
gpu-operator-node-feature-discovery-master-dcb5c5d74-5hf4m 1/1 Running 3 (12h ago) 13h
gpu-operator-node-feature-discovery-worker-54qql 1/1 Running 0 13h
gpu-operator-node-feature-discovery-worker-54zx6 1/1 Running 0 13h
gpu-operator-node-feature-discovery-worker-8pbzw 1/1 Running 0 13h
gpu-operator-node-feature-discovery-worker-k9npn 1/1 Running 0 13h
gpu-operator-node-feature-discovery-worker-swr78 1/1 Running 3 (12h ago) 13h
nvidia-dcgm-exporter-44ln8 0/1 Init:0/1 0 13h
nvidia-device-plugin-daemonset-jpn9s 0/1 Init:0/1 0 13h
nvidia-operator-validator-lc9x6 0/1 Init:0/4 0 13h
The logs of the toolkit-validation container in e.g. pod nvidia-device-plugin-daemonset-jpn9s have lots of messages saying:

waiting for nvidia container stack to be setup
In addition, the logs of the driver-validation container in pod nvidia-operator-validator-lc9x6 have suspicious messages like these:

running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
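Those excerpts come from the init containers; for completeness, something like the following shows them and also checks what is actually present under /run/nvidia/validations on the node. The -c container names are taken from the pods above, the rest is just my own debugging sketch:

```sh
# Init-container logs referenced above
kubectl -n gpu-operator logs nvidia-device-plugin-daemonset-jpn9s -c toolkit-validation
kubectl -n gpu-operator logs nvidia-operator-validator-lc9x6 -c driver-validation

# On the GPU node itself: which readiness files exist, if any
ls -la /run/nvidia/validations/ 2>/dev/null || echo "no validations directory"
```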
On the CLI of the nodes themselves, nvidia-smi reports the card and driver fine:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1070 On | 00000000:1C:00.0 Off | N/A |
| 0% 33C P8 12W / 170W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
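Since the driver and container toolkit were upgraded by hand and the runtime is Docker, one more sanity check worth running on each node is whether the NVIDIA runtime still injects the GPU into containers; a minimal sketch (the CUDA image tag is just an example):

```sh
# Verify the manually upgraded container toolkit still exposes the GPU to Docker.
# Any CUDA base image will do; the tag here is only an example.
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
```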
What is the output of which nvidia-smi when you run it on your node?
It is at /usr/bin/nvidia-smi
(on both nodes).
Sorry for the late response @LarsAC. Can you please run the must-gather.sh script and send the generated artifacts over to [email protected]?
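In case it helps the next reader, fetching and running the script looks roughly like this; the raw URL is my best guess at the script's location in the NVIDIA/gpu-operator repo, so double-check it against the branch you are using:

```sh
# Collect cluster/operator diagnostics for the NVIDIA team (verify the script path first).
curl -LO https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
```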
Thanks, archive sent.
I have encountered the same problem. Have you solved it?
Unfortunately not, and I have not tried further as I am out of ideas. I will probably reinstall the OS from scratch and try again.