Pods stuck in Terminating after upgrade to v1.11.1
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? `ipmi_msghandler` is running, but not `i2c_core`
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)
1. Issue or feature description
After upgrading from gpu-operator v1.10.0 to v1.11.1, the stack does not seem to come up cleanly without manual intervention. I end up with the gpu-feature-discovery, nvidia-dcgm-exporter and nvidia-device-plugin-daemonset pods stuck in Terminating. Manually restarting the container toolkit by either deleting the nvidia-container-toolkit-daemonset pod or doing kubectl rollout restart daemonset nvidia-container-toolkit-daemonset seems to resolve the problem, but I shouldn't need to manually intervene.
2. Steps to reproduce the issue
Install GPU operator from the helm chart (via Argo CD) using these values:
nfd:
  enabled: true
mig:
  strategy: mixed
driver:
  version: "515.48.07"
  rdma:
    enabled: false
  manager:
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
toolkit:
  enabled: true
  version: "v1.10.0-centos7"
dcgmExporter:
  version: "2.4.5-2.6.7-ubuntu20.04"
migManager:
  enabled: true
  config:
    name: mig-parted-config
vgpuManager:
  enabled: false
vgpuDeviceManager:
  enabled: false
vfioManager:
  enabled: false
sandboxDevicePlugin:
  enabled: false
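For reference, a plain Helm equivalent of the Argo CD deployment would look roughly like this (the repo URL and release name are the standard ones and are assumptions on my part; the actual rollout goes through Argo CD with the values above saved as values.yaml):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace nvidia-gpu-operator --create-namespace \
  --values values.yaml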
Remove and re-add the operands from a node:
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands=false
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands-
After waiting a few minutes for the driver and toolkit pods to become ready, several pods are stuck in Terminating and the cuda-validator pod is in Init:Error.
kubectl get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-tz7kr 0/1 Terminating 0 3m58s
gpu-operator-84d9f557c8-2jtdp 1/1 Running 0 125m
nvidia-container-toolkit-daemonset-d6mzq 1/1 Running 0 3m58s
nvidia-cuda-validator-47lz5 0/1 Init:Error 4 103s
nvidia-dcgm-exporter-zhwr6 0/1 Terminating 0 3m58s
nvidia-device-plugin-daemonset-r4gfd 0/1 Terminating 1 3m58s
nvidia-device-plugin-validator-qkq9d 0/1 Completed 0 99m
nvidia-driver-daemonset-78dn5 1/1 Running 0 3m58s
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4jdtgj 1/1 Running 0 125m
nvidia-gpu-operator-node-feature-discovery-worker-j6zwm 1/1 Running 2 124m
nvidia-gpu-operator-node-feature-discovery-worker-mk9qd 1/1 Running 0 124m
nvidia-gpu-operator-node-feature-discovery-worker-xxvgd 1/1 Running 0 124m
nvidia-mig-manager-hhdq6 1/1 Running 0 87s
nvidia-operator-validator-p4h7q 0/1 Init:2/4 0 3m49s
I don't see anything unusual in either the driver or container toolkit pod logs.
If I manually restart the container toolkit and wait a few minutes, everything comes up as expected.
kubectl rollout restart daemonset nvidia-container-toolkit-daemonset -n nvidia-gpu-operator
kubectl get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-rml5z 1/1 Running 0 114s
gpu-operator-84d9f557c8-2jtdp 1/1 Running 0 134m
nvidia-container-toolkit-daemonset-9k8mg 1/1 Running 0 67s
nvidia-cuda-validator-gzjrj 0/1 Completed 0 37s
nvidia-dcgm-exporter-wmzcz 1/1 Running 0 114s
nvidia-device-plugin-daemonset-mpmdz 1/1 Running 0 114s
nvidia-device-plugin-validator-425b8 0/1 Completed 0 30s
nvidia-driver-daemonset-78dn5 1/1 Running 0 12m
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4jdtgj 1/1 Running 0 134m
nvidia-gpu-operator-node-feature-discovery-worker-j6zwm 1/1 Running 2 133m
nvidia-gpu-operator-node-feature-discovery-worker-mk9qd 1/1 Running 0 133m
nvidia-gpu-operator-node-feature-discovery-worker-xxvgd 1/1 Running 0 133m
nvidia-mig-manager-hhdq6 1/1 Running 0 10m
nvidia-operator-validator-ztbjv 1/1 Running 0 77s
3. Information to attach (optional if deemed irrelevant)
- [x] kubernetes daemonset status: `kubectl get ds --all-namespaces`
kubectl get ds -n nvidia-gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 0 0 0 0 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 125m
nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 125m
nvidia-dcgm-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm-exporter=true 125m
nvidia-device-plugin-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true 125m
nvidia-driver-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.driver=true 125m
nvidia-gpu-operator-node-feature-discovery-worker 3 3 3 3 3 <none> 17d
nvidia-mig-manager 1 1 1 1 1 nvidia.com/gpu.deploy.mig-manager=true 125m
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 125m
- [x] If a pod/ds is in an error state or pending state: `kubectl describe pod -n NAMESPACE POD_NAME` (attached: describe-cuda-validator.txt, describe-gpu-feature-discovery.txt, describe-device-plugin.txt)
- [x] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME` (attached: logs-cuda-validator.txt, logs-driver.txt, logs-container-toolkit.txt)
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [x] NVIDIA shared directory: `ls -la /run/nvidia` (attached: shared.txt)
- [x] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit` (attached: packages.txt)
- [x] NVIDIA driver directory: `ls -la /run/nvidia/driver` (attached: driver.txt)
- [x] kubelet logs: `journalctl -u kubelet > kubelet.logs` (attached: kubelet.logs.gz)
Other info: Kubernetes 1.21.10, containerd 1.6.1, CentOS 7.9.2009 (containerd config attached: containerd-config.toml.txt)
@neggert Thanks for reporting this, will try to reproduce. Question: why was this step done after the upgrade to v1.11.1?
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands=false
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands-
I wanted to check to see if the issue was a result of upgrading existing nodes in place. In an attempt to rule that out, I used the label to completely remove the GPU operator from the node, then re-deploy it. In the past, I've found that this is a good way to "reset" anything related to the GPU operator that gets into a weird state.
I get the same result whether I include that step or not, so I don't think the issue is related to the upgrade process.
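For completeness, the reset sequence is just the two label commands from above, with a wait in between to confirm the operand pods actually drain (the node name is specific to my cluster):
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands=false
# wait until the operand pods on the node are gone
kubectl -n nvidia-gpu-operator get pods -o wide -w
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands-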
@shivamerla Any luck in reproducing this? Happy to provide more info if you let me know what you need.
@neggert Can you attach /var/log/messages or logs from journalctl -xb > journal.log? This might help us understand whether containerd is reloaded correctly after the toolkit upgrade. If it got into an error state after the first reset, that might explain why containers were not reaped correctly.
toolkit log:
time="2022-08-26T18:19:31Z" level=info msg="Successfully loaded config"
time="2022-08-26T18:19:31Z" level=info msg="Config version: 2"
time="2022-08-26T18:19:31Z" level=info msg="Updating config"
time="2022-08-26T18:19:31Z" level=info msg="Successfully updated config"
time="2022-08-26T18:19:31Z" level=info msg="Flushing config"
time="2022-08-26T18:19:31Z" level=info msg="Successfully flushed config"
time="2022-08-26T18:19:31Z" level=info msg="Sending SIGHUP signal to containerd"
time="2022-08-26T18:19:31Z" level=info msg="Successfully signaled containerd"
time="2022-08-26T18:19:31Z" level=info msg="Completed 'setup' for containerd"
time="2022-08-26T18:19:31Z" level=info msg="Waiting for signal"
but the device plugin is stuck in Terminating because containerd is not ready yet:
Warning FailedCreatePodSandBox 4m57s kubelet Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: connection refused"
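In case it helps narrow down whether containerd itself went down (rather than the socket just flapping during the SIGHUP), these are generic systemd/crictl checks that can be run on the node; none of them are gpu-operator specific:
systemctl status containerd --no-pager
journalctl -u containerd -b --no-pager | tail -n 50
# confirm the CRI endpoint answers again after the reload
crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock info > /dev/null && echo "CRI reachable"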
Getting slightly different behavior when I went to reproduce this again; now things are stuck in a crash loop:
gpu-feature-discovery-6rb27 0/1 Init:CrashLoopBackOff 11 37m
gpu-operator-84d9f557c8-gp9p4 1/1 Running 0 37m
nvidia-container-toolkit-daemonset-blbt4 1/1 Running 0 37m
nvidia-dcgm-exporter-cwnhb 0/1 Init:CrashLoopBackOff 11 37m
nvidia-device-plugin-daemonset-gsbdr 0/1 Init:CrashLoopBackOff 11 37m
nvidia-driver-daemonset-7hq8v 1/1 Running 0 37m
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4tsqxv 1/1 Running 0 37m
nvidia-gpu-operator-node-feature-discovery-worker-kkdcp 1/1 Running 0 37m
nvidia-gpu-operator-node-feature-discovery-worker-m9qpd 1/1 Running 0 37m
nvidia-gpu-operator-node-feature-discovery-worker-tqhj2 1/1 Running 0 37m
nvidia-mig-manager-m6nlc 0/1 Init:CrashLoopBackOff 11 37m
nvidia-operator-validator-j24lb 0/1 Init:CrashLoopBackOff 11 36m
Events look like this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m54s default-scheduler Successfully assigned nvidia-gpu-operator/nvidia-device-plugin-daemonset-gsbdr to dev-worker-gpu-0
Warning FailedCreatePodSandBox 3m48s (x11 over 5m54s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 3m33s kubelet Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: connection refused"
Normal Pulled 3m2s (x3 over 3m20s) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1" already present on machine
Normal Created 3m2s (x3 over 3m20s) kubelet Created container toolkit-validation
Warning Failed 3m1s (x3 over 3m20s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: mount error: stat failed: /run/nvidia/driver/proc/driver/nvidia/gpus/0000:13:00.0: no such file or directory: unknown
Warning BackOff 48s (x13 over 3m19s) kubelet Back-off restarting failed container
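The "no runtime for nvidia is configured" event points at the containerd config; a quick way to check it directly on the node (assuming the default config path used here):
# is the nvidia runtime registered in the containerd config the kubelet uses?
grep -n -A 3 'nvidia' /etc/containerd/config.toml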
As before, restarting the container-toolkit pod resolves the problem.
journal logs are attached. journal.log.gz
The above problem seems to happen because the driver container is not populating /run/nvidia/driver/proc/driver/nvidia. Rebooting the node seems to resolve that. Not ideal, but I think it might be a separate issue.
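A quick way to see whether that directory is there before resorting to a reboot (path taken from the mount error above):
ls /run/nvidia/driver/proc/driver/nvidia/gpus/ \
  || echo "driver container has not populated its /proc tree yet"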
I see pods stuck in Terminating when the node comes up after reboot, so it's actually a great way to get some nice clean logs that demonstrate the problem.
Pods:
gpu-feature-discovery-mcfqc 0/1 Terminating 0 25m
gpu-operator-84d9f557c8-fw8sv 1/1 Running 0 25m
nvidia-container-toolkit-daemonset-f24pl 1/1 Running 1 24m
nvidia-cuda-validator-l4gqp 0/1 Init:CrashLoopBackOff 2 40s
nvidia-dcgm-exporter-ms8td 0/1 Terminating 5 25m
nvidia-device-plugin-daemonset-jmkl5 0/1 Terminating 1 25m
nvidia-driver-daemonset-dmgfx 1/1 Running 1 24m
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4p6dmn 1/1 Running 0 25m
nvidia-gpu-operator-node-feature-discovery-worker-2jpxg 1/1 Running 0 24m
nvidia-gpu-operator-node-feature-discovery-worker-cq2pz 1/1 Running 1 25m
nvidia-gpu-operator-node-feature-discovery-worker-hhwz6 1/1 Running 1 25m
nvidia-mig-manager-xtvlb 1/1 Running 0 25m
nvidia-operator-validator-9cgqc 0/1 Init:2/4 1 8m35s
Logs: journal.log.gz
@neggert Can you set an env var for the toolkit with --set toolkit.env[0].name=CONTAINERD_RESTART_MODE --set toolkit.env[0].value=none?
This will avoid containerd reloads in your case, and for upgrades we don't really need a reload since the nvidia-container-runtime binary path stays the same. We will continue to look into the root cause of this.
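For anyone setting this through a values file instead of --set, the equivalent should be the following (a direct translation of the flags above, not separately verified):
toolkit:
  env:
    - name: CONTAINERD_RESTART_MODE
      value: "none"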
@shivamerla That does solve the problem with the upgrade, but it means that we need to manually log into the node to restart containerd when adding, removing, or reconfiguring the container toolkit. I'd rather restart the daemonset :P.
@neggert Agreed, this is just a workaround until we figure out why containerd reloads are causing issues in your case. Currently we don't modify the runtime config over operator/toolkit upgrades; it remains the same, but we still do a runtime reload, and the above workaround avoids that. That said, the toolkit config might change with future versions, so a runtime reload is still required in those cases.
@neggert Can you set an env var for the toolkit with --set toolkit.env[0].name=CONTAINERD_RESTART_MODE --set toolkit.env[0].value=none? This will avoid containerd reloads in your case, and for upgrades we don't really need a reload since the nvidia-container-runtime binary path stays the same. We will continue to look into the root cause of this.
I set this env var for the toolkit; the toolkit was up, but the dcgm-exporter and device plugin failed with this error:
kubectl describe po nvidia-dcgm-exporter-bhl8f -n gpu
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
If I unset the CONTAINERD_RESTART_MODE env var for the toolkit, all the GPU-related pods start running.
@alloydm that workaround was mentioned specifically for a case where containerd was not handling restarts properly. Were you seeing the same behavior, such that you needed to apply this? By default we want the container-toolkit to be able to reload containerd after applying the config for nvidia-container-runtime; without that, none of the other operator pods would come up.
@shivamerla Oh okay, thank you for the clarification. I thought that with the latest change there was no need to restart containerd. Thanks for the quick update.
@shivamerla Is there any way or feature to skip the containerd reload (restart) and handle the nvidia-container-runtime config some other way?