containerd restarts at least once an hour
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-101-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd github.com/containerd/containerd v1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): Kubernetes v1.24.6
- GPU Operator Version: v23.9.2
2. Issue or feature description
containerd gets restarted at seemingly random times, but at least once every hour; it never runs for more than an hour without being restarted. As a result, the following Pods also get restarted:
- gpu-feature-discovery
- nvidia-container-toolkit-daemonset
- nvidia-cuda-validator
- nvidia-dcgm-exporter
- nvidia-device-plugin-daemonset
- nvidia-driver-daemonset
- nvidia-mig-manager
- nvidia-operator-validator
These are the last log events from all nvidia-container-toolkit-daemonset Pods:
time="2024-05-02T13:08:53Z" level=info msg="Setting up runtime"
time="2024-05-02T13:08:53Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2024-05-02T13:08:53Z" level=info msg="Successfully parsed arguments"
time="2024-05-02T13:08:53Z" level=info msg="Starting 'setup' for containerd"
time="2024-05-02T13:08:53Z" level=info msg="Loading config from /runtime/config-dir/config.toml"
time="2024-05-02T13:08:53Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2024-05-02T13:08:53Z" level=info msg="Sending SIGHUP signal to containerd"
time="2024-05-02T13:08:53Z" level=info msg="Waiting for signal"
time="2024-05-02T13:42:12Z" level=info msg="Cleaning up Runtime"
time="2024-05-02T13:42:12Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2024-05-02T13:42:12Z" level=info msg="Successfully parsed arguments"
time="2024-05-02T13:42:12Z" level=info msg="Starting 'cleanup' for containerd"
time="2024-05-02T13:42:12Z" level=info msg="Loading config from /runtime/config-dir/config.toml"
time="2024-05-02T13:42:12Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2024-05-02T13:42:12Z" level=info msg="Sending SIGHUP signal to containerd"
rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer
and this is when containerd gets restarted, as recorded by systemd on one of the nodes:
May 02 13:05:44 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:05:44 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:08:58 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:08:58 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:42:18 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:42:18 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:42:36 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:42:36 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:42:54 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:42:54 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:44:51 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:44:51 node16.example.com systemd[1]: Started containerd container runtime.
This happens at the same time on all nodes where we have GPUs.
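A minimal sketch for lining up the toolkit's SIGHUP messages with the systemd restart events on a node (the gpu-operator namespace and the app=nvidia-container-toolkit-daemonset label are assumptions; adjust to your deployment):

# On an affected GPU node: list containerd start/stop events with timestamps.
journalctl -u containerd --since "1 hour ago" | grep -E "Started|Stopped"

# How many times systemd has restarted the unit since boot.
systemctl show containerd --property=NRestarts

# Compare against the toolkit pods' restart counts and their SIGHUP log lines.
kubectl get pods -n gpu-operator -o wide | grep nvidia-container-toolkit
kubectl logs -n gpu-operator -l app=nvidia-container-toolkit-daemonset --tail=50 | grep SIGHUP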
3. Steps to reproduce the issue
Not sure how to reproduce it, but it happens at least once every hour.
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- [ ] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- [ ] If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [ ] If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs: journalctl -u containerd > containerd.log
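For convenience, the checklist above can be collected in one pass. A minimal sketch, assuming the operator runs in the gpu-operator namespace and the driver pods carry the app=nvidia-driver-daemonset label (adjust both to your cluster):

# Collect GPU Operator debug data in one pass.
NS=gpu-operator

kubectl get pods -n "$NS" -o wide > pods.txt
kubectl get ds -n "$NS" > daemonsets.txt

# Describe and fetch logs for every pod in the namespace
# (covers any pod stuck in an error or pending state).
for pod in $(kubectl get pods -n "$NS" -o name); do
  kubectl describe -n "$NS" "$pod" >> describe.txt
  kubectl logs -n "$NS" "$pod" --all-containers=true >> logs.txt || true
done

# nvidia-smi from the driver container (first driver pod found).
driver_pod=$(kubectl get pods -n "$NS" -l app=nvidia-driver-daemonset -o name | head -n1)
kubectl exec -n "$NS" "$driver_pod" -c nvidia-driver-ctr -- nvidia-smi > nvidia-smi.txt

# containerd logs (run on the affected node itself).
journalctl -u containerd > containerd.log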
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
@tatodorov can you set the env RUNTIME_RESTART_MODE to none under toolkit.env in the ClusterPolicy and verify whether this issue persists? The toolkit reloads containerd when it applies the NVIDIA-specific runtime config, but this should not cause repeated restarts. There were a few race-condition issues with containerd v1.6.8 in the past that were fixed in later versions.
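For reference, one way to apply this is a merge patch on the ClusterPolicy; a minimal sketch, assuming the default resource name cluster-policy:

# Set RUNTIME_RESTART_MODE=none on the toolkit via the ClusterPolicy.
# "cluster-policy" is the default name created by the helm chart; adjust if yours differs.
# Note: a merge patch replaces the whole toolkit.env list, so include any other
# env entries you already have.
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"toolkit":{"env":[{"name":"RUNTIME_RESTART_MODE","value":"none"}]}}}'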
@shivamerla, thank you very much for the assistance! I've made the change and will monitor the behavior. This is a snippet of the ClusterPolicy:
toolkit:
  enabled: true
  env:
    - name: RUNTIME_RESTART_MODE
      value: none
  image: container-toolkit
  imagePullPolicy: IfNotPresent
  installDir: /usr/local/nvidia
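A minimal sketch of the checks for verifying the change and watching for further restarts, assuming the operator namespace is gpu-operator:

# Confirm the env var made it into the toolkit daemonset.
kubectl describe ds nvidia-container-toolkit-daemonset -n gpu-operator | grep RUNTIME_RESTART_MODE

# On a GPU node: check whether systemd is still recording containerd restarts.
systemctl show containerd --property=NRestarts
journalctl -u containerd --since "2 hours ago" | grep -E "Started|Stopped"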
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
This issue has been open for over 90 days without recent updates, and the context may now be outdated.
Given that gpu-operator v23.9.2 is EOL now, I would encourage you to try the latest version and see if you still see this issue.
If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.