
containerd restarts at least once an hour

Opened by tatodorov

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: 5.15.0-101-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd github.com/containerd/containerd v1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Kubernetes v1.24.6
  • GPU Operator Version: v23.9.2

2. Issue or feature description

containerd gets restarted randomly, but at least once every hour; it never runs for more than an hour without a restart. As a result, the following Pods also get restarted:

  • gpu-feature-discovery
  • nvidia-container-toolkit-daemonset
  • nvidia-cuda-validator
  • nvidia-dcgm-exporter
  • nvidia-device-plugin-daemonset
  • nvidia-driver-daemonset
  • nvidia-mig-manager
  • nvidia-operator-validator

These are the last log events from all nvidia-container-toolkit-daemonset Pods:

time="2024-05-02T13:08:53Z" level=info msg="Setting up runtime"
time="2024-05-02T13:08:53Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2024-05-02T13:08:53Z" level=info msg="Successfully parsed arguments"
time="2024-05-02T13:08:53Z" level=info msg="Starting 'setup' for containerd"
time="2024-05-02T13:08:53Z" level=info msg="Loading config from /runtime/config-dir/config.toml"
time="2024-05-02T13:08:53Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2024-05-02T13:08:53Z" level=info msg="Sending SIGHUP signal to containerd"
time="2024-05-02T13:08:53Z" level=info msg="Waiting for signal"
time="2024-05-02T13:42:12Z" level=info msg="Cleaning up Runtime"
time="2024-05-02T13:42:12Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2024-05-02T13:42:12Z" level=info msg="Successfully parsed arguments"
time="2024-05-02T13:42:12Z" level=info msg="Starting 'cleanup' for containerd"
time="2024-05-02T13:42:12Z" level=info msg="Loading config from /runtime/config-dir/config.toml"
time="2024-05-02T13:42:12Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2024-05-02T13:42:12Z" level=info msg="Sending SIGHUP signal to containerd"
rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer

and this is when containerd gets restarted:

May 02 13:05:44 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:05:44 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:08:58 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:08:58 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:42:18 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:42:18 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:42:36 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:42:36 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:42:54 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:42:54 node16.example.com systemd[1]: Started containerd container runtime.
May 02 13:44:51 node16.example.com systemd[1]: Stopped containerd container runtime.
May 02 13:44:51 node16.example.com systemd[1]: Started containerd container runtime.

This happens at the same time on all nodes where we have GPUs.

3. Steps to reproduce the issue

Not sure how to reproduce it, but it happens at least once every hour.

4. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • [ ] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • [ ] If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • [ ] If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • [ ] containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

tatodorov (May 02 '24 15:05)

@tatodorov can you set the env var RUNTIME_RESTART_MODE to none under toolkit.env in the ClusterPolicy and verify whether the issue persists? The toolkit reloads containerd when it applies the NVIDIA-specific runtime config, but that should not cause repeated restarts. We had a few race-condition issues with containerd v1.6.8 in the past, which were fixed later on.
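
For reference, a minimal way to apply this is a merge patch against the ClusterPolicy (cluster-policy is the usual default name; confirm it with kubectl get clusterpolicy). Note that a merge patch replaces the whole toolkit.env list, so any existing entries there need to be included as well:

# Set RUNTIME_RESTART_MODE=none in the toolkit section of the ClusterPolicy
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"toolkit":{"env":[{"name":"RUNTIME_RESTART_MODE","value":"none"}]}}}'

# Or edit the resource interactively and add the env entry by hand
kubectl edit clusterpolicy cluster-policy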

shivamerla (May 17 '24 15:05)

@shivamerla, thank you very much for the assistance! I've made the change and will monitor the behavior. This is a snippet of the ClusterPolicy:

toolkit:
  enabled: true
  env:
  - name: RUNTIME_RESTART_MODE
    value: none
  image: container-toolkit
  imagePullPolicy: IfNotPresent
  installDir: /usr/local/nvidia
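
To confirm the change takes effect, restarts can be watched on both sides. This is just a sketch: the gpu-operator namespace and the app=nvidia-container-toolkit-daemonset pod label below are assumptions, so adjust them to your deployment:

# On a GPU node: count containerd restarts recorded since the change was applied
journalctl -u containerd --since "2024-05-17 16:00" | grep -c "Started containerd container runtime"

# From the cluster: check the restart counters of the toolkit pods
kubectl get pods -n gpu-operator -l app=nvidia-container-toolkit-daemonset \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount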

tatodorov (May 17 '24 16:05)

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] (Nov 05 '25 00:11)

This issue has been open for over 90 days without recent updates, and the context may now be outdated.

Given that gpu-operator v23.9.2 is now EOL, I would encourage you to try the latest version and see whether the issue still occurs.

If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.

rahulait (Nov 14 '25 04:11)