
Back-off restarting failed container nvidia-device-plugin-ctr

A-Akhil opened this issue 1 year ago · 4 comments

While setting up Kubernetes (K8s) on a DGX A100 server, the Helm install of nvidia-device-plugin fails and we get the following error:

kubectl get pods -A

NAMESPACE      NAME                                                               READY   STATUS             RESTARTS         AGE
kube-flannel   kube-flannel-ds-2ss2c                                              1/1     Running            1 (3d21h ago)    3d22h
kube-flannel   kube-flannel-ds-9cwh9                                              1/1     Running            0                3d22h
kube-system    coredns-787d4945fb-9rcpx                                           1/1     Running            0                3d22h
kube-system    coredns-787d4945fb-9scjh                                           1/1     Running            0                3d22h
kube-system    etcd-sybsramma-virtual-machine                                     1/1     Running            0                3d22h
kube-system    gpu-feature-discovery-1712918793-gpu-feature-discovery-dr6ht      1/1     Running            0                3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-marrffd   1/1     Running            0                3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-womw95r   1/1     Running            1 (3d21h ago)    3d21h
root@sybsramma-virtual-machine:~#                                                                                             3d22h
kube-system    kube-controller-manager-sybsramma-virtual-machine                  1/1     Running            0                3d22h
kube-system    kube-proxy-hnb42                                                   1/1     Running            0                3d22h
kube-system    kube-proxy-s7q7h                                                   1/1     Running            1 (3d21h ago)    3d22h
kube-system    kube-scheduler-sybsramma-virtual-machine                           1/1     Running            0                3d22h
kube-system    nvidia-device-plugin-1712918682-bs4vf                              0/1     CrashLoopBackOff   1104 (23s ago)   3d21h

kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system

Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
Status:               Running
IP:
IPs:
  IP:
Controlled By:  DaemonSet/nvidia-device-plugin-1712918682
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://9ad6475f973adb6fb463acff145cb7609e0a2e728d12a0c4ae9cf77ed2201cde
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
    Port:
    Host Port:
    Command:
      nvidia-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 16 Apr 2024 13:40:32 +0530
      Finished:     Tue, 16 Apr 2024 13:40:32 +0530
    Ready:          False
    Restart Count:  1100
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9php (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-r9php:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                         From     Message
  Warning  BackOff  3m20s (x25802 over 3d21h)   kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
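The events above only show the back-off; the actual failure reason is in the container's own log. A minimal way to pull it (a sketch, using the pod name from the output above):

```bash
# Log of the current (or most recent) container attempt
kubectl -n kube-system logs nvidia-device-plugin-1712918682-bs4vf

# Log of the previous crashed attempt, once the container has restarted at least once
kubectl -n kube-system logs nvidia-device-plugin-1712918682-bs4vf --previous
```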

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): 20.04.1-Ubuntu
  • Kernel Version: 5.15.0-102-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd version 2
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS):

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T01:05:39Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T00:54:27Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}

  • GPU Operator Version: (nvidia-smi output from the server) NVIDIA-SMI 470.141.10, Driver Version: 470.141.10, CUDA Version: 11.4

A-Akhil · Apr 16 '24 08:04

@A-Akhil What does the log from the device-plugin pod say? Also, please report this in the NVIDIA/k8s-device-plugin repository, since this is a standalone device-plugin installation.

shivamerla · Apr 23 '24 03:04

@shivamerla
This is the command we used to install the nvidia-device-plugin with Helm:

helm install \
  --version=0.15.0-rc.2 \
  --generate-name \
  --namespace kube-system \
  --create-namespace \
  --set migStrategy=single \
  nvdp/nvidia-device-plugin

This is the log:

root@sybsramma-virtual-machine:~# kubectl logs nvidia-device-plugin-1712918682-bs4vf -n kube-system

I0417 03:40:28.204942       1 main.go:178] Starting FS watcher.
I0417 03:40:28.205038       1 main.go:185] Starting OS watcher.
I0417 03:40:28.205179       1 main.go:200] Starting Plugins.
I0417 03:40:28.205216       1 main.go:257] Loading configuration.
I0417 03:40:28.205615       1 main.go:265] Updating config with default resource matching patterns.
I0417 03:40:28.205917       1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0417 03:40:28.205925       1 main.go:279] Retrieving plugins.
W0417 03:40:28.205971       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0417 03:40:28.205998       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0417 03:40:28.206027       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0417 03:40:28.206033       1 factory.go:112] Incompatible platform detected
E0417 03:40:28.206037       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0417 03:40:28.206041       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0417 03:40:28.206046       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0417 03:40:28.206051       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0417 03:40:28.213668       1 main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
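The "could not load NVML library: libnvidia-ml.so.1" and "did you configure the NVIDIA Container Toolkit?" lines point at the container runtime: the plugin pod is being started without the NVIDIA runtime, so the driver libraries are never injected into the container. A sketch of one common fix on containerd nodes (assuming the nvidia-container-toolkit package is already installed; flags and paths may differ on DGX OS):

```bash
# Register the NVIDIA runtime with containerd and make it the default runtime,
# then restart containerd so newly created containers pick it up.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

# Sanity check: the nvidia runtime should now appear in the containerd config.
grep -n nvidia /etc/containerd/config.toml

# Delete the crashing pod so the DaemonSet recreates it under the new runtime.
kubectl -n kube-system delete pod nvidia-device-plugin-1712918682-bs4vf
```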

And this is the describe output:

root@sybsramma-virtual-machine:~# kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name:                 nvidia-device-plugin-1712918682-bs4vf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 dgxa100/172.16.0.32
Start Time:           Fri, 12 Apr 2024 16:16:59 +0530
Labels:               app.kubernetes.io/instance=nvidia-device-plugin-1712918682
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=665b565fc7
                      pod-template-generation=1
Annotations:
Status:               Running
IP:                   10.244.1.47
IPs:
  IP:  10.244.1.47
Controlled By:  DaemonSet/nvidia-device-plugin-1712918682
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://62c676b690d74f0591ba712abb5d0fb567c7ab0c9d65e56a188b5b51f9c65ade
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
    Port:
    Host Port:
    Command:
      nvidia-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Apr 2024 09:15:30 +0530
      Finished:     Wed, 17 Apr 2024 09:15:30 +0530
    Ready:          False
    Restart Count:  1330
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9php (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-r9php:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                         From     Message
  Warning  BackOff  4m13s (x31185 over 4d16h)   kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
root@sybsramma-virtual-machine:~#

Kubernetes_Documentation.pdf

This is the documentation I used to install the plugin. In it, we chose MIG enabled with the same instance type on all GPUs for the Helm install.
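As a side note on migStrategy=single from the Helm command above: with that strategy, a MIG-enabled node is expected to have all GPUs in MIG mode with identical instance profiles. A hedged way to inspect that on the DGX node:

```bash
# List GPUs and any MIG devices currently exposed by the driver
nvidia-smi -L

# List the MIG GPU instances that have been created on each GPU
sudo nvidia-smi mig -lgi
```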

A-Akhil · Apr 23 '24 04:04

Are the NVIDIA drivers and the container toolkit set up correctly on the node?
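One way to double-check that on the node itself (a sketch; the CUDA image tag is only an example, and it assumes containerd with the NVIDIA Container Toolkit installed):

```bash
# Driver visible on the host?
nvidia-smi

# NVIDIA Container Toolkit CLI installed?
nvidia-ctk --version

# Can a container see the GPUs through containerd?
sudo ctr image pull docker.io/nvidia/cuda:11.4.3-base-ubuntu20.04
sudo ctr run --rm -t --gpus 0 docker.io/nvidia/cuda:11.4.3-base-ubuntu20.04 gpu-smoke-test nvidia-smi
```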

shivamerla · Apr 23 '24 05:04

@shivamerla yes, it's set up properly.

A-Akhil · Apr 27 '24 04:04

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] · Nov 05 '25 00:11

This issue has been open for over 90 days without recent updates, and the context may now be outdated.

Given that this issue is 1.5 years old and there have been no updates since then, I would encourage you to try the latest version and see if you still hit this issue.

If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.
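For anyone landing here later, retrying with a current device-plugin release looks roughly like this (a sketch; the release name and values mirror the original install, so check the chart repo for the actual latest version):

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set migStrategy=single
```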

rahulait · Nov 14 '25 07:11