Back-off restarting failed container nvidia-device-plugin-ctr
While installing Kubernetes (K8s) on a DGX A100 server, the Helm install of nvidia-device-plugin fails and we get the following error:
kubectl get pods -A
NAMESPACE      NAME                                                              READY   STATUS             RESTARTS         AGE
kube-flannel   kube-flannel-ds-2ss2c                                             1/1     Running            1 (3d21h ago)    3d22h
kube-flannel   kube-flannel-ds-9cwh9                                             1/1     Running            0                3d22h
kube-system    coredns-787d4945fb-9rcpx                                          1/1     Running            0                3d22h
kube-system    coredns-787d4945fb-9scjh                                          1/1     Running            0                3d22h
kube-system    etcd-sybsramma-virtual-machine                                    1/1     Running            0                3d22h
kube-system    gpu-feature-discovery-1712918793-gpu-feature-discovery-dr6ht      1/1     Running            0                3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-marrffd   1/1     Running            0                3d21h
kube-system    gpu-feature-discovery-1712918793-node-feature-discovery-womw95r   1/1     Running            1 (3d21h ago)    3d21h
kube-system    kube-controller-manager-sybsramma-virtual-machine                 1/1     Running            0                3d22h
kube-system    kube-proxy-hnb42                                                  1/1     Running            0                3d22h
kube-system    kube-proxy-s7q7h                                                  1/1     Running            1 (3d21h ago)    3d22h
kube-system    kube-scheduler-sybsramma-virtual-machine                          1/1     Running            0                3d22h
kube-system    nvidia-device-plugin-1712918682-bs4vf                             0/1     CrashLoopBackOff   1104 (23s ago)   3d21h
kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name: nvidia-device-plugin-1712918682-bs4vf
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: default
Node: dgxa100/
Start Time: Fri, 12 Apr 2024 16:16:59 +0530
Labels: app.kubernetes.io/instance=nvidia-device-plugin-1712918682
app.kubernetes.io/name=nvidia-device-plugin
controller-revision-hash=665b565fc7
pod-template-generation=1
Annotations:
IPs:
IP:
Controlled By: DaemonSet/nvidia-device-plugin-1712918682
Containers:
nvidia-device-plugin-ctr:
Container ID: containerd://9ad6475f973adb6fb463acff145cb7609e0a2e728d12a0c4ae9cf77ed2201cde
Image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
Image ID: nvcr.io/nvidia/k8s-device-plugin@sha256:0585da349f3cdca29747834e39ada56aed5e23ba363908fc526474d25aa61a75
Port:
...
Warning BackOff 3m20s (x25802 over 3d21h) kubelet Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04.1
- Kernel Version: 5.15.0-102-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd version 2
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s v1.26.15
  kubectl version
  WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
  Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T01:05:39Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
  Kustomize Version: v4.5.7
  Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15", GitCommit:"1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b", GitTreeState:"clean", BuildDate:"2024-03-14T00:54:27Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
- GPU Operator Version: not used (standalone device plugin); nvidia-smi on the server reports NVIDIA-SMI 470.141.10, Driver Version: 470.141.10, CUDA Version: 11.4
@A-Akhil What does the log message from the device-plugin pod say? Also, please report this here, since this is a standalone device-plugin installation.
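(For example, something like the following should show the logs; the pod name is taken from the output above, and --previous only applies once the container has already restarted:)

# Current container logs
kubectl logs -n kube-system nvidia-device-plugin-1712918682-bs4vf
# Logs from the previous (crashed) container instance
kubectl logs -n kube-system nvidia-device-plugin-1712918682-bs4vf --previous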
@shivamerla
This is the command we used to install the nvidia-device-plugin using Helm:
helm install \
  --version=0.15.0-rc.2 \
  --generate-name \
  --namespace kube-system \
  --create-namespace \
  --set migStrategy=single \
  nvdp/nvidia-device-plugin
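(The command above assumes the nvdp chart repository was already added; if not, the plugin README's quick-start adds it roughly like this:)

# Add and refresh the device-plugin chart repo (the repo name "nvdp" matches the install command above)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update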
This is the log:
root@sybsramma-virtual-machine:~# kubectl logs nvidia-device-plugin-1712918682-bs4vf -n kube-system
I0417 03:40:28.204942 1 main.go:178] Starting FS watcher.
I0417 03:40:28.205038 1 main.go:185] Starting OS watcher.
I0417 03:40:28.205179 1 main.go:200] Starting Plugins.
I0417 03:40:28.205216 1 main.go:257] Loading configuration.
I0417 03:40:28.205615 1 main.go:265] Updating config with default resource matching patterns.
I0417 03:40:28.205917 1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [ "envvar" ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ],
    "mig": [ { "pattern": "*", "name": "nvidia.com/gpu" } ]
  },
  "sharing": { "timeSlicing": {} }
}
I0417 03:40:28.205925 1 main.go:279] Retrieving plugins.
W0417 03:40:28.205971 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0417 03:40:28.205998 1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0417 03:40:28.206027 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0417 03:40:28.206033 1 factory.go:112] Incompatible platform detected
E0417 03:40:28.206037 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0417 03:40:28.206041 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0417 03:40:28.206046 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0417 03:40:28.206051 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0417 03:40:28.213668 1 main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
root@sybsramma-virtual-machine:~#
And this is the describe output:
root@sybsramma-virtual-machine:~# kubectl describe pod nvidia-device-plugin-1712918682-bs4vf -n kube-system
Name: nvidia-device-plugin-1712918682-bs4vf
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: default
Node: dgxa100/172.16.0.32
Start Time: Fri, 12 Apr 2024 16:16:59 +0530
Labels: app.kubernetes.io/instance=nvidia-device-plugin-1712918682
app.kubernetes.io/name=nvidia-device-plugin
controller-revision-hash=665b565fc7
pod-template-generation=1
Annotations:
...
Warning BackOff 4m13s (x31185 over 4d16h) kubelet Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-1712918682-bs4vf_kube-system(7b6ec6ee-2aed-41b2-8b69-2975749172ec)
root@sybsramma-virtual-machine:~#
This is the documentation I used to install the plugin. In it, we chose MIG enabled with the same instance type (migStrategy=single) for the Helm install.
Are the NVIDIA drivers and the container toolkit set up correctly on the node?
@shivamerla Yes, it's set up properly.
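For reference, the "could not load NVML library: libnvidia-ml.so.1" message in the plugin log usually means the container was not started through the NVIDIA runtime. A minimal check-and-fix sketch on the DGX node, assuming containerd and a standard NVIDIA Container Toolkit install (commands as in the toolkit docs; verify flags against your toolkit version and adjust the pod selector to your setup):

# 1. Confirm the driver is working on the host
nvidia-smi

# 2. Confirm containerd knows about the nvidia runtime (and ideally has it as the default)
grep -n "nvidia" /etc/containerd/config.toml

# 3. If it is missing, let the toolkit rewrite the config and make nvidia the default runtime,
#    then restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

# 4. Recreate the plugin pod so it starts under the nvidia runtime
#    (label taken from the pod description above)
kubectl delete pod -n kube-system -l app.kubernetes.io/name=nvidia-device-plugin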
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
This issue has been open for over 90 days without recent updates, and the context may now be outdated.
Given that this issue is 1.5 years old and there have been no updates since then, I would encourage you to try the latest version and see if you still see this issue.
If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.