nvidia-device-plugin getting CrashLoopBackOff when installing via Helm
1. Issue or feature description
I created a multi-node k0s Kubernetes cluster following this blog: https://www.padok.fr/en/blog/k0s-kubernetes-gpu. But when I deploy nvidia-device-plugin via Helm (https://github.com/NVIDIA/k8s-device-plugin#deployment-via-helm), the pod ends up in CrashLoopBackOff:
kube-system calico-kube-controllers-555bc4b957-q99v4 1/1 Running 34 (3d6h ago) 25d
kube-system calico-node-btnm7 1/1 Running 6 (3d6h ago) 25d
kube-system calico-node-hqtr4 1/1 Running 3 (7d2h ago) 25d
kube-system coredns-ddddfbd5c-jnxwm 1/1 Running 6 (3d6h ago) 25d
kube-system coredns-ddddfbd5c-pwgqd 1/1 Running 6 (3d6h ago) 25d
kube-system konnectivity-agent-bckg8 1/1 Running 2 (7d2h ago) 19d
kube-system konnectivity-agent-kvml7 1/1 Running 1 (3d6h ago) 7d2h
kube-system kube-proxy-5mbz6 1/1 Running 6 (3d6h ago) 25d
kube-system kube-proxy-rts4r 1/1 Running 3 (7d2h ago) 25d
kube-system metrics-server-7d7c4887f4-8tlt7 1/1 Running 8 (3d6h ago) 25d
nvidia-device-plugin nvdp-nvidia-device-plugin-2plxw 1/2 CrashLoopBackOff 2800 (
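The pod shows 1/2 ready, so one of its containers keeps failing. The report doesn't include logs, but the usual way to find the crash reason would be something like the following (standard kubectl diagnostics, not from the original issue; substitute your own pod name):

```shell
# Show events and the last terminated state of the crashing pod
kubectl -n nvidia-device-plugin describe pod nvdp-nvidia-device-plugin-2plxw

# Logs from all containers, including the previous (crashed) run
kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-2plxw --all-containers
kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-2plxw --all-containers --previous
```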
2. Steps to reproduce the issue
I have followed this blog https://www.padok.fr/en/blog/k0s-kubernetes-gpu
Download k0s binary:

curl -L "https://github.com/k0sproject/k0s/releases/download/v1.24.4%2Bk0s.0/k0s-v1.24.4+k0s.0-amd64" -o /tmp/k0s
chmod +x /tmp/k0s

Download k0sctl binary:

curl -L "https://github.com/k0sproject/k0sctl/releases/download/v0.13.2/k0sctl-linux-x64" -o /usr/local/bin/k0sctl
chmod +x /usr/local/bin/k0sctl
Then you need to create a k0sctl.yaml config file for a multi-node Kubernetes cluster.
k0sctl.yaml file
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  hosts:
  - role: controller
    localhost:
      enabled: true
    files:
    - name: containerd-config
      src: /tmp/containerd.toml
      dstDir: /etc/k0s/
      perm: "0755"
      dirPerm: null
  - role: worker
    ssh:
      address: 43.88.62.134
      user: user
      keyPath: .ssh/id_rsa
    files:
    - name: containerd-config
      src: /tmp/containerd.toml
      dstDir: /etc/k0s/
      perm: "0755"
      dirPerm: null
  - role: worker
    ssh:
      address: 43.88.62.133
      user: user
      keyPath: .ssh/id_rsa
    files:
    - name: containerd-config
      src: /tmp/containerd.toml
      dstDir: /etc/k0s/
      perm: "0755"
      dirPerm: null
  k0s:
    version: 1.24.4+k0s.0
    config:
      spec:
        network:
          provider: calico
/tmp/containerd.toml file
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
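For reference, once the GPU Operator's container toolkit patches this file, the runtimes table should gain an nvidia entry roughly like the following. This is a sketch based on the toolkit's documented behavior, not output captured from this cluster; the binary path may differ on your hosts:

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  # set because CONTAINERD_SET_AS_DEFAULT is "true" in values.yaml
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

If this entry never appears in /etc/k0s/containerd.toml, the toolkit container is likely pointed at the wrong CONTAINERD_CONFIG or CONTAINERD_SOCKET path.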
Then run the command: k0sctl apply --config /path/to/k0sctl.yaml
Deploy NVIDIA GPU Operator
values.yaml file
operator:
  defaultRuntime: containerd

toolkit:
  version: v1.10.0-ubuntu20.04
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/k0s/containerd.toml
  - name: CONTAINERD_SOCKET
    value: /run/k0s/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"

driver:
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.4.0
    imagePullPolicy: IfNotPresent
    env:
    - name: ENABLE_AUTO_DRAIN
      value: "true"
    - name: DRAIN_USE_FORCE
      value: "true"
    - name: DRAIN_POD_SELECTOR_LABEL
      value: ""
    - name: DRAIN_TIMEOUT_SECONDS
      value: "0s"
    - name: DRAIN_DELETE_EMPTYDIR_DATA
      value: "true"
  repoConfig:
    configMapName: repo-config
  version: "495.29.05"

validator:
  version: "v1.11.0"
Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh
Now, add the NVIDIA Helm repository:
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update
helm install --wait --generate-name \
  nvidia/gpu-operator

helm upgrade --install --namespace=gpu-operator --create-namespace --wait \
  --values=values.yaml gpu-operator nvidia/gpu-operator
I'm confused. You are deploying both the operator and the k8s-device-plugin? The operator deploys the plugin as part of its installation. Is there a reason you are trying to deploy the plugin separately?
I'm getting an error while installing the operator:

Error: template: gpu-operator/templates/clusterpolicy.yaml:62:18: executing "gpu-operator/templates/clusterpolicy.yaml" at <.Values.validator.repository>: nil pointer evaluating interface {}.repository
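The template fails because .Values.validator.repository resolves to nil at render time, i.e. the chart being rendered has no repository default for the validator. A possible workaround is to spell the validator image out fully in values.yaml instead of setting only its version. The repository and image name below are the v1.11 chart defaults, an assumption on my part; verify them against your chart version with helm show values nvidia/gpu-operator:

```yaml
validator:
  repository: nvcr.io/nvidia/cloud-native
  image: gpu-operator-validator
  version: "v1.11.0"
```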