
nvidia-device-plugin getting CrashLoopBackOff while installing using helm

Open captainsk7 opened this issue 2 years ago • 3 comments

1. Issue or feature description

I created a multi-node k0s Kubernetes cluster following this blog https://www.padok.fr/en/blog/k0s-kubernetes-gpu, but when deploying nvidia-device-plugin via Helm https://github.com/NVIDIA/k8s-device-plugin#deployment-via-helm, the plugin pods end up in CrashLoopBackOff:

```
NAMESPACE              NAME                                       READY   STATUS             RESTARTS         AGE
kube-system            calico-kube-controllers-555bc4b957-q99v4   1/1     Running            34 (3d6h ago)    25d
kube-system            calico-node-btnm7                          1/1     Running            6 (3d6h ago)     25d
kube-system            calico-node-hqtr4                          1/1     Running            3 (7d2h ago)     25d
kube-system            coredns-ddddfbd5c-jnxwm                    1/1     Running            6 (3d6h ago)     25d
kube-system            coredns-ddddfbd5c-pwgqd                    1/1     Running            6 (3d6h ago)     25d
kube-system            konnectivity-agent-bckg8                   1/1     Running            2 (7d2h ago)     19d
kube-system            konnectivity-agent-kvml7                   1/1     Running            1 (3d6h ago)     7d2h
kube-system            kube-proxy-5mbz6                           1/1     Running            6 (3d6h ago)     25d
kube-system            kube-proxy-rts4r                           1/1     Running            3 (7d2h ago)     25d
kube-system            metrics-server-7d7c4887f4-8tlt7            1/1     Running            8 (3d6h ago)     25d
nvidia-device-plugin   nvdp-nvidia-device-plugin-2plxw            1/2     CrashLoopBackOff   2800 ( ago)      12d
nvidia-device-plugin   nvdp-nvidia-device-plugin-qprsf            1/2     CrashLoopBackOff   2784 (94s ago)   12d
nvidia-device-plugin   nvidia-device-plugin-788xx                 0/1     CrashLoopBackOff   3183 ( ago)      13d
nvidia-device-plugin   nvidia-device-plugin-pwj4k                 0/1     CrashLoopBackOff   3168 (24s ago)   13d
```
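To see why the containers are restarting, the usual first step is to pull logs and events from one of the crashing pods (pod names taken from the listing above; `-p` fetches the logs of the previous, crashed container instance):

```shell
# Logs from the crashed container of the device-plugin pod
kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-2plxw -p

# Events and container state, including the exit reason
kubectl -n nvidia-device-plugin describe pod nvdp-nvidia-device-plugin-2plxw
```

With this many restarts the plugin log typically shows why it exits (for example, a failure to load the NVIDIA runtime libraries), which narrows the problem down to the container runtime configuration.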

2. Steps to reproduce the issue

I have followed this blog https://www.padok.fr/en/blog/k0s-kubernetes-gpu

```shell
# Download k0s binary
curl -L "https://github.com/k0sproject/k0s/releases/download/v1.24.4%2Bk0s.0/k0s-v1.24.4+k0s.0-amd64" -o /tmp/k0s
chmod +x /tmp/k0s

# Download k0sctl binary
curl -L "https://github.com/k0sproject/k0sctl/releases/download/v0.13.2/k0sctl-linux-x64" -o /usr/local/bin/k0sctl
chmod +x /usr/local/bin/k0sctl
```

Then you need to create a k0sctl.yaml config file for a multi-node Kubernetes cluster.

k0sctl.yaml file

```yaml
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  hosts:
    - role: controller
      localhost:
        enabled: true
      files:
        - name: containerd-config
          src: /tmp/containerd.toml
          dstDir: /etc/k0s/
          perm: "0755"
          dirPerm: null
    - role: worker
      ssh:
        address: 43.88.62.134
        user: user
        keyPath: .ssh/id_rsa
      files:
        - name: containerd-config
          src: /tmp/containerd.toml
          dstDir: /etc/k0s/
          perm: "0755"
          dirPerm: null
    - role: worker
      ssh:
        address: 43.88.62.133
        user: user
        keyPath: .ssh/id_rsa
      files:
        - name: containerd-config
          src: /tmp/containerd.toml
          dstDir: /etc/k0s/
          perm: "0755"
          dirPerm: null
  k0s:
    version: 1.24.4+k0s.0
    config:
      spec:
        network:
          provider: calico
```

/tmp/containerd.toml file

```toml
version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
```
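Note that this config only defines the `runc` runtime. For GPU pods to start, the NVIDIA container toolkit has to add an `nvidia` runtime entry to this file; once it has done so, the runtimes section should end up looking roughly like the fragment below (the `BinaryName` path is an assumption based on where the GPU Operator's toolkit container typically installs it, so verify it on your nodes):

```toml
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```

If this entry never appears in /etc/k0s/containerd.toml, the `CONTAINERD_CONFIG`/`CONTAINERD_SOCKET` values passed to the toolkit are a good place to look.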

Then run the command:

```shell
k0sctl apply --config /path/to/k0sctl.yaml
```

Deploy NVIDIA GPU Operator

values.yaml file

```yaml
operator:
  defaultRuntime: containerd

toolkit:
  version: v1.10.0-ubuntu20.04
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/k0s/containerd.toml
    - name: CONTAINERD_SOCKET
      value: /run/k0s/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

driver:
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.4.0
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
  repoConfig:
    configMapName: repo-config
  version: "495.29.05"

validator:
  version: "v1.11.0"
```
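As a general Helm workflow (not something from the blog), templating errors caused by a values file can be caught without touching the cluster by rendering the chart locally:

```shell
# Render the chart with the same values file; any template error
# (e.g. a nil pointer on a missing values key) surfaces here.
helm template gpu-operator nvidia/gpu-operator --values values.yaml > /dev/null \
  && echo "chart renders cleanly"
```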

Install Helm

```shell
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh
```

Now, add the NVIDIA Helm repository:

```shell
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update
```

```shell
helm install --wait --generate-name nvidia/gpu-operator
```

```shell
helm upgrade --install --namespace=gpu-operator --create-namespace --wait \
  --values=values.yaml gpu-operator nvidia/gpu-operator
```
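Once the operator's pods come up healthy, a quick smoke test is to schedule a pod that requests a GPU and runs `nvidia-smi`. This manifest is an illustrative sketch, not from the blog; the CUDA image tag and pod name are assumptions, and `runtimeClassName: nvidia` matches the `CONTAINERD_RUNTIME_CLASS` set in values.yaml above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:11.7.1-base-ubuntu20.04  # assumed tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If the pod completes and its logs show the `nvidia-smi` table, the device plugin and runtime wiring are working end to end.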

captainsk7 avatar Jan 02 '23 11:01 captainsk7

I'm confused. You are deploying both the operator and the k8s-device-plugin? The operator deploys the plugin as part of its installation. Is there a reason you are trying to deploy the plugin separately?

klueska avatar Jan 03 '23 12:01 klueska

I'm getting an error while installing the operator:

```
Error: template: gpu-operator/templates/clusterpolicy.yaml:62:18: executing "gpu-operator/templates/clusterpolicy.yaml" at <.Values.validator.repository>: nil pointer evaluating interface {}.repository
```
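For context, this error means the chart's clusterpolicy template dereferences `.Values.validator.repository`, and a values.yaml that only sets `validator.version` leaves that key nil. A fragment along these lines should avoid the nil dereference (the `repository` and `image` values are a guess based on the operator's usual NGC defaults, so check them against the chart version in use):

```yaml
validator:
  repository: nvcr.io/nvidia/cloud-native   # assumed default registry path
  image: gpu-operator-validator             # assumed default image name
  version: "v1.11.0"
```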

captainsk7 avatar Jan 03 '23 17:01 captainsk7

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]