k8s-device-plugin

How to use the device plugin with new k8s 1.24 version?

Zigko opened this issue 2 years ago • 22 comments

I'm using the new Kubernetes version (1.24) with containerd instead of the Docker engine, and because of that I can't deploy the NVIDIA device plugin daemonset. So when will this tool be supported with other container runtimes?

Zigko avatar May 06 '22 10:05 Zigko

The plugin is independent of any container runtime, but you need to make sure containerd itself is configured to use the nvidia-container-runtime under the hood.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#containerd
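Roughly, the steps on a containerd node look something like the following (a sketch assuming a Debian/Ubuntu host with the NVIDIA package repository already set up as in the guide above; package names and paths can differ between toolkit versions):

# Install the toolkit that ships the nvidia-container-runtime wrapper
# (package layout differs slightly between toolkit releases).
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Register an "nvidia" runtime with containerd's CRI plugin and make it the
# default runtime (see the config.toml diff later in this thread), then
# restart containerd so the change takes effect.
sudo vi /etc/containerd/config.toml
sudo systemctl restart containerd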

klueska avatar May 06 '22 11:05 klueska

I have installed the NVIDIA Container Toolkit, but I still can't use the device plugin and I don't understand why.

root@debian:~# ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi
Fri May  6 15:28:22 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 29%   34C    P8    15W / 125W |      1MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I tried to deploy it with Helm but the daemonset won't start.

root@debian:~# kubectl get daemonset -A
NAMESPACE     NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node                       1         1         1       1            1           kubernetes.io/os=linux   37m
kube-system   kube-proxy                        1         1         1       1            1           kubernetes.io/os=linux   38m
kube-system   nvidia-device-plugin-1651848989   0         0         0       0            0           <none>                   36m

From kubectl describe node:

System Info:
  Boot ID:                    533b8601-234c-4992-885c-718211fd4570
  Kernel Version:             5.10.0-13-amd64
  OS Image:                   Debian GNU/Linux 11 (bullseye)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.4
  Kubelet Version:            v1.24.0
  Kube-Proxy Version:         v1.24.0

Zigko avatar May 06 '22 15:05 Zigko

Sorry, I sent you the wrong link. Look for the documentation on that same website for configuring containerd for Kubernetes. You need to configure its CRI plugin to be aware of the NVIDIA container runtime. I’m not in front of my computer so I don’t have the direct link at hand unfortunately.

klueska avatar May 06 '22 16:05 klueska

You mean this part: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#step-4-setup-nvidia-software

I have changed the config.toml to match the configuration from the website and it still doesn't work.

Zigko avatar May 06 '22 16:05 Zigko

any update?
https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#step-4-setup-nvidia-software

I have changed config.toml and it still doesn't work. Log:

2022/05/07 09:35:20 Loading NVML
2022/05/07 09:35:20 Failed to initialize NVML: could not load NVML library.
2022/05/07 09:35:20 If this is a GPU node, did you set the docker default runtime to nvidia?
2022/05/07 09:35:20 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/05/07 09:35:20 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/05/07 09:35:20 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
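A quick sanity check at this point (a sketch, assuming kubectl access; <gpu-node> is a placeholder) is to confirm which runtime the kubelet on the GPU node actually reports and whether the node advertises any nvidia.com/gpu capacity yet:

# The CONTAINER-RUNTIME column should show containerd://..., not docker://...
kubectl get nodes -o wide

# Once the plugin registers successfully, nvidia.com/gpu appears under Capacity.
kubectl describe node <gpu-node> | grep -A7 -i capacity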

FANHIDE avatar May 07 '22 10:05 FANHIDE

Good evening. Similar situation: it does not work on k8s version 1.24 and reports 0 of 0 GPUs, but I am on Docker.

am0ral93 avatar Jun 08 '22 15:06 am0ral93

@Zigko Have you changed /etc/containerd/config.toml as below and restarted containerd?

--- config.toml 2020-12-17 19:13:03.242630735 +0000
+++ /etc/containerd/config.toml 2020-12-17 19:27:02.019027793 +0000
@@ -70,7 +70,7 @@
   ignore_image_defined_volumes = false
   [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
-      default_runtime_name = "runc"
+      default_runtime_name = "nvidia"
      no_pivot = false
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
@@ -94,6 +94,15 @@
         privileged_without_host_devices = false
         base_runtime_spec = ""
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
+            SystemdCgroup = true
+       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+          privileged_without_host_devices = false
+          runtime_engine = ""
+          runtime_root = ""
+          runtime_type = "io.containerd.runc.v1"
+          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+            BinaryName = "/usr/bin/nvidia-container-runtime"
+            SystemdCgroup = true
   [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
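After a change like this, containerd has to be restarted and the plugin pods recreated before the node advertises any GPUs. A possible verification sequence (a sketch; the daemonset name is the Helm release shown earlier in this thread, and <gpu-node> is a placeholder):

# Pick up the new default runtime.
sudo systemctl restart containerd

# The merged config should now show default_runtime_name = "nvidia"
# and the nvidia runtime stanza.
sudo containerd config dump | grep -E 'default_runtime_name|nvidia'

# Recreate the device plugin pods, then check the node's resources.
kubectl -n kube-system rollout restart daemonset/nvidia-device-plugin-1651848989
kubectl describe node <gpu-node> | grep -i nvidia.com/gpu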

klueska avatar Jun 08 '22 16:06 klueska

I have the following question. I run some of the machines on k8s version 1.23 and one on 1.24. Earlier, you told me that Docker no longer works on 1.24, only containerd. Should all machines be upgraded from 1.23.3 to 1.24.4 and the Docker daemon disabled on them? Or is it enough to just install 1.24.4 and keep the containerd settings? Either way it shows 0-0 GPU.

am0ral93 avatar Jun 14 '22 17:06 am0ral93

sudo ctr run --rm -t docker.io/library/hello-world:latest hello-world

Hello from Docker! This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
  3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
  4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

To try something more ambitious, you can run an Ubuntu container with: $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID: https://hub.docker.com/

For more examples and ideas, visit: https://docs.docker.com/get-started/

test passes for me, but it still displays 0-0 :(

kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com/gpu:?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t

Node   Available(GPUs)  Used(GPUs)
vpc1   1                1
vpc11  0                0
vpc2   2                1
vpc3   0                0
vpc4   2                1

kubectl get nodes

NAME    STATUS     ROLES                  AGE    VERSION
vpc1    Ready      control-plane,master   125d   v1.23.3
vpc11   NotReady   <none>                 21m    v1.23.4
vpc2    Ready      control-plane,master   125d   v1.23.3
vpc3    Ready      <none>                 7d4h   v1.24.1
vpc4    Ready      control-plane,master   124d   v1.23.3

am0ral93 avatar Jun 14 '22 18:06 am0ral93

Hello. At first I couldn't use the device plugin on version 1.24.0, so I went back to version 1.23.6, where I was already using containerd. Some weeks later I updated to the newest version at the time (1.24.1) and the device plugin works fine for me with containerd. If you want to keep using Docker, stay on a 1.23 version, but if you're going to update the cluster I recommend removing Docker and using containerd.
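On a kubeadm-managed node, dropping Docker also means pointing the kubelet at containerd's CRI socket. A rough sketch (file locations and the socket path are assumptions for a default kubeadm/containerd install):

# 1. Make the kubelet talk to containerd instead of dockershim, e.g. via
#    /var/lib/kubelet/kubeadm-flags.env on kubeadm nodes:
#      --container-runtime-endpoint=unix:///run/containerd/containerd.sock
sudo vi /var/lib/kubelet/kubeadm-flags.env
sudo systemctl restart kubelet

# 2. Update the CRI socket annotation kubeadm stores on the node object
#    (the vpc11 output below still points at /var/run/dockershim.sock).
kubectl annotate node <node> --overwrite \
  kubeadm.alpha.kubernetes.io/cri-socket=unix:///run/containerd/containerd.sock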

Zigko avatar Jun 15 '22 13:06 Zigko

I don't know how containerd works with jhub. I tried downgrading these 2 VPCs to version 1.23.3, but on Docker they still show GPU 0-0. I would not complicate things if it worked correctly.

am0ral93 avatar Jun 15 '22 15:06 am0ral93

Have you installed the NVIDIA Container Toolkit and changed the Docker configuration to run with NVIDIA? In /etc/docker/daemon.json:

{
   "default-runtime": "nvidia",
   "runtimes": {
      "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
      }
   }
}
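If that file was just edited, Docker still has to be restarted, and docker info will confirm whether the default runtime actually changed; for example:

sudo systemctl restart docker

# Should list an "nvidia" runtime and "Default Runtime: nvidia".
docker info | grep -i runtime

# Optional smoke test with the image used earlier in this thread.
docker run --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi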

Zigko avatar Jun 15 '22 15:06 Zigko

Have you installed the NVIDIA Container Toolkit and changed the Docker configuration to run with NVIDIA? In /etc/docker/daemon.json:

{
   "default-runtime": "nvidia",
   "runtimes": {
      "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
      }
   }
}

Of course, yes. On 1.23.3 everything worked at the beginning; later I upgraded 2 machines to 1.24.1 and they stopped working. A full reformat and rollback to 1.23.3 didn't fix the issue.

am0ral93 avatar Jun 15 '22 16:06 am0ral93

Can you show the output of kubectl describe node and nvidia-smi?

Zigko avatar Jun 15 '22 16:06 Zigko

Can you show the output of kubectl describe node and nvidia-smi?

kubectl describe node vpc11

Name:               vpc11
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=vpc11
                    kubernetes.io/os=linux
Annotations:        csi.volume.kubernetes.io/nodeid: {"rook-ceph.cephfs.csi.ceph.com":"vpc11","rook-ceph.rbd.csi.ceph.com":"vpc11"}
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.0.70.101/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.65.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 14 Jun 2022 20:43:57 +0300
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  vpc11
  AcquireTime:     <unset>
  RenewTime:       Tue, 14 Jun 2022 20:48:22 +0300
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  NetworkUnavailable   False     Tue, 14 Jun 2022 20:44:02 +0300   Tue, 14 Jun 2022 20:44:02 +0300   CalicoIsUp          Calico is running on this node
  MemoryPressure       Unknown   Tue, 14 Jun 2022 20:47:01 +0300   Tue, 14 Jun 2022 20:49:03 +0300   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 14 Jun 2022 20:47:01 +0300   Tue, 14 Jun 2022 20:49:03 +0300   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Tue, 14 Jun 2022 20:47:01 +0300   Tue, 14 Jun 2022 20:49:03 +0300   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Tue, 14 Jun 2022 20:47:01 +0300   Tue, 14 Jun 2022 20:49:03 +0300   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:  10.0.70.101
  Hostname:    vpc11
Capacity:
  cpu:                12
  ephemeral-storage:  959200352Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131636544Ki
  pods:               110
Allocatable:
  cpu:                12
  ephemeral-storage:  883999042940
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131534144Ki
  pods:               110
System Info:
  Machine ID:                 7041e54cf6e14a00bdc3c03994890003
  System UUID:                d95d7640-8c25-cd49-8464-aa80693c55f7
  Boot ID:                    d4bae2a6-b521-4795-8676-59941f9af0eb
  Kernel Version:             5.4.0-117-generic
  OS Image:                   Ubuntu 20.04.2 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.12
  Kubelet Version:            v1.23.4
  Kube-Proxy Version:         v1.23.4
PodCIDR:                      192.168.4.0/24
PodCIDRs:                     192.168.4.0/24
Non-terminated Pods:          (6 in total)
  Namespace    Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  jhub         continuous-image-puller-mxrbm         0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  kube-system  calico-node-rv7s4                     250m (2%)     0 (0%)      0 (0%)           0 (0%)         24h
  kube-system  kube-proxy-rt5bz                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  kube-system  nvidia-device-plugin-daemonset-srgp4  0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  rook-ceph    csi-cephfsplugin-mjrvx                0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  rook-ceph    csi-rbdplugin-495rc                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  cpu                250m (2%)  0 (0%)
  memory             0 (0%)     0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:

nvidia-smi

Wed Jun 15 20:50:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:15:00.0 Off |                  N/A |
| 27%   63C    P0    71W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
| 42%   72C    P0    73W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

am0ral93 avatar Jun 15 '22 17:06 am0ral93

and logs from nvidia device plugin?

Zigko avatar Jun 15 '22 18:06 Zigko

and logs from nvidia device plugin?

kube-test-container-584687df5f-vwz7k    1/1   Running            4 (24d ago)      85d
nvidia-device-plugin-1655230975-22s26   1/1   Running            0                37h
nvidia-device-plugin-1655230975-jnkgm   1/1   Running            0                37h
nvidia-device-plugin-1655230975-kswvx   1/1   Running            0                37h
nvidia-device-plugin-1655230975-wvvnp   0/1   CrashLoopBackOff   438 (3m9s ago)   37h
proxy-5c9494449-mdbzs                   1/1   Running            0                39h

kubectl logs nvidia-device-plugin-1655230975-wvvnp

2022/06/16 07:21:18 Starting FS watcher.
2022/06/16 07:21:18 Starting OS watcher.
2022/06/16 07:21:18 Starting Plugins.
2022/06/16 07:21:18 Loading configuration.
2022/06/16 07:21:18 Initializing NVML.
2022/06/16 07:21:18 Failed to initialize NVML: could not load NVML library.
2022/06/16 07:21:18 If this is a GPU node, did you set the docker default runtime to nvidia?
2022/06/16 07:21:18 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/06/16 07:21:18 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/06/16 07:21:18 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2022/06/16 07:21:18 Error: error starting plugins: failed to initialize NVML: could not load NVML library

kubectl logs nvidia-device-plugin-1655230975-kswvx

2022/06/14 18:23:11 Starting FS watcher.
2022/06/14 18:23:11 Starting OS watcher.
2022/06/14 18:23:11 Starting Plugins.
2022/06/14 18:23:11 Loading configuration.
2022/06/14 18:23:11 Initializing NVML.
2022/06/14 18:23:11 Updating config with default resource matching patterns.
2022/06/14 18:23:11 Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "plugin": { "passDeviceSpecs": false, "deviceListStrategy": "envvar", "deviceIDStrategy": "uuid" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
2022/06/14 18:23:11 Retreiving plugins.
2022/06/14 18:23:11 Starting GRPC server for 'nvidia.com/gpu'
2022/06/14 18:23:11 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/14 18:23:11 Registered device plugin for 'nvidia.com/gpu' with Kubelet

kubectl logs nvidia-device-plugin-1655230975-jnkgm

2022/06/14 18:23:13 Starting FS watcher.
2022/06/14 18:23:13 Starting OS watcher.
2022/06/14 18:23:13 Starting Plugins.
2022/06/14 18:23:13 Loading configuration.
2022/06/14 18:23:13 Initializing NVML.
2022/06/14 18:23:13 Updating config with default resource matching patterns.
2022/06/14 18:23:13 Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "plugin": { "passDeviceSpecs": false, "deviceListStrategy": "envvar", "deviceIDStrategy": "uuid" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
2022/06/14 18:23:13 Retreiving plugins.
2022/06/14 18:23:13 Starting GRPC server for 'nvidia.com/gpu'
2022/06/14 18:23:13 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/14 18:23:13 Registered device plugin for 'nvidia.com/gpu' with Kubelet

kubectl logs nvidia-device-plugin-1655230975-22s26

2022/06/14 18:23:12 Starting FS watcher.
2022/06/14 18:23:12 Starting OS watcher.
2022/06/14 18:23:12 Starting Plugins.
2022/06/14 18:23:12 Loading configuration.
2022/06/14 18:23:12 Initializing NVML.
2022/06/14 18:23:12 Updating config with default resource matching patterns.
2022/06/14 18:23:12 Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "plugin": { "passDeviceSpecs": false, "deviceListStrategy": "envvar", "deviceIDStrategy": "uuid" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
2022/06/14 18:23:12 Retreiving plugins.
2022/06/14 18:23:12 Starting GRPC server for 'nvidia.com/gpu'
2022/06/14 18:23:12 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/14 18:23:12 Registered device plugin for 'nvidia.com/gpu' with Kubelet

am0ral93 avatar Jun 16 '22 07:06 am0ral93

How many nodes do you have? One of them is missing the NVML library... try to find the solution to that.
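A few checks that might narrow it down on the node running the CrashLoopBackOff pod (a sketch, assuming shell access to that node and that it still runs Docker):

# Which node did the failing pod land on?
kubectl get pod nvidia-device-plugin-1655230975-wvvnp -o wide

# On that node: is the driver and its NVML library present on the host?
nvidia-smi
ldconfig -p | grep libnvidia-ml

# Is the runtime injecting the libraries? With Docker this needs
# "default-runtime": "nvidia" in /etc/docker/daemon.json (see above).
docker info | grep -i 'default runtime'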

Zigko avatar Jun 16 '22 13:06 Zigko

How many nodes do you have? One of them is missing the NVML library... try to find the solution to that.

5 nodes, but for some reason only 4 are visible, and nvidia-smi works on all of them.

am0ral93 avatar Jun 17 '22 07:06 am0ral93

How many nodes do you have? One of them is missing the NVML library... try to find the solution to that.

At this point, I do not understand what to do next.

am0ral93 avatar Jun 17 '22 11:06 am0ral93

#332

zengzhengrong avatar Sep 07 '22 14:09 zengzhengrong

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]