k8s-device-plugin
How to use the device plugin with new k8s 1.24 version?
I'm using the new Kubernetes version (1.24) with containerd instead of the Docker engine, and because of that I can't deploy the NVIDIA device plugin daemonset. When will this tool be supported for other container runtimes?
The plugin is independent of any container runtime, but you need to make sure containerd itself is configured to use the nvidia-container-runtime
under the hood.
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#containerd
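As a quick sanity check (a sketch only; paths assume a default containerd install), you can confirm that containerd's CRI plugin has an nvidia runtime registered and set as the default, then restart containerd:
# Is an "nvidia" runtime registered with containerd's CRI plugin?
grep -A 3 'runtimes.nvidia' /etc/containerd/config.toml
# Which runtime does the CRI plugin use by default? It should be "nvidia" for the device plugin to work.
grep default_runtime_name /etc/containerd/config.toml
# Apply any changes
sudo systemctl restart containerd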
I have installed the NVIDIA Container Toolkit, but I still can't use the device plugin and I don't understand why.
root@debian:~# ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi
Fri May  6 15:28:22 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 29%   34C    P8    15W / 125W |      1MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I tried to deploy it with Helm but the daemonset won't start.
root@debian:~# kubectl get daemonset -A
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system calico-node 1 1 1 1 1 kubernetes.io/os=linux 37m
kube-system kube-proxy 1 1 1 1 1 kubernetes.io/os=linux 38m
kube-system nvidia-device-plugin-1651848989 0 0 0 0 0
From kubectl describe node:
System Info:
  Boot ID:                    533b8601-234c-4992-885c-718211fd4570
  Kernel Version:             5.10.0-13-amd64
  OS Image:                   Debian GNU/Linux 11 (bullseye)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.4
  Kubelet Version:            v1.24.0
  Kube-Proxy Version:         v1.24.0
Sorry, I sent you the wrong link. Look for the documentation on that same website for configuring containerd for Kubernetes. You need to configure its CRI plugin to be aware of the NVIDIA container runtime. I’m not in front of my computer so I don’t have the direct link at hand unfortunately.
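Newer releases of the NVIDIA Container Toolkit also ship an nvidia-ctk helper that can write this CRI configuration for you. Assuming your installed toolkit version includes it, roughly:
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd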
You mean this part: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#step-4-setup-nvidia-software
I have changed config.toml to match the configuration from the website and it still doesn't work.
any update?
https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#step-4-setup-nvidia-software
I have changed config.toml and it still doesn't work.
log:
2022/05/07 09:35:20 Loading NVML
2022/05/07 09:35:20 Failed to initialize NVML: could not load NVML library.
2022/05/07 09:35:20 If this is a GPU node, did you set the docker default runtime to nvidia?
2022/05/07 09:35:20 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/05/07 09:35:20 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/05/07 09:35:20 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
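The "could not load NVML library" message means the plugin container cannot see libnvidia-ml.so.1, which usually indicates it was not started through the NVIDIA runtime. A few host-side checks (a sketch; the library path assumes a Debian/Ubuntu driver package):
# Driver and NVML library present on the host?
nvidia-smi
ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
# NVIDIA runtime binary installed?
which nvidia-container-runtime
# Wired into containerd and set as the default runtime?
grep -E 'default_runtime_name|BinaryName' /etc/containerd/config.toml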
Good evening. I have a similar situation: it does not work on k8s 1.24 and reports 0 of 0 GPUs. But on Docker...
@Zigko Have you changed /etc/containerd/config.toml as below and restarted containerd?
--- config.toml 2020-12-17 19:13:03.242630735 +0000
+++ /etc/containerd/config.toml 2020-12-17 19:27:02.019027793 +0000
@@ -70,7 +70,7 @@
ignore_image_defined_volumes = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
- default_runtime_name = "runc"
+ default_runtime_name = "nvidia"
no_pivot = false
disable_snapshot_annotations = true
discard_unpacked_layers = false
@@ -94,6 +94,15 @@
privileged_without_host_devices = false
base_runtime_spec = ""
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
+ SystemdCgroup = true
+ [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+ privileged_without_host_devices = false
+ runtime_engine = ""
+ runtime_root = ""
+ runtime_type = "io.containerd.runc.v1"
+ [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+ BinaryName = "/usr/bin/nvidia-container-runtime"
+ SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
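After applying a diff like the one above, containerd needs a restart and the plugin pods need to be recreated before nvidia.com/gpu appears on the node. A minimal end-to-end check (a sketch; the daemonset name is taken from the kubectl get daemonset output earlier in this thread, and the image tag is the one already used above):
sudo systemctl restart containerd
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-1651848989

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Once the pod has completed:
kubectl logs gpu-smoke-test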
I have the following question. Some of my machines run k8s 1.23 and one runs 1.24. Earlier you told me that Docker no longer works on 1.24, only containerd. Should I upgrade all machines from 1.23.3 to 1.24.4 and disable the Docker daemon on them? Or is it enough to just install 1.24.4 and keep the containerd settings? Either way it still shows 0-0 GPUs.
sudo ctr run --rm -t docker.io/library/hello-world:latest hello-world
Hello from Docker! This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
- The Docker client contacted the Docker daemon.
- The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
- The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
- The Docker daemon streamed that output to the Docker client, which sent it to your terminal.
To try something more ambitious, you can run an Ubuntu container with: $ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID: https://hub.docker.com/
For more examples and ideas, visit: https://docs.docker.com/get-started/
The test passes for me, but it still displays 0-0 :(
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com/gpu:?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
Node   Available(GPUs)  Used(GPUs)
vpc1   1                1
vpc11  0                0
vpc2   2                1
vpc3   0                0
vpc4   2                1
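A shorter way to get roughly the same per-node view, assuming the plugin advertises the standard nvidia.com/gpu resource (the backslash escapes the dots in the resource name for kubectl's jsonpath):
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'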
kubectl get nodes
NAME STATUS ROLES AGE VERSION
vpc1 Ready control-plane,master 125d v1.23.3
vpc11 NotReady
Hello, at first I couldn't use the device plugin on 1.24.0 and went back to 1.23.6, where I was already using containerd. A few weeks later I tried updating to the newest version at the time (1.24.1) and the device plugin works fine for me with containerd. If you want to keep using Docker, stay on 1.23, but if you're going to update the cluster version I recommend removing Docker and using containerd.
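For a mixed 1.23/1.24 cluster like this one, kubectl can show at a glance which runtime and kubelet version each node is actually running (the CONTAINER-RUNTIME column distinguishes docker:// from containerd://):
kubectl get nodes -o wide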
I don't know how containerd works with jhub. I tried downgrading these 2 VPCs to version 1.23.3, but on Docker they still show GPU 0-0. I would not complicate things if it worked correctly.
Have you installed the NVIDIA Container Toolkit and changed the Docker configuration to run with NVIDIA?
In /etc/docker/daemon.json:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
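After editing /etc/docker/daemon.json, Docker has to be restarted before the change takes effect, and the result can be checked from docker info (a sketch):
sudo systemctl restart docker
docker info | grep -i runtime   # should list nvidia and show it as the default runtime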
Of course, yes. On 1.23.3 everything worked at first; later I upgraded 2 machines to 1.24.1 and they stopped working. Even a full reformat of the machines and a rollback to 1.23.3 didn't fix the issue.
Can you show the output of kubectl describe node and nvidia-smi?
kubectl describe node vpc11
Name: vpc11
Roles:
Conditions:
  NetworkUnavailable   False     Tue, 14 Jun 2022 20:44:02 +0300   Tue, 14 Jun 2022 20:44:02 +0300   CalicoIsUp          Calico is running on this node
  MemoryPressure       Unknown   Tue, 14 Jun 2022 20:47:01 +0300   Tue, 14 Jun 2022 20:49:03 +0300   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 14 Jun 2022 20:47:01 +0300   Tue, 14 Jun 2022 20:49:03 +0300   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Tue, 14 Jun 2022 20:47:01 +0300   Tue, 14 Jun 2022 20:49:03 +0300   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Tue, 14 Jun 2022 20:47:01 +0300   Tue, 14 Jun 2022 20:49:03 +0300   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:  10.0.70.101
  Hostname:    vpc11
Capacity:
  cpu:                12
  ephemeral-storage:  959200352Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131636544Ki
  pods:               110
Allocatable:
  cpu:                12
  ephemeral-storage:  883999042940
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131534144Ki
  pods:               110
System Info:
  Machine ID:                 7041e54cf6e14a00bdc3c03994890003
  System UUID:                d95d7640-8c25-cd49-8464-aa80693c55f7
  Boot ID:                    d4bae2a6-b521-4795-8676-59941f9af0eb
  Kernel Version:             5.4.0-117-generic
  OS Image:                   Ubuntu 20.04.2 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.12
  Kubelet Version:            v1.23.4
  Kube-Proxy Version:         v1.23.4
PodCIDR:   192.168.4.0/24
PodCIDRs:  192.168.4.0/24
Non-terminated Pods:  (6 in total)
  Namespace    Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  jhub         continuous-image-puller-mxrbm         0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  kube-system  calico-node-rv7s4                     250m (2%)     0 (0%)      0 (0%)           0 (0%)         24h
  kube-system  kube-proxy-rt5bz                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  kube-system  nvidia-device-plugin-daemonset-srgp4  0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  rook-ceph    csi-cephfsplugin-mjrvx                0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  rook-ceph    csi-rbdplugin-495rc                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  cpu                250m (2%)  0 (0%)
  memory             0 (0%)     0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:
nvidia-smi
Wed Jun 15 20:50:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:15:00.0 Off |                  N/A |
| 27%   63C    P0    71W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
| 42%   72C    P0    73W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
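Since nvidia-smi works on the host, the next thing to rule out is the Docker runtime wiring on that node. A container-level check, reusing the CUDA image from earlier in this thread (a sketch; any CUDA base image compatible with the installed driver should do):
sudo docker run --rm --gpus all docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

# If "default-runtime": "nvidia" is being picked up, the GPUs should also appear
# without --gpus, because the CUDA images request all GPUs via NVIDIA_VISIBLE_DEVICES:
sudo docker run --rm docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi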
And the logs from the nvidia device plugin?
kube-test-container-584687df5f-vwz7k    1/1   Running            4 (24d ago)      85d
nvidia-device-plugin-1655230975-22s26   1/1   Running            0                37h
nvidia-device-plugin-1655230975-jnkgm   1/1   Running            0                37h
nvidia-device-plugin-1655230975-kswvx   1/1   Running            0                37h
nvidia-device-plugin-1655230975-wvvnp   0/1   CrashLoopBackOff   438 (3m9s ago)   37h
proxy-5c9494449-mdbzs                   1/1   Running            0                39h
kubectl logs nvidia-device-plugin-1655230975-wvvnp
2022/06/16 07:21:18 Starting FS watcher.
2022/06/16 07:21:18 Starting OS watcher.
2022/06/16 07:21:18 Starting Plugins.
2022/06/16 07:21:18 Loading configuration.
2022/06/16 07:21:18 Initializing NVML.
2022/06/16 07:21:18 Failed to initialize NVML: could not load NVML library.
2022/06/16 07:21:18 If this is a GPU node, did you set the docker default runtime to nvidia
?
2022/06/16 07:21:18 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/06/16 07:21:18 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/06/16 07:21:18 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2022/06/16 07:21:18 Error: error starting plugins: failed to initialize NVML: could not load NVML library
kubectl logs nvidia-device-plugin-1655230975-kswvx
2022/06/14 18:23:11 Starting FS watcher.
2022/06/14 18:23:11 Starting OS watcher.
2022/06/14 18:23:11 Starting Plugins.
2022/06/14 18:23:11 Loading configuration.
2022/06/14 18:23:11 Initializing NVML.
2022/06/14 18:23:11 Updating config with default resource matching patterns.
2022/06/14 18:23:11 Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "plugin": { "passDeviceSpecs": false, "deviceListStrategy": "envvar", "deviceIDStrategy": "uuid" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
2022/06/14 18:23:11 Retreiving plugins.
2022/06/14 18:23:11 Starting GRPC server for 'nvidia.com/gpu'
2022/06/14 18:23:11 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/14 18:23:11 Registered device plugin for 'nvidia.com/gpu' with Kubelet
kubectl logs nvidia-device-plugin-1655230975-jnkgm
2022/06/14 18:23:13 Starting FS watcher.
2022/06/14 18:23:13 Starting OS watcher.
2022/06/14 18:23:13 Starting Plugins.
2022/06/14 18:23:13 Loading configuration.
2022/06/14 18:23:13 Initializing NVML.
2022/06/14 18:23:13 Updating config with default resource matching patterns.
2022/06/14 18:23:13 Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "plugin": { "passDeviceSpecs": false, "deviceListStrategy": "envvar", "deviceIDStrategy": "uuid" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
2022/06/14 18:23:13 Retreiving plugins.
2022/06/14 18:23:13 Starting GRPC server for 'nvidia.com/gpu'
2022/06/14 18:23:13 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/14 18:23:13 Registered device plugin for 'nvidia.com/gpu' with Kubelet
kubectl logs nvidia-device-plugin-1655230975-22s26
2022/06/14 18:23:12 Starting FS watcher.
2022/06/14 18:23:12 Starting OS watcher.
2022/06/14 18:23:12 Starting Plugins.
2022/06/14 18:23:12 Loading configuration.
2022/06/14 18:23:12 Initializing NVML.
2022/06/14 18:23:12 Updating config with default resource matching patterns.
2022/06/14 18:23:12 Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "plugin": { "passDeviceSpecs": false, "deviceListStrategy": "envvar", "deviceIDStrategy": "uuid" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
2022/06/14 18:23:12 Retreiving plugins.
2022/06/14 18:23:12 Starting GRPC server for 'nvidia.com/gpu'
2022/06/14 18:23:12 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/14 18:23:12 Registered device plugin for 'nvidia.com/gpu' with Kubelet
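The three healthy pods register nvidia.com/gpu with the kubelet; the CrashLoopBackOff pod is the one worth chasing. A quick way to see which node it landed on and then inspect that node (a sketch; the library path assumes an Ubuntu driver package):
kubectl get pods -o wide | grep nvidia-device-plugin   # the NODE column shows where each pod runs

# Then, on the node hosting the failing pod:
nvidia-smi
ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
cat /etc/docker/daemon.json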
How many nodes do you have? One of them is missing the NVML library... try to find the solution to that.
5 nodes, but for some reason only 4 are visible, and nvidia-smi works on all of them.
At this point I don't understand what to do next.
#332
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.