k8s-device-plugin
Container fails to initialize NVML even after setting default docker runtime=nvidia
1. Issue or feature description
Deploying the nvidia-device-plugin daemonset reports that the container fails to initialize NVML, even after setting the default Docker runtime to nvidia.
2. Steps to reproduce the issue
- Installed nvidia-docker2
sudo apt-get update && sudo apt-get install -y nvidia-docker2
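As a quick sanity check (not part of the original report) that the package installed the runtime binary referenced in daemon.json below, you can run:
which nvidia-container-runtime   # should print /usr/bin/nvidia-container-runtime
nvidia-container-cli -V          # prints the libnvidia-container version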
- Configured /etc/docker/daemon.json:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2"
}
- Restart docker daemon
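One way to restart Docker and confirm the default runtime was picked up (the CUDA image tag is only an example; any CUDA base image should behave the same once the nvidia runtime is the default):
sudo systemctl restart docker
docker info | grep -i "default runtime"            # expect: Default Runtime: nvidia
docker run --rm nvidia/cuda:10.2-base nvidia-smi   # should list the host GPU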
- Installed minikube (as per https://kubernetes.io/docs/tasks/tools/install-minikube/)
- Start minikube kubernetes
minikube start --kubernetes-version='1.15.2' --force-systemd=true
- Deploy nvidia-device-plugin-daemonset
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/deployments/static/nvidia-device-plugin.yml
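To confirm the DaemonSet object itself was created before checking pods (object name taken from the manifest above; kube-system namespace assumed):
kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset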
- Validate nvidia-device-plugin pod
kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
default       cuda-vector-add                        0/1     Pending   0          94m
default       hello-minikube-6fb6cb79cc-7b4b4        1/1     Running   1          47m
kube-system   coredns-5c98db65d4-r2slv               1/1     Running   4          97m
kube-system   coredns-5c98db65d4-z7tgt               1/1     Running   4          97m
kube-system   etcd-minikube                          1/1     Running   3          97m
kube-system   kube-apiserver-minikube                1/1     Running   3          96m
kube-system   kube-controller-manager-minikube       1/1     Running   2          70m
kube-system   kube-proxy-d4zl6                       1/1     Running   3          97m
kube-system   kube-scheduler-minikube                1/1     Running   3          97m
kube-system   nvidia-device-plugin-daemonset-62wnm   1/1     Running   1          29m
kube-system   storage-provisioner                    1/1     Running   7          98m
- However, the logs report that the pod cannot initialize the NVML library:
kubectl logs nvidia-device-plugin-daemonset-62wnm -n kube-system
2020/07/01 17:04:08 Loading NVML
2020/07/01 17:04:08 Failed to initialize NVML: could not load NVML library.
2020/07/01 17:04:08 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/07/01 17:04:08 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/07/01 17:04:08 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME       GPU
minikube   <none>
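Since the plugin's error is specifically about loading the NVML library, a targeted host-side check (a suggestion, not from the original report; the image tag is an example) is to look for libnvidia-ml inside a container launched with the default runtime:
docker run --rm nvidia/cuda:10.2-base sh -c 'ldconfig -p | grep libnvidia-ml'
An empty result means the nvidia runtime did not inject the driver libraries, which is exactly what would produce the "could not load NVML library" message above.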
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [x] The output of nvidia-smi -a on your host
==============NVSMI LOG==============
Timestamp : Thu Jul 2 01:30:38 2020
Driver Version : 440.100
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : Quadro RTX 5000
Product Brand : Quadro
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-0c0357fc-c068-1a86-4f65-1943a24500bc
Minor Number : 0
VBIOS Version : 90.04.52.00.36
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1EB510DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x09271028
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 1000 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16125 MiB
Used : 223 MiB
Free : 15902 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 3 %
Memory : 1 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 51 C
GPU Shutdown Temp : 102 C
GPU Slowdown Temp : 97 C
GPU Max Operating Temp : 87 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : N/A
Power Draw : 8.63 W
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : 1035 MHz
Memory : 7001 MHz
Default Applications Clocks
Graphics : 1035 MHz
Memory : 7001 MHz
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 7001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 2120
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 129 MiB
Process ID : 2255
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 92 MiB
- [x] Your docker configuration file (e.g. /etc/docker/daemon.json)
- [x] The k8s-device-plugin container logs
- [x] The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jul 02 01:34:25 joshua-Precision-7740 kubelet[27518]: F0702 01:34:25.894995 27518 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 359.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jul 02 01:34:15 joshua-Precision-7740 kubelet[27409]: F0702 01:34:15.650766 27409 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 358.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Additional information that might help better understand your environment and reproduce the bug:
- [x] Docker version from docker version
Client: Docker Engine - Community
Version: 19.03.11
API version: 1.40
Go version: go1.13.10
Git commit: 42e35e61f3
Built: Mon Jun 1 09:12:22 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.11
API version: 1.40 (minimum version 1.12)
Go version: go1.13.10
Git commit: 42e35e61f3
Built: Mon Jun 1 09:10:54 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.13
GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
nvidia:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
- [ ] Docker command, image and tag used
- [ ] Kernel version from uname -a
- [ ] Any relevant kernel output lines from dmesg
- [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- [ ] NVIDIA container library version from nvidia-container-cli -V
- [ ] NVIDIA container library logs (see troubleshooting)
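For reference, the unchecked items above correspond to commands along these lines (Debian/Ubuntu variants shown; the rpm form applies to RHEL-like systems):
uname -a
dmesg | grep -i nvidia
dpkg -l '*nvidia*'    # or: rpm -qa '*nvidia*'
nvidia-container-cli -V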
I see you are using minikube. Did you set it up for GPU passthrough? https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/
Because this was a host-system installation of Docker and the NVIDIA driver, I followed the steps for "driver=none": https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/#using-the-none-driver
Steps indicated:
- Install minikube (as indicated above)
- Install the NVIDIA driver (apt-get install nvidia-driver-440) and nvidia-docker (apt-get install nvidia-docker2), and configure Docker with nvidia as the default runtime (/etc/docker/daemon.json). See instructions at https://github.com/NVIDIA/nvidia-docker
- Start minikube (step 5 above) & deploy daemonset (step 6 above)
I'm also seeing this error with my microk8s cluster. /etc/docker/daemon.json:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-sbqjz
2020/10/14 18:04:53 Loading NVML
2020/10/14 18:04:53 Failed to initialize NVML: could not load NVML library.
2020/10/14 18:04:53 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/10/14 18:04:53 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/10/14 18:04:53 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Wed Oct 14 18:07:18 2020
Driver Version : 450.80.02
CUDA Version : 11.0
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce GTX 960M
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-79c1533e-4055-68e6-78a7-fa8fd41f6e10
Minor Number : 0
VBIOS Version : 82.07.7C.00.0C
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x139B10DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x397817AA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : N/A
HW Power Brake Slowdown : N/A
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 4046 MiB
Used : 63 MiB
Free : 3983 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 1 MiB
Free : 255 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 43 C
GPU Shutdown Temp : 101 C
GPU Slowdown Temp : 96 C
GPU Max Operating Temp : 92 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 135 MHz
SM : 135 MHz
Memory : 405 MHz
Video : 405 MHz
Applications Clocks
Graphics : 1097 MHz
Memory : 2505 MHz
Default Applications Clocks
Graphics : 1097 MHz
Memory : 2505 MHz
Max Clocks
Graphics : 1202 MHz
SM : 1202 MHz
Memory : 2505 MHz
Video : 1081 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1405
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 55 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2537
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 6 MiB
I've tried restarting the docker daemon, microk8s, and the machine itself, to no avail :/
As far as I know, microk8s uses standalone containerd (not docker), so you will need to set the default runtime in containerd to nvidia-container-runtime instead of in docker.
Various successful containerd configs that I've seen (i.e. diffs to /etc/containerd/config.toml):
[plugins.linux]
- runtime = "runc"
+ runtime = "nvidia-container-runtime"
--- default_config.toml 2020-09-03 14:50:45.672414075 +0000
+++ /etc/containerd/config.toml 2020-09-03 06:14:41.402469283 +0000
@@ -65,7 +65,7 @@
disable_proc_mount = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
- default_runtime_name = "runc"
+ default_runtime_name = "nvidia-container-runtime"
no_pivot = false
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
runtime_type = ""
@@ -78,6 +78,11 @@
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
+ [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
+ runtime_type = "io.containerd.runtime.v1.linux"
+ runtime_engine = ""
+ runtime_root = ""
+ privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v1"
runtime_engine = ""
@@ -105,7 +110,7 @@
no_prometheus = false
[plugins."io.containerd.runtime.v1.linux"]
shim = "containerd-shim"
- runtime = "runc"
+ runtime = "nvidia-container-runtime"
runtime_root = ""
no_shim = false
shim_debug = false
[plugins.linux]
runtime = "nvidia-container-runtime"
[plugins.cri]
disable_tcp_service = true
[plugins.cri.containerd]
default_runtime_name = "nvidia-container-runtime"
[plugins.cri.containerd.runtimes]
[plugins.cri.containerd.runtimes.nvidia-container-runtime]
runtime_type = "io.containerd.runtime.v1.linux"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
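Whichever variant applies, containerd needs to be restarted for the change to take effect; a likely sequence (adjust for your setup):
sudo systemctl restart containerd     # stock containerd managed by systemd
microk8s stop && microk8s start       # microk8s, where containerd is snap-managed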
Thanks @klueska! I'm a little embarrassed I missed that minor detail that microk8s doesn't use docker...
Here's my current /etc/containerd/config.toml (everything that's not a comment):
[plugins.linux]
runtime = "nvidia-container-runtime"
[plugins.cri]
disable_tcp_service = true
[plugins.cri.containerd]
default_runtime_name = "nvidia-container-runtime"
[plugins.cri.containerd.runtimes]
[plugins.cri.containerd.runtimes.nvidia-container-runtime]
runtime_type = "io.containerd.runtime.v1.linux"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
And here's the output of containerd config dump (sorry, I tried to put this in a <details> but the spacing got jacked up):
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
imports = ["/etc/containerd/config.toml"]
[grpc]
address = "/run/containerd/containerd.sock"
tcp_address = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216
[ttrpc]
address = ""
uid = 0
gid = 0
[debug]
address = ""
uid = 0
gid = 0
level = ""
[metrics]
address = ""
grpc_histogram = false
[cgroup]
path = ""
[timeouts]
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"
[plugins]
[plugins."io.containerd.gc.v1.scheduler"]
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = "0s"
startup_delay = "100ms"
[plugins."io.containerd.grpc.v1.cri"]
disable_tcp_service = true
stream_server_address = "127.0.0.1"
stream_server_port = "0"
stream_idle_timeout = "4h0m0s"
enable_selinux = false
sandbox_image = "k8s.gcr.io/pause:3.1"
stats_collect_period = 10
systemd_cgroup = false
enable_tls_streaming = false
max_container_log_line_size = 16384
disable_cgroup = false
disable_apparmor = false
restrict_oom_score_adj = false
max_concurrent_downloads = 3
disable_proc_mount = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "nvidia-container-runtime"
no_pivot = false
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
runtime_type = "io.containerd.runtime.v1.linux"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v1"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
max_conf_num = 1
conf_template = ""
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://registry-1.docker.io"]
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"
[plugins."io.containerd.internal.v1.restart"]
interval = "10s"
[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.runtime.v1.linux"]
shim = "containerd-shim"
runtime = "nvidia-container-runtime"
runtime_root = ""
no_shim = false
shim_debug = false
[plugins."io.containerd.runtime.v2.task"]
platforms = ["linux/amd64"]
[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]
[plugins."io.containerd.snapshotter.v1.devmapper"]
root_path = ""
pool_name = ""
base_image_size = ""
I'm still seeing the same error as before. It's strange that there's no plugins.cri section in the containerd config dump, but at the same time it seems to have picked up nvidia-container-runtime in several spots. Any thoughts?
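One way to see which runtime the CRI layer is actually configured with is to query it through crictl; the microk8s containerd socket path below is an assumption, so adjust it for your install:
sudo crictl --runtime-endpoint unix:///var/snap/microk8s/common/run/containerd.sock info | grep -i -A3 runtime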
I'm not that familiar with the containerd config file (the nvidia-container-toolkit on which the k8s-device-plugin relies does not yet have official support for containerd). The examples I posted above are from a few cases where I saw people getting it to work. We should be adding official containerd support to the nvidia-container-toolkit soon (i.e. within the next 2 months), but until then I don't have any better suggestions than: play around with the config and see if you can get it to work.
I'd be interested to see what you come up with.
I know @tvansteenburgh from Canonical was playing around with this as well (specifically for the use case of getting microk8s to run with GPUs), but I'm not sure how far along he's gotten.
Just out of curiosity, I assume you already ran:
microk8s enable gpu
It's not strictly necessary, but doing so will set up everything you need (including your containerd config) to allow GPUs to be used.
Part of that is enabling the nvidia-container-runtime as the default runtime in the containerd config used by microk8s: /var/snap/microk8s/current/args/containerd.toml
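A quick read-only check of what that file currently sets (path as given above):
grep -n -i runtime /var/snap/microk8s/current/args/containerd.toml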
I did give microk8s enable gpu a shot; it seems to be roughly equivalent to applying the daemonset config from this repo manually.
My (admittedly not super satisfying) resolution to this was to switch from microk8s to k3s, which can be configured to use docker rather than containerd. Thanks for pointing me in the right direction!
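For anyone following the same route: at the time of this thread, k3s could be pointed at the host Docker daemon at install time; a sketch (verify the flag against your k3s version):
curl -sfL https://get.k3s.io | sh -s - --docker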
Hi there,
Do you happen to be running this on a hybrid GPU setup (e.g. a laptop combining Intel and NVIDIA graphics)?
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.