
Container fails to initialize NVML even after setting default docker runtime=nvidia

Open limwenyao opened this issue 5 years ago • 9 comments

1. Issue or feature description

Deploying nvidia-device-plugin-daemonset reports that the container fails to initialize NVML, even after setting the default docker runtime to nvidia

2. Steps to reproduce the issue

  1. Installed nvidia-docker2
sudo apt-get update && sudo apt-get install -y nvidia-docker2
  2. Configure /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2"
}
  3. Restart docker daemon

  4. Installed minikube (as per https://kubernetes.io/docs/tasks/tools/install-minikube/)

  5. Start minikube kubernetes

minikube start --kubernetes-version='1.15.2' --force-systemd=true
  6. Deploy nvidia-device-plugin-daemonset
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/deployments/static/nvidia-device-plugin.yml
  7. Validate nvidia-device-plugin pod
kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
default       cuda-vector-add                        0/1     Pending   0          94m
default       hello-minikube-6fb6cb79cc-7b4b4        1/1     Running   1          47m
kube-system   coredns-5c98db65d4-r2slv               1/1     Running   4          97m
kube-system   coredns-5c98db65d4-z7tgt               1/1     Running   4          97m
kube-system   etcd-minikube                          1/1     Running   3          97m
kube-system   kube-apiserver-minikube                1/1     Running   3          96m
kube-system   kube-controller-manager-minikube       1/1     Running   2          70m
kube-system   kube-proxy-d4zl6                       1/1     Running   3          97m
kube-system   kube-scheduler-minikube                1/1     Running   3          97m
kube-system   nvidia-device-plugin-daemonset-62wnm   1/1     Running   1          29m
kube-system   storage-provisioner                    1/1     Running   7          98m
  8. However, the logs report that the pod cannot initialize the NVML library
kubectl logs nvidia-device-plugin-daemonset-62wnm -n kube-system
2020/07/01 17:04:08 Loading NVML
2020/07/01 17:04:08 Failed to initialize NVML: could not load NVML library.
2020/07/01 17:04:08 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/07/01 17:04:08 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/07/01 17:04:08 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME       GPU
minikube   <none>
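The `<none>` result above can also be extracted programmatically. A minimal sketch (illustrative only; in real use you would feed it the actual output of `kubectl get nodes -o json`, here replaced by an inlined sample):

```python
import json

# Sample shape of `kubectl get nodes -o json`, trimmed to the fields
# used below; a node without GPU resources has no nvidia.com/gpu key.
SAMPLE_NODES = """
{
  "items": [
    {
      "metadata": {"name": "minikube"},
      "status": {"allocatable": {"cpu": "8", "memory": "32594960Ki"}}
    }
  ]
}
"""

def gpu_allocatable(nodes_json: str) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (or '<none>')."""
    items = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]:
            node["status"]["allocatable"].get("nvidia.com/gpu", "<none>")
        for node in items
    }

print(gpu_allocatable(SAMPLE_NODES))  # {'minikube': '<none>'}
```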

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [x] The output of nvidia-smi -a on your host
==============NVSMI LOG==============

Timestamp                           : Thu Jul  2 01:30:38 2020
Driver Version                      : 440.100
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Product Name                    : Quadro RTX 5000
    Product Brand                   : Quadro
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-0c0357fc-c068-1a86-4f65-1943a24500bc
    Minor Number                    : 0
    VBIOS Version                   : 90.04.52.00.36
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1EB510DE
        Bus Id                      : 00000000:01:00.0
        Sub System Id               : 0x09271028
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 1000 KB/s
        Rx Throughput               : 1000 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 16125 MiB
        Used                        : 223 MiB
        Free                        : 15902 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 5 MiB
        Free                        : 251 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 3 %
        Memory                      : 1 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Temperature
        GPU Current Temp            : 51 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 97 C
        GPU Max Operating Temp      : 87 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : N/A
        Power Draw                  : 8.63 W
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : 300 MHz
        SM                          : 300 MHz
        Memory                      : 405 MHz
        Video                       : 540 MHz
    Applications Clocks
        Graphics                    : 1035 MHz
        Memory                      : 7001 MHz
    Default Applications Clocks
        Graphics                    : 1035 MHz
        Memory                      : 7001 MHz
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 7001 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 2120
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 129 MiB
        Process ID                  : 2255
            Type                    : G
            Name                    : /usr/bin/gnome-shell
            Used GPU Memory         : 92 MiB
  • [x] Your docker configuration file (e.g: /etc/docker/daemon.json)
  • [x] The k8s-device-plugin container logs
  • [x] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jul 02 01:34:25 joshua-Precision-7740 kubelet[27518]: F0702 01:34:25.894995   27518 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 359.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jul 02 01:34:15 joshua-Precision-7740 kubelet[27409]: F0702 01:34:15.650766   27409 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 358.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.

Additional information that might help better understand your environment and reproduce the bug:

  • [x] Docker version from docker version
Client: Docker Engine - Community
 Version:           19.03.11
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        42e35e61f3
 Built:             Mon Jun  1 09:12:22 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.11
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       42e35e61f3
  Built:            Mon Jun  1 09:10:54 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 nvidia:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • [ ] Docker command, image and tag used
  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
  • [ ] NVIDIA container library logs (see troubleshooting)

limwenyao avatar Jul 01 '20 17:07 limwenyao

I see you are using minikube. Did you set it up for GPU passthrough? https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/

klueska avatar Jul 01 '20 17:07 klueska

Because docker and the nvidia driver were installed on the host system, I followed the steps for "driver=none": https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/#using-the-none-driver

Steps indicated:

  1. Install minikube (as indicated above)
  2. Install the nvidia driver (apt-get install nvidia-driver-440), nvidia-docker (apt-get install nvidia-docker2), and configure docker with nvidia as the default runtime (/etc/docker/daemon.json). See instructions at https://github.com/NVIDIA/nvidia-docker
  3. Start minikube (step 5 above) & deploy daemonset (step 6 above)
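For step 2, the daemon.json can be sanity-checked with a short script (an illustrative sketch, not part of nvidia-docker or minikube; in real use read /etc/docker/daemon.json instead of the inlined copy):

```python
import json

# The daemon.json from step 2 above, inlined here for the example.
DAEMON_JSON = """
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

def nvidia_is_default(raw: str) -> bool:
    """True only if 'nvidia' is both declared as a runtime and set as default."""
    cfg = json.loads(raw)
    return (cfg.get("default-runtime") == "nvidia"
            and "nvidia" in cfg.get("runtimes", {}))

print(nvidia_is_default(DAEMON_JSON))  # True
```

Note that a correct daemon.json is necessary but not sufficient here: the file only matters if the kubelet is actually using Docker as its container runtime.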

limwenyao avatar Jul 01 '20 17:07 limwenyao

I'm also seeing this error with my microk8s cluster. /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-sbqjz
2020/10/14 18:04:53 Loading NVML
2020/10/14 18:04:53 Failed to initialize NVML: could not load NVML library.
2020/10/14 18:04:53 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/10/14 18:04:53 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/10/14 18:04:53 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
nvidia-smi -a

==============NVSMI LOG==============

Timestamp                           : Wed Oct 14 18:07:18 2020
Driver Version                      : 450.80.02
CUDA Version                        : 11.0

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Product Name                    : GeForce GTX 960M
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    MIG Mode
        Current                     : N/A
        Pending                     : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-79c1533e-4055-68e6-78a7-fa8fd41f6e10
    Minor Number                    : 0
    VBIOS Version                   : 82.07.7C.00.0C
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x139B10DE
        Bus Id                      : 00000000:01:00.0
        Sub System Id               : 0x397817AA
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : N/A
            HW Power Brake Slowdown : N/A
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 4046 MiB
        Used                        : 63 MiB
        Free                        : 3983 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 1 MiB
        Free                        : 255 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Remapped Rows                   : N/A
    Temperature
        GPU Current Temp            : 43 C
        GPU Shutdown Temp           : 101 C
        GPU Slowdown Temp           : 96 C
        GPU Max Operating Temp      : 92 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : 135 MHz
        SM                          : 135 MHz
        Memory                      : 405 MHz
        Video                       : 405 MHz
    Applications Clocks
        Graphics                    : 1097 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 1097 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 1202 MHz
        SM                          : 1202 MHz
        Memory                      : 2505 MHz
        Video                       : 1081 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        GPU instance ID             : N/A
        Compute instance ID         : N/A
        Process ID                  : 1405
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 55 MiB
        GPU instance ID             : N/A
        Compute instance ID         : N/A
        Process ID                  : 2537
            Type                    : G
            Name                    : /usr/bin/gnome-shell
            Used GPU Memory         : 6 MiB

I've tried restarting the docker daemon, microk8s, and the machine itself, to no avail :/

wbadart avatar Oct 14 '20 18:10 wbadart

As far as I know, microk8s uses standalone containerd (not docker), so you will need to set the default runtime in containerd to nvidia-container-runtime instead of in docker.

Various successful containerd configs that I've seen (i.e. diffs to /etc/containerd/config.toml):

 [plugins.linux]
-  runtime = "runc"
+  runtime = "nvidia-container-runtime"

--- default_config.toml  2020-09-03 14:50:45.672414075 +0000
+++ /etc/containerd/config.toml  2020-09-03 06:14:41.402469283 +0000
@@ -65,7 +65,7 @@
     disable_proc_mount = false
     [plugins."io.containerd.grpc.v1.cri".containerd]
       snapshotter = "overlayfs"
-      default_runtime_name = "runc"
+      default_runtime_name = "nvidia-container-runtime"
       no_pivot = false
       [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
         runtime_type = ""
@@ -78,6 +78,11 @@
         runtime_root = ""
         privileged_without_host_devices = false
       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
+        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
+          runtime_type = "io.containerd.runtime.v1.linux"
+          runtime_engine = ""
+          runtime_root = ""
+          privileged_without_host_devices = false
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
           runtime_type = "io.containerd.runc.v1"
           runtime_engine = ""
@@ -105,7 +110,7 @@
     no_prometheus = false
   [plugins."io.containerd.runtime.v1.linux"]
     shim = "containerd-shim"
-    runtime = "runc"
+    runtime = "nvidia-container-runtime"
     runtime_root = ""
     no_shim = false
     shim_debug = false

[plugins.linux]
   runtime = "nvidia-container-runtime"
[plugins.cri]
  disable_tcp_service = true
  [plugins.cri.containerd]
    default_runtime_name = "nvidia-container-runtime"
      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.nvidia-container-runtime]
          runtime_type = "io.containerd.runtime.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false

klueska avatar Oct 16 '20 16:10 klueska

Thanks @klueska! I'm a little embarrassed I missed that minor detail that microk8s doesn't use docker...

Here's my current /etc/containerd/config.toml (everything that's not a comment):

[plugins.linux]
  runtime = "nvidia-container-runtime"
[plugins.cri]
  disable_tcp_service = true
  [plugins.cri.containerd]
    default_runtime_name = "nvidia-container-runtime"
      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.nvidia-container-runtime]
          runtime_type = "io.containerd.runtime.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false

And here's the output of containerd config dump (sorry, I tried to put this in a <details> but the spacing got jacked up):

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
imports = ["/etc/containerd/config.toml"]

[grpc]
  address = "/run/containerd/containerd.sock"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_selinux = false
    sandbox_image = "k8s.gcr.io/pause:3.1"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "nvidia-container-runtime"
      no_pivot = false
      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
          runtime_type = "io.containerd.runtime.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      max_conf_num = 1
      conf_template = ""
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""
  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"
  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false
  [plugins."io.containerd.runtime.v1.linux"]
    shim = "containerd-shim"
    runtime = "nvidia-container-runtime"
    runtime_root = ""
    no_shim = false
    shim_debug = false
  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = ""
    pool_name = ""
    base_image_size = ""

I'm still seeing the same error as before. It's strange that there's no plugins.cri section in containerd config dump, but at the same time it seems to have picked up nvidia-container-runtime in several spots. Any thoughts?
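(One possible explanation for the missing plugins.cri table, offered as an assumption: with version = 2 configs, containerd addresses plugin sections by their full plugin IDs, so the v1 short names never appear in containerd config dump. A rough sketch of the renaming:)

```python
# Hypothetical helper mapping v1-style containerd config table names to
# their version = 2 equivalents (not exhaustive; names taken from the
# config dump above).
V1_TO_V2 = {
    "plugins.cri": 'plugins."io.containerd.grpc.v1.cri"',
    "plugins.cri.containerd": 'plugins."io.containerd.grpc.v1.cri".containerd',
    "plugins.linux": 'plugins."io.containerd.runtime.v1.linux"',
}

def v2_table(v1_name: str) -> str:
    """Translate a v1-style table name; unknown names pass through unchanged."""
    return V1_TO_V2.get(v1_name, v1_name)

print(v2_table("plugins.cri"))  # plugins."io.containerd.grpc.v1.cri"
```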

wbadart avatar Oct 19 '20 23:10 wbadart

I'm not that familiar with the containerd config file (the nvidia-container-toolkit, on which the k8s-device-plugin relies, does not yet have official support for containerd). The examples I posted above are from a few cases where I saw people getting it to work. We should be adding official containerd support to the nvidia-container-toolkit soon (i.e. within the next 2 months), but until then I don't have any better suggestion than to play around with the config and see if you can get it to work.

I'd be interested to see what you come up with.

I know @tvansteenburgh from Canonical was playing around with this as well (specifically for the use case of getting microk8s to run with GPUs), but I'm not sure how far along he's gotten.

klueska avatar Oct 21 '20 23:10 klueska

Just out of curiosity, I assume you already ran:

microk8s enable gpu

It's not strictly necessary, but doing so will set up everything you need (including your containerd config) to allow GPUs to be used.

One of the things it does is set nvidia-container-runtime as the default runtime in the containerd config used by microk8s: /var/snap/microk8s/current/args/containerd.toml

klueska avatar Oct 23 '20 01:10 klueska

I did give microk8s enable gpu a shot; it seems to be roughly equivalent to applying the daemonset config from this repo manually.

My (admittedly not super satisfying) resolution to this was to switch from microk8s to k3s, which can be configured to use docker rather than containerd. Thanks for pointing me in the right direction!

wbadart avatar Nov 02 '20 16:11 wbadart

Hi there,

Do you happen to be running this on a hybrid GPU setup (e.g. a laptop combining Intel and NVIDIA graphics)?

joedborg avatar Nov 13 '20 18:11 joedborg

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 29 '24 04:02 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Mar 31 '24 04:03 github-actions[bot]