
Container fails to initialize NVML even after setting default docker runtime=nvidia

Open limwenyao opened this issue 5 years ago • 9 comments

1. Issue or feature description

Deploying nvidia-device-plugin-daemonset reports that the container fails to initialize NVML, even after setting the default docker runtime to nvidia

2. Steps to reproduce the issue

  1. Installed nvidia-docker2
sudo apt-get update && sudo apt-get install -y nvidia-docker2
  2. Configure /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2"
}
  3. Restart docker daemon

  4. Installed minikube (as per https://kubernetes.io/docs/tasks/tools/install-minikube/)

  5. Start minikube kubernetes

minikube start --kubernetes-version='1.15.2' --force-systemd=true
  6. Deploy nvidia-device-plugin-daemonset
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/deployments/static/nvidia-device-plugin.yml
  7. Validate nvidia-device-plugin pod
kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
default       cuda-vector-add                        0/1     Pending   0          94m
default       hello-minikube-6fb6cb79cc-7b4b4        1/1     Running   1          47m
kube-system   coredns-5c98db65d4-r2slv               1/1     Running   4          97m
kube-system   coredns-5c98db65d4-z7tgt               1/1     Running   4          97m
kube-system   etcd-minikube                          1/1     Running   3          97m
kube-system   kube-apiserver-minikube                1/1     Running   3          96m
kube-system   kube-controller-manager-minikube       1/1     Running   2          70m
kube-system   kube-proxy-d4zl6                       1/1     Running   3          97m
kube-system   kube-scheduler-minikube                1/1     Running   3          97m
kube-system   nvidia-device-plugin-daemonset-62wnm   1/1     Running   1          29m
kube-system   storage-provisioner                    1/1     Running   7          98m
  8. However, the logs report that the pod cannot initialize the NVML library
kubectl logs nvidia-device-plugin-daemonset-62wnm -n kube-system
2020/07/01 17:04:08 Loading NVML
2020/07/01 17:04:08 Failed to initialize NVML: could not load NVML library.
2020/07/01 17:04:08 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/07/01 17:04:08 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/07/01 17:04:08 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME       GPU
minikube   <none>
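The `<none>` result above can also be extracted programmatically. A minimal sketch (illustrative only; in real use you would feed it the actual output of `kubectl get nodes -o json`, here replaced by an inlined sample):

```python
import json

# Sample shape of `kubectl get nodes -o json`, trimmed to the fields
# used below; a node without GPU resources has no nvidia.com/gpu key.
SAMPLE_NODES = """
{
  "items": [
    {
      "metadata": {"name": "minikube"},
      "status": {"allocatable": {"cpu": "8", "memory": "32594960Ki"}}
    }
  ]
}
"""

def gpu_allocatable(nodes_json: str) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (or '<none>')."""
    items = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]:
            node["status"]["allocatable"].get("nvidia.com/gpu", "<none>")
        for node in items
    }

print(gpu_allocatable(SAMPLE_NODES))  # {'minikube': '<none>'}
```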

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [x] The output of nvidia-smi -a on your host
==============NVSMI LOG==============

Timestamp                           : Thu Jul  2 01:30:38 2020
Driver Version                      : 440.100
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Product Name                    : Quadro RTX 5000
    Product Brand                   : Quadro
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-0c0357fc-c068-1a86-4f65-1943a24500bc
    Minor Number                    : 0
    VBIOS Version                   : 90.04.52.00.36
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1EB510DE
        Bus Id                      : 00000000:01:00.0
        Sub System Id               : 0x09271028
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 1000 KB/s
        Rx Throughput               : 1000 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 16125 MiB
        Used                        : 223 MiB
        Free                        : 15902 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 5 MiB
        Free                        : 251 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 3 %
        Memory                      : 1 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Temperature
        GPU Current Temp            : 51 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 97 C
        GPU Max Operating Temp      : 87 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : N/A
        Power Draw                  : 8.63 W
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : 300 MHz
        SM                          : 300 MHz
        Memory                      : 405 MHz
        Video                       : 540 MHz
    Applications Clocks
        Graphics                    : 1035 MHz
        Memory                      : 7001 MHz
    Default Applications Clocks
        Graphics                    : 1035 MHz
        Memory                      : 7001 MHz
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 7001 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 2120
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 129 MiB
        Process ID                  : 2255
            Type                    : G
            Name                    : /usr/bin/gnome-shell
            Used GPU Memory         : 92 MiB
  • [x] Your docker configuration file (e.g: /etc/docker/daemon.json)
  • [x] The k8s-device-plugin container logs
  • [x] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jul 02 01:34:25 joshua-Precision-7740 kubelet[27518]: F0702 01:34:25.894995   27518 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 359.
Jul 02 01:34:25 joshua-Precision-7740 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jul 02 01:34:15 joshua-Precision-7740 kubelet[27409]: F0702 01:34:15.650766   27409 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 358.
Jul 02 01:34:15 joshua-Precision-7740 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.

Additional information that might help better understand your environment and reproduce the bug:

  • [x] Docker version from docker version
Client: Docker Engine - Community
 Version:           19.03.11
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        42e35e61f3
 Built:             Mon Jun  1 09:12:22 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.11
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       42e35e61f3
  Built:            Mon Jun  1 09:10:54 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 nvidia:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • [ ] Docker command, image and tag used
  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
  • [ ] NVIDIA container library logs (see troubleshooting)

limwenyao avatar Jul 01 '20 17:07 limwenyao

I see you are using minikube. Did you set it up for GPU passthrough? https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/

klueska avatar Jul 01 '20 17:07 klueska

Because docker and the nvidia driver were installed on the host system, I followed the steps for "driver=none": https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/#using-the-none-driver

Steps indicated:

  1. Install minikube (as indicated above)
  2. Install the nvidia driver (apt-get install nvidia-driver-440), nvidia-docker (apt-get install nvidia-docker2), and configure docker with nvidia as the default runtime (/etc/docker/daemon.json). See instructions at https://github.com/NVIDIA/nvidia-docker
  3. Start minikube (step 5 above) & deploy daemonset (step 6 above)
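For step 2, the daemon.json can be sanity-checked with a short script (an illustrative sketch, not part of nvidia-docker or minikube; in real use read /etc/docker/daemon.json instead of the inlined copy):

```python
import json

# The daemon.json from step 2 above, inlined here for the example.
DAEMON_JSON = """
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

def nvidia_is_default(raw: str) -> bool:
    """True only if 'nvidia' is both declared as a runtime and set as default."""
    cfg = json.loads(raw)
    return (cfg.get("default-runtime") == "nvidia"
            and "nvidia" in cfg.get("runtimes", {}))

print(nvidia_is_default(DAEMON_JSON))  # True
```

Note that a correct daemon.json is necessary but not sufficient here: the file only matters if the kubelet is actually using Docker as its container runtime.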

limwenyao avatar Jul 01 '20 17:07 limwenyao

I'm also seeing this error with my microk8s cluster. /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-sbqjz
2020/10/14 18:04:53 Loading NVML
2020/10/14 18:04:53 Failed to initialize NVML: could not load NVML library.
2020/10/14 18:04:53 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/10/14 18:04:53 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/10/14 18:04:53 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
nvidia-smi -a

==============NVSMI LOG==============

Timestamp                           : Wed Oct 14 18:07:18 2020
Driver Version                      : 450.80.02
CUDA Version                        : 11.0

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Product Name                    : GeForce GTX 960M
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    MIG Mode
        Current                     : N/A
        Pending                     : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-79c1533e-4055-68e6-78a7-fa8fd41f6e10
    Minor Number                    : 0
    VBIOS Version                   : 82.07.7C.00.0C
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x139B10DE
        Bus Id                      : 00000000:01:00.0
        Sub System Id               : 0x397817AA
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : N/A
            HW Power Brake Slowdown : N/A
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 4046 MiB
        Used                        : 63 MiB
        Free                        : 3983 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 1 MiB
        Free                        : 255 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Remapped Rows                   : N/A
    Temperature
        GPU Current Temp            : 43 C
        GPU Shutdown Temp           : 101 C
        GPU Slowdown Temp           : 96 C
        GPU Max Operating Temp      : 92 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : 135 MHz
        SM                          : 135 MHz
        Memory                      : 405 MHz
        Video                       : 405 MHz
    Applications Clocks
        Graphics                    : 1097 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 1097 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 1202 MHz
        SM                          : 1202 MHz
        Memory                      : 2505 MHz
        Video                       : 1081 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        GPU instance ID             : N/A
        Compute instance ID         : N/A
        Process ID                  : 1405
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 55 MiB
        GPU instance ID             : N/A
        Compute instance ID         : N/A
        Process ID                  : 2537
            Type                    : G
            Name                    : /usr/bin/gnome-shell
            Used GPU Memory         : 6 MiB

I've tried restarting the docker daemon, microk8s, and the machine itself, to no avail :/

wbadart avatar Oct 14 '20 18:10 wbadart

As far as I know, microk8s uses standalone containerd (not docker), so you will need to set the default runtime in containerd to nvidia-container-runtime instead of in docker.

Various successful containerd configs that I've seen (i.e. diffs to /etc/containerd/config.toml):

 [plugins.linux]
-  runtime = "runc"
+  runtime = "nvidia-container-runtime"

--- default_config.toml  2020-09-03 14:50:45.672414075 +0000
+++ /etc/containerd/config.toml  2020-09-03 06:14:41.402469283 +0000
@@ -65,7 +65,7 @@
     disable_proc_mount = false
     [plugins."io.containerd.grpc.v1.cri".containerd]
       snapshotter = "overlayfs"
-      default_runtime_name = "runc"
+      default_runtime_name = "nvidia-container-runtime"
       no_pivot = false
       [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
         runtime_type = ""
@@ -78,6 +78,11 @@
         runtime_root = ""
         privileged_without_host_devices = false
       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
+        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
+          runtime_type = "io.containerd.runtime.v1.linux"
+          runtime_engine = ""
+          runtime_root = ""
+          privileged_without_host_devices = false
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
           runtime_type = "io.containerd.runc.v1"
           runtime_engine = ""
@@ -105,7 +110,7 @@
     no_prometheus = false
   [plugins."io.containerd.runtime.v1.linux"]
     shim = "containerd-shim"
-    runtime = "runc"
+    runtime = "nvidia-container-runtime"
     runtime_root = ""
     no_shim = false
     shim_debug = false

[plugins.linux]
   runtime = "nvidia-container-runtime"
[plugins.cri]
  disable_tcp_service = true
  [plugins.cri.containerd]
    default_runtime_name = "nvidia-container-runtime"
      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.nvidia-container-runtime]
          runtime_type = "io.containerd.runtime.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false

klueska avatar Oct 16 '20 16:10 klueska

Thanks @klueska! I'm a little embarrassed I missed that minor detail that microk8s doesn't use docker...

Here's my current /etc/containerd/config.toml (everything that's not a comment):

[plugins.linux]
  runtime = "nvidia-container-runtime"
[plugins.cri]
  disable_tcp_service = true
  [plugins.cri.containerd]
    default_runtime_name = "nvidia-container-runtime"
      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.nvidia-container-runtime]
          runtime_type = "io.containerd.runtime.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false

And here's the output of containerd config dump (sorry, I tried to put this in a <details> but the spacing got jacked up):

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
imports = ["/etc/containerd/config.toml"]

[grpc]
  address = "/run/containerd/containerd.sock"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_selinux = false
    sandbox_image = "k8s.gcr.io/pause:3.1"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "nvidia-container-runtime"
      no_pivot = false
      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
          runtime_type = "io.containerd.runtime.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      max_conf_num = 1
      conf_template = ""
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""
  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"
  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false
  [plugins."io.containerd.runtime.v1.linux"]
    shim = "containerd-shim"
    runtime = "nvidia-container-runtime"
    runtime_root = ""
    no_shim = false
    shim_debug = false
  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = ""
    pool_name = ""
    base_image_size = ""

I'm still seeing the same error as before. It's strange that there's no plugins.cri section in containerd config dump, but at the same time it seems to have picked up nvidia-container-runtime in several spots. Any thoughts?
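(One possible explanation for the missing plugins.cri table, offered as an assumption: with version = 2 configs, containerd addresses plugin sections by their full plugin IDs, so the v1 short names never appear in containerd config dump. A rough sketch of the renaming:)

```python
# Hypothetical helper mapping v1-style containerd config table names to
# their version = 2 equivalents (not exhaustive; names taken from the
# config dump above).
V1_TO_V2 = {
    "plugins.cri": 'plugins."io.containerd.grpc.v1.cri"',
    "plugins.cri.containerd": 'plugins."io.containerd.grpc.v1.cri".containerd',
    "plugins.linux": 'plugins."io.containerd.runtime.v1.linux"',
}

def v2_table(v1_name: str) -> str:
    """Translate a v1-style table name; unknown names pass through unchanged."""
    return V1_TO_V2.get(v1_name, v1_name)

print(v2_table("plugins.cri"))  # plugins."io.containerd.grpc.v1.cri"
```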

wbadart avatar Oct 19 '20 23:10 wbadart

I'm not that familiar with the containerd config file (the nvidia-container-toolkit, on which the k8s-device-plugin relies, does not yet have official support for containerd). The examples I posted above are from a few cases where I saw people getting it to work. We should be adding official containerd support to the nvidia-container-toolkit soon (i.e. within the next 2 months), but until then I don't have any better suggestion than to play around with the config and see if you can get it to work.

I'd be interested to see what you come up with.

I know @tvansteenburgh from Canonical was playing around with this as well (specifically for the use case of getting microk8s to run with GPUs), but I'm not sure how far along he's gotten.

klueska avatar Oct 21 '20 23:10 klueska

Just out of curiosity, I assume you already ran:

microk8s enable gpu

It's not strictly necessary, but doing so will set up everything you need (including your containerd config) to allow GPUs to be used.

One of the things it does is set nvidia-container-runtime as the default runtime in the containerd config used by microk8s: /var/snap/microk8s/current/args/containerd.toml

klueska avatar Oct 23 '20 01:10 klueska

I did give microk8s enable gpu a shot; it seems to be roughly equivalent to applying the daemonset config from this repo manually.

My (admittedly not super satisfying) resolution to this was to switch from microk8s to k3s, which can be configured to use docker rather than containerd. Thanks for pointing me in the right direction!

wbadart avatar Nov 02 '20 16:11 wbadart

Hi there,

Do you happen to be running this on a hybrid GPU setup (e.g. a laptop combining Intel and NVIDIA graphics)?

joedborg avatar Nov 13 '20 18:11 joedborg

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 29 '24 04:02 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Mar 31 '24 04:03 github-actions[bot]