
Unable to load NVML on SLES 15.3-based Kubernetes with microk8s

Open kbegiedza opened this issue 4 years ago • 2 comments

I want to run GPU jobs on a Kubernetes cluster installed with microk8s.

On the host everything is fine: I can run nvidia-smi directly and through Docker. But when I try to run the same job with microk8s' ctr, it fails (tried with --privileged too):

microk8s ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi

ctr: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459:
container init caused: Running hook #0:: error running hook: exit status 127, stdout: ,
stderr: /usr/bin/nvidia-container-cli: symbol lookup error: /usr/bin/nvidia-container-cli: undefined symbol: 
nvc_nvcaps_device_from_proc_path, version NVC_1.0: unknown
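
The undefined NVC_1.0 symbol usually points to a version skew: /usr/bin/nvidia-container-cli is newer than the libnvidia-container library it loads at runtime. A minimal diagnostic sketch, assuming the standard RPM package names libnvidia-container1 and libnvidia-container-tools on SLES 15.3:

# Version the CLI binary reports about itself
nvidia-container-cli --version

# Which libnvidia-container.so the binary actually resolves at runtime
ldd /usr/bin/nvidia-container-cli | grep libnvidia-container

# Installed package versions; the two should match
rpm -q libnvidia-container1 libnvidia-container-tools

If the two packages report different versions, updating them together (e.g. via zypper) is the usual fix.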

When I enable GPU support (equivalent to installing https://github.com/NVIDIA/k8s-device-plugin), the Pod starts but logs other errors:

2021/08/24 15:13:13 Loading NVML
2021/08/24 15:13:13 Failed to initialize NVML: could not load NVML library.
2021/08/24 15:13:13 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/24 15:13:13 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/24 15:13:13 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
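
The "could not load NVML library" message from the device plugin typically means libnvidia-ml.so.1 is not visible inside the plugin's container, i.e. the container was not started through the nvidia runtime. Two host-side checks worth running first (a sketch, assuming a 64-bit SLES library layout):

# Confirm the NVML library is registered with the host's dynamic linker
ldconfig -p | grep libnvidia-ml

# Confirm the nvidia runtime wrapper is actually installed and on PATH
which nvidia-container-runtime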

microk8s version: v1.20.9 (revision 2361, channel 1.20/stable)

kbegiedza avatar Aug 24 '21 15:08 kbegiedza

@Ursanon could you provide your containerd config (since it seems that is the container engine you are using)?

elezar avatar Aug 24 '21 15:08 elezar

Thanks for the answer, @elezar.

My containerd.toml:

# Use config version 2 to enable new configuration fields.
version = 2
oom_score = 0

[grpc]
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  address = ""
  uid = 0
  gid = 0

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false

[cgroup]
  path = ""


# The 'plugins."io.containerd.grpc.v1.cri"' table contains all of the server options.
[plugins."io.containerd.grpc.v1.cri"]

  stream_server_address = "127.0.0.1"
  stream_server_port = "0"
  enable_selinux = false
  sandbox_image = "k8s.gcr.io/pause:3.1"
  stats_collect_period = 10
  enable_tls_streaming = false
  max_container_log_line_size = 16384

  # 'plugins."io.containerd.grpc.v1.cri".containerd' contains config related to containerd
  [plugins."io.containerd.grpc.v1.cri".containerd]

    # snapshotter is the snapshotter used by containerd.
    snapshotter = "overlayfs"

    # no_pivot disables pivot-root (linux only), required when running a container in a RamDisk with runc.
    # This only works for runtime type "io.containerd.runtime.v1.linux".
    no_pivot = false

    # default_runtime_name is the default runtime name to use.
    default_runtime_name = "nvidia-container-runtime"

    # 'plugins."io.containerd.grpc.v1.cri".containerd.runtimes' is a map from CRI RuntimeHandler strings, which specify types
    # of runtime configurations, to the matching configurations.
    # In this example, 'runc' is the RuntimeHandler string to match.
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      # runtime_type is the runtime type to use in containerd e.g. io.containerd.runtime.v1.linux
      runtime_type = "io.containerd.runc.v1"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
      # runtime_type is the runtime type to use in containerd e.g. io.containerd.runtime.v1.linux
      runtime_type = "io.containerd.runc.v1"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime.options]
        BinaryName = "nvidia-container-runtime"

  # 'plugins."io.containerd.grpc.v1.cri".cni' contains config related to cni
  [plugins."io.containerd.grpc.v1.cri".cni]
    # bin_dir is the directory in which the binaries for the plugin is kept.
    bin_dir = "/var/snap/microk8s/2361/opt/cni/bin"

    # conf_dir is the directory in which the admin places a CNI conf.
    conf_dir = "/var/snap/microk8s/2361/args/cni-network"

  # 'plugins."io.containerd.grpc.v1.cri".registry' contains config related to the registry
  [plugins."io.containerd.grpc.v1.cri".registry]

    # 'plugins."io.containerd.grpc.v1.cri".registry.mirrors' are namespace to mirror mapping for all namespaces.
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
        endpoint = ["https://registry-1.docker.io", ]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:32000"]
        endpoint = ["http://localhost:32000"]

I have tried a modified config (from the nvidia-docker2 install guide) too, but it didn't help either.
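
One detail that often bites snap-packaged containerd setups like microk8s (an aside, not confirmed in this thread): containerd resolves a bare BinaryName against its own confined PATH, so the snap daemon may fail to find nvidia-container-runtime even when a shell on the host can. A hedged variant of the runtime options stanza using an absolute path:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime.options]
  # Absolute path so the snap-confined containerd does not depend on PATH lookup
  BinaryName = "/usr/bin/nvidia-container-runtime"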

kbegiedza avatar Aug 24 '21 16:08 kbegiedza