Unable to load NVML on SLES 15.3-based Kubernetes with microk8s
I want to run GPU jobs on Kubernetes installed with microk8s.
On the host everything is fine: I can run nvidia-smi directly and through Docker, but when I try to run the same job with microk8s' ctr it fails:
microk8s ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi
(I tried with --privileged too.)
ctr: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459:
container init caused: Running hook #0:: error running hook: exit status 127, stdout: ,
stderr: /usr/bin/nvidia-container-cli: symbol lookup error: /usr/bin/nvidia-container-cli: undefined symbol:
nvc_nvcaps_device_from_proc_path, version NVC_1.0: unknown
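From what I understand, an undefined symbol like this usually points to a version mismatch between /usr/bin/nvidia-container-cli and the libnvidia-container library it loads at runtime (nvc_nvcaps_device_from_proc_path only exists in newer libnvidia-container releases). A quick way to check, assuming standard SLES/rpm packaging:

ldd /usr/bin/nvidia-container-cli | grep libnvidia-container   # which library file the CLI actually resolves
rpm -qa | grep -i nvidia-container                             # installed nvidia-container-* package versions
nvidia-container-cli --version                                 # version the CLI itself reports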
When I enable GPU support (equivalent to https://github.com/NVIDIA/k8s-device-plugin), the Pod starts but logs different errors:
2021/08/24 15:13:13 Loading NVML
2021/08/24 15:13:13 Failed to initialize NVML: could not load NVML library.
2021/08/24 15:13:13 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/24 15:13:13 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/24 15:13:13 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
microk8s version:
microk8s v1.20.9 2361 1.20/stable
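For the NVML error, these are the checks I plan to run next (a sketch; paths assume the standard microk8s snap layout):

ldconfig -p | grep libnvidia-ml                                     # is the NVML library visible to the dynamic linker on the host?
ls /var/snap/microk8s/current/args/containerd*.toml                 # which containerd config files the snap actually uses
grep -A3 nvidia /var/snap/microk8s/current/args/containerd.toml     # is an nvidia runtime configured there?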
@Ursanon could you provide your containerd config (since that seems to be the container engine you are using)?
Thanks for the answer, @elezar.
My containerd.toml:
# Use config version 2 to enable new configuration fields.
version = 2

oom_score = 0

[grpc]
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  address = ""
  uid = 0
  gid = 0

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false

[cgroup]
  path = ""

# The 'plugins."io.containerd.grpc.v1.cri"' table contains all of the server options.
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "0"
  enable_selinux = false
  sandbox_image = "k8s.gcr.io/pause:3.1"
  stats_collect_period = 10
  enable_tls_streaming = false
  max_container_log_line_size = 16384

  # 'plugins."io.containerd.grpc.v1.cri".containerd' contains config related to containerd
  [plugins."io.containerd.grpc.v1.cri".containerd]
    # snapshotter is the snapshotter used by containerd.
    snapshotter = "overlayfs"

    # no_pivot disables pivot-root (linux only), required when running a container in a RamDisk with runc.
    # This only works for runtime type "io.containerd.runtime.v1.linux".
    no_pivot = false

    # default_runtime_name is the default runtime name to use.
    default_runtime_name = "nvidia-container-runtime"

    # 'plugins."io.containerd.grpc.v1.cri".containerd.runtimes' is a map from CRI RuntimeHandler strings, which specify types
    # of runtime configurations, to the matching configurations.
    # In this example, 'runc' is the RuntimeHandler string to match.
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      # runtime_type is the runtime type to use in containerd e.g. io.containerd.runtime.v1.linux
      runtime_type = "io.containerd.runc.v1"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
      # runtime_type is the runtime type to use in containerd e.g. io.containerd.runtime.v1.linux
      runtime_type = "io.containerd.runc.v1"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime.options]
        BinaryName = "nvidia-container-runtime"

  # 'plugins."io.containerd.grpc.v1.cri".cni' contains config related to cni
  [plugins."io.containerd.grpc.v1.cri".cni]
    # bin_dir is the directory in which the binaries for the plugin is kept.
    bin_dir = "/var/snap/microk8s/2361/opt/cni/bin"
    # conf_dir is the directory in which the admin places a CNI conf.
    conf_dir = "/var/snap/microk8s/2361/args/cni-network"

  # 'plugins."io.containerd.grpc.v1.cri".registry' contains config related to the registry
  [plugins."io.containerd.grpc.v1.cri".registry]
    # 'plugins."io.containerd.grpc.v1.cri".registry.mirrors' are namespace to mirror mapping for all namespaces.
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
        endpoint = ["https://registry-1.docker.io", ]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:32000"]
        endpoint = ["http://localhost:32000"]
I have also tried a modified config (from the nvidia-docker2 install guide), but it didn't help either.
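To rule out a plain path problem, these are the checks I can run as well (a sketch; the config path below is the package default and is an assumption for this setup, and note that the snap's containerd may resolve BinaryName through a different PATH than my shell):

which nvidia-container-runtime                    # is the runtime binary resolvable from the shell PATH?
nvidia-container-runtime --version                # it wraps runc, so this should print a version banner
cat /etc/nvidia-container-runtime/config.toml     # runtime/hook configuration (ldconfig path, debug log, etc.)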