Rootless podman 'Failed to initialize NVML: Insufficient Permissions' on OpenSUSE Tumbleweed
1. Issue
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
Failed to initialize NVML: Insufficient Permissions
On OpenSUSE Tumbleweed fwiw.
2. Steps to reproduce the issue
$ nvidia-smi and $ sudo nvidia-smi work.
$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
#no-cgroups = true
#user = "root:video"
user = "root:root"
ldconfig = "@/sbin/ldconfig"
#ldconfig = "/sbin/ldconfig"
[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
#debug = "/tmp/nvidia-container-runtime.log"
$ sudo podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
Tue Feb 15 15:54:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P400 Off | 00000000:01:00.0 Off | N/A |
| 34% 23C P8 N/A / N/A | 2MiB / 2000MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
After toggling no-cgroups = false to no-cgroups = true:
$ podman run --log-level=info --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base
INFO[0000] podman filtering at log level info
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist
INFO[0000] Setting parallel job count to 25
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-0418f928fa7a07a3556432a296aa4ad39c33a716309117f20367f130c7a34b48.scope
INFO[0000] Got Conmon PID as 12406
$ podman run --log-level=info --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist
INFO[0000] Setting parallel job count to 25
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-41c656a076287283f96001ffe442d4bb077993a46553167120c07d7b8c532861.scope
INFO[0000] Got Conmon PID as 12581
Failed to initialize NVML: Insufficient Permissions
3. Information to attach (optional if deemed irrelevant)
- [x] Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
$ nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0215 15:58:47.072670 12695 nvc.c:376] initializing library context (version=1.8.0, build=05959222fe4ce312c121f30c9334157ecaaee260)
I0215 15:58:47.072790 12695 nvc.c:350] using root /
I0215 15:58:47.072823 12695 nvc.c:351] using ldcache /etc/ld.so.cache
I0215 15:58:47.072839 12695 nvc.c:352] using unprivileged user 1000:1000
I0215 15:58:47.072902 12695 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0215 15:58:47.073124 12695 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0215 15:58:47.074559 12696 nvc.c:273] failed to set inheritable capabilities
W0215 15:58:47.074655 12696 nvc.c:274] skipping kernel modules load due to failure
I0215 15:58:47.075261 12697 rpc.c:71] starting driver rpc service
I0215 15:58:47.081207 12699 rpc.c:71] starting nvcgo rpc service
I0215 15:58:47.081744 12695 nvc_info.c:759] requesting driver information with ''
I0215 15:58:47.082662 12695 nvc_info.c:172] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.470.103.01
I0215 15:58:47.082756 12695 nvc_info.c:172] selecting /usr/lib64/libnvoptix.so.470.103.01
I0215 15:58:47.082795 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-tls.so.470.103.01
I0215 15:58:47.082816 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-rtcore.so.470.103.01
I0215 15:58:47.082860 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.470.103.01
I0215 15:58:47.082879 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-opticalflow.so.470.103.01
I0215 15:58:47.082904 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-opencl.so.470.103.01
I0215 15:58:47.082925 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ngx.so.470.103.01
I0215 15:58:47.082946 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ml.so.470.103.01
I0215 15:58:47.082974 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ifr.so.470.103.01
I0215 15:58:47.082992 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glvkspirv.so.470.103.01
I0215 15:58:47.083011 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glsi.so.470.103.01
I0215 15:58:47.083031 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glcore.so.470.103.01
I0215 15:58:47.083052 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-fbc.so.470.103.01
I0215 15:58:47.083072 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-encode.so.470.103.01
I0215 15:58:47.083090 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-eglcore.so.470.103.01
I0215 15:58:47.083107 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-compiler.so.470.103.01
I0215 15:58:47.083125 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-cfg.so.470.103.01
I0215 15:58:47.083143 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-cbl.so.470.103.01
I0215 15:58:47.083161 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-allocator.so.470.103.01
I0215 15:58:47.083182 12695 nvc_info.c:172] selecting /usr/lib64/libnvcuvid.so.470.103.01
I0215 15:58:47.083260 12695 nvc_info.c:172] selecting /usr/lib64/libcuda.so.470.103.01
I0215 15:58:47.083306 12695 nvc_info.c:172] selecting /usr/lib64/libGLX_nvidia.so.470.103.01
I0215 15:58:47.083325 12695 nvc_info.c:172] selecting /usr/lib64/libGLESv2_nvidia.so.470.103.01
I0215 15:58:47.083344 12695 nvc_info.c:172] selecting /usr/lib64/libGLESv1_CM_nvidia.so.470.103.01
I0215 15:58:47.083361 12695 nvc_info.c:172] selecting /usr/lib64/libEGL_nvidia.so.470.103.01
I0215 15:58:47.083384 12695 nvc_info.c:172] selecting /usr/lib/vdpau/libvdpau_nvidia.so.470.103.01
I0215 15:58:47.083406 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-tls.so.470.103.01
I0215 15:58:47.083424 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ptxjitcompiler.so.470.103.01
I0215 15:58:47.083442 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-opticalflow.so.470.103.01
I0215 15:58:47.083459 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-opencl.so.470.103.01
I0215 15:58:47.083476 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ml.so.470.103.01
I0215 15:58:47.083495 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ifr.so.470.103.01
I0215 15:58:47.083513 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glvkspirv.so.470.103.01
I0215 15:58:47.083530 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glsi.so.470.103.01
I0215 15:58:47.083547 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glcore.so.470.103.01
I0215 15:58:47.083565 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-fbc.so.470.103.01
I0215 15:58:47.083582 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-encode.so.470.103.01
I0215 15:58:47.083599 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-eglcore.so.470.103.01
I0215 15:58:47.083617 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-compiler.so.470.103.01
I0215 15:58:47.083636 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-allocator.so.470.103.01
I0215 15:58:47.083655 12695 nvc_info.c:172] selecting /usr/lib/libnvcuvid.so.470.103.01
I0215 15:58:47.083680 12695 nvc_info.c:172] selecting /usr/lib/libcuda.so.470.103.01
I0215 15:58:47.083707 12695 nvc_info.c:172] selecting /usr/lib/libGLX_nvidia.so.470.103.01
I0215 15:58:47.083726 12695 nvc_info.c:172] selecting /usr/lib/libGLESv2_nvidia.so.470.103.01
I0215 15:58:47.083744 12695 nvc_info.c:172] selecting /usr/lib/libGLESv1_CM_nvidia.so.470.103.01
I0215 15:58:47.083763 12695 nvc_info.c:172] selecting /usr/lib/libEGL_nvidia.so.470.103.01
W0215 15:58:47.083773 12695 nvc_info.c:398] missing library libnvidia-nscq.so
W0215 15:58:47.083777 12695 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0215 15:58:47.083781 12695 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0215 15:58:47.083785 12695 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0215 15:58:47.083789 12695 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0215 15:58:47.083793 12695 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0215 15:58:47.083796 12695 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0215 15:58:47.083799 12695 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0215 15:58:47.083802 12695 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0215 15:58:47.083805 12695 nvc_info.c:402] missing compat32 library libnvoptix.so
W0215 15:58:47.083808 12695 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0215 15:58:47.083959 12695 nvc_info.c:298] selecting /usr/bin/nvidia-smi
I0215 15:58:47.083971 12695 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump
I0215 15:58:47.083982 12695 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced
I0215 15:58:47.083996 12695 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control
I0215 15:58:47.084006 12695 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server
W0215 15:58:47.084016 12695 nvc_info.c:424] missing binary nv-fabricmanager
I0215 15:58:47.084032 12695 nvc_info.c:342] listing firmware path /usr/lib/firmware/nvidia/470.103.01/gsp.bin
I0215 15:58:47.084045 12695 nvc_info.c:522] listing device /dev/nvidiactl
I0215 15:58:47.084048 12695 nvc_info.c:522] listing device /dev/nvidia-uvm
I0215 15:58:47.084052 12695 nvc_info.c:522] listing device /dev/nvidia-uvm-tools
I0215 15:58:47.084055 12695 nvc_info.c:522] listing device /dev/nvidia-modeset
W0215 15:58:47.084068 12695 nvc_info.c:348] missing ipc path /var/run/nvidia-persistenced/socket
W0215 15:58:47.084080 12695 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0215 15:58:47.084090 12695 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0215 15:58:47.084093 12695 nvc_info.c:815] requesting device information with ''
I0215 15:58:47.089567 12695 nvc_info.c:706] listing device /dev/nvidia0 (GPU-08283365-4b53-3311-bff5-d5c37f82021d at 00000000:01:00.0)
NVRM version: 470.103.01
CUDA version: 11.4
Device Index: 0
Device Minor: 0
Model: Quadro P400
Brand: Quadro
GPU UUID: GPU-08283365-4b53-3311-bff5-d5c37f82021d
Bus Location: 00000000:01:00.0
Architecture: 6.1
I0215 15:58:47.089598 12695 nvc.c:430] shutting down library context
I0215 15:58:47.089625 12699 rpc.c:95] terminating nvcgo rpc service
I0215 15:58:47.089927 12695 rpc.c:135] nvcgo rpc service terminated successfully
I0215 15:58:47.090430 12697 rpc.c:95] terminating driver rpc service
I0215 15:58:47.090551 12695 rpc.c:135] driver rpc service terminated successfully
- [x] Kernel version from
uname -a
Linux satellite 5.16.8-1-default #1 SMP PREEMPT Thu Feb 10 11:31:59 UTC 2022 (5d1f5d2) x86_64 x86_64 x86_64 GNU/Linux
- [x] Driver information from
nvidia-smi -a
$ nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Tue Feb 15 17:01:29 2022
Driver Version : 470.103.01
CUDA Version : 11.4
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : Quadro P400
Product Brand : Quadro
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1422521034591
GPU UUID : GPU-08283365-4b53-3311-bff5-d5c37f82021d
Minor Number : 0
VBIOS Version : 86.07.8F.00.02
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : 900-5G178-1701-000
Module ID : 0
Inforom Version
Image Version : G178.0500.00.02
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1CB310DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x11BE10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 34 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 2000 MiB
Used : 2 MiB
Free : 1998 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 4 MiB
Free : 252 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
- [x] Podman version from
podman version
$ podman version
Version: 3.4.4
API Version: 3.4.4
Go Version: go1.13.15
Built: Thu Dec 9 01:00:00 2021
OS/Arch: linux/amd64
- [x] NVIDIA container library version from
nvidia-container-cli -V
$ nvidia-container-cli -V
cli-version: 1.8.0
lib-version: 1.8.0
build date: 2022-02-04T09:21+00:00
build revision: 05959222fe4ce312c121f30c9334157ecaaee260
build compiler: gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
If you set no-cgroups=true then nvidia-docker will not set up the cgroups for any of your GPUs, and NVML will not be able to talk to them (unless you pass the device references yourself on the podman command line). Can you explain what you are trying to do by turning this flag on?
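For reference, passing the device references explicitly on the podman command line would look roughly like this (a sketch only; the device nodes are the standard NVIDIA ones listed by nvidia-container-cli above, adjust to your system):

$ podman run --rm --security-opt=label=disable \
    --hooks-dir=/usr/share/containers/oci/hooks.d/ \
    --device /dev/nvidia0 --device /dev/nvidiactl \
    --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
    nvidia/cuda:11.0-base nvidia-smi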
I was following the steps as laid out here:

To be able to run rootless containers with podman, we need the following configuration change to the NVIDIA runtime:

sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml
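Note that the sed pattern in that guide only matches the commented-out default; on a config where the line is already uncommented (as in the dump at the top of this issue), a variant that tolerates either form would be needed, e.g. this sketch:

$ sudo sed -i 's/^#\?no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml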
Without toggling cgroups:
$ podman run --log-level=info --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist
INFO[0000] Setting parallel job count to 25
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-dba911758792809762f61b4a5819b849b63a003176bf052674c5b9b533ea701e.scope
Error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: OCI permission denied
Edit:
(unless you pass the device references yourself on the podman command line)
If this is the solution, how do I pass the device references?
Ah right, with podman you can have no-cgroups=true and not explicitly pass the device list (because podman will infer the cgroups that need to be set up for the bind-mounted dev files that get passed in). With docker / containerd this is not the case, and I got confused.
Given the error that you have, it appears that you are attempting to run this on a system with cgroupv2 enabled. The fact that you set no-cgroups = true though, means that you should not be going down this path (unless of course there is a bug in the code that allows this).
I will need to look into it further and get back to you.
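A quick way to confirm which cgroup version the host is using (a suggestion, not output from this thread):

$ stat -fc %T /sys/fs/cgroup
# "cgroup2fs" indicates the unified cgroup v2 hierarchy; "tmpfs" indicates cgroup v1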
For the record, I have no preference for running with no-cgroups=true. If rootless can work without, that is fine by me.
Can you give me a bit more info about your setup? I spun up an Ubuntu 21.10 image on AWS, installed podman on it, set no-cgroups=true, and was able to get things working as expected:
$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:11.0-base nvidia-smi
Wed Feb 16 12:24:10 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| N/A 32C P0 36W / 300W | 0MiB / 16384MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I didn't try on OpenSUSE Tumbleweed, but I can't imagine what might be different there in terms of this.
Can you update to version 1.8.1 of the toolkit (released on Monday) and see if one of the bugs we fixed there was relevant to your issue?
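On openSUSE that update would be roughly the following sketch (a plain zypper install; on the transactional setup described below, sudo transactional-update pkg update nvidia-container-toolkit followed by a reboot would be the rough equivalent):

$ sudo zypper refresh
$ sudo zypper install nvidia-container-toolkit   # 1.8.1-1 from the libnvidia-container repo, per the zypper output below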
Can you give me a bit more info about your setup?
Specifically, this is on a system running the transactional-server role of Tumbleweed, so with an immutable root. (This caused an issue, recently fixed, where some mounts produced an error because they could not be mounted read-write.) To install nvidia-container-toolkit I had to add the 15.1 repo (which I swear was called 15.x when I did) from here: https://nvidia.github.io/nvidia-docker/.
$ zypper lr -pr
# | Alias | Name | Enabled | GPG Check | Refresh | Priority
---+---------------------------------------+---------------------------------------+---------+-----------+---------+---------
1 | NVIDIA | NVIDIA | Yes | (r ) Yes | Yes | 99
2 | libnvidia-container | libnvidia-container | Yes | (r ) Yes | No | 99
3 | libnvidia-container-experimental | libnvidia-container-experimental | No | ---- | ---- | 99
4 | nvidia-container-runtime | nvidia-container-runtime | Yes | (r ) Yes | No | 99
5 | nvidia-container-runtime-experimental | nvidia-container-runtime-experimental | No | ---- | ---- | 99
6 | nvidia-docker | nvidia-docker | Yes | (r ) Yes | No | 99
7 | openSUSE-20211107-0 | openSUSE-20211107-0 | No | ---- | ---- | 99
8 | repo-debug | openSUSE-Tumbleweed-Debug | No | ---- | ---- | 99
9 | repo-non-oss | openSUSE-Tumbleweed-Non-Oss | Yes | (r ) Yes | Yes | 99
10 | repo-oss | openSUSE-Tumbleweed-Oss | Yes | (r ) Yes | Yes | 99
11 | repo-source | openSUSE-Tumbleweed-Source | No | ---- | ---- | 99
12 | repo-update | openSUSE-Tumbleweed-Update | Yes | (r ) Yes | Yes | 99
$ zypper se -s nvidia-container-toolkit
Loading repository data...
Reading installed packages...
S | Name | Type | Version | Arch | Repository
--+--------------------------+---------+---------+--------+-------------------------
i | nvidia-container-toolkit | package | 1.8.1-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.8.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.7.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.6.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.5.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.5.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.2-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.3.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.2.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.2.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.2-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.0.5-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.0.4-1 | x86_64 | nvidia-container-runtime
Can you update to version 1.8.1 of the toolkit (released on Monday) and see if one of the bugs we fixed there was relevant to your issue?
$ nvidia-container-cli -V
cli-version: 1.8.1
lib-version: 1.8.1
build date: 2022-02-14T12:05+00:00
build revision: abd4e14d8cb923e2a70b7dcfee55fbc16bffa353
build compiler: gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ podman run --log-level=info --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist
INFO[0000] Setting parallel job count to 25
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-f74ea1c8a0116a3bcb970f03eb97b400c65b80a77dfefb997f6faf776c3e1982.scope
INFO[0004] Got Conmon PID as 27212
Failed to initialize NVML: Insufficient Permissions
INFO[0004] Container f74ea1c8a0116a3bcb970f03eb97b400c65b80a77dfefb997f6faf776c3e1982 was already removed, skipping --rm
I will try and bring up a similar setup soon. Could you check if maybe this is relevant in the meantime: https://github.com/NVIDIA/nvidia-docker/issues/1547#issuecomment-1041565769
I had seen that issue, and I got the best results (I think) while using user=root:root. I see that in your attempt you do not specify a user; I haven't tried that yet.
$ ls -l /dev/nvidia*
crw-rw---- 1 root video 195, 0 Feb 15 19:37 /dev/nvidia0
crw-rw---- 1 root video 195, 255 Feb 15 19:37 /dev/nvidiactl
crw-rw---- 1 root video 195, 254 Feb 15 19:37 /dev/nvidia-modeset
crw-rw---- 1 root video 238, 0 Feb 15 19:37 /dev/nvidia-uvm
crw-rw---- 1 root video 238, 1 Feb 15 19:37 /dev/nvidia-uvm-tools
$ groups
[me] users wheel video
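Since rootless podman runs the container inside a user namespace, the supplementary video group seen above does not automatically carry over into it. A quick way to see how those device nodes appear from inside the namespace (a suggestion, not output from this thread):

$ podman unshare ls -l /dev/nvidia*
# the video group typically shows up as "nobody" here, because that gid is not mapped into the rootless user namespace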
With user=root:video and without a user specified, I get the same result:
$ podman run --log-level=info --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist
INFO[0000] Setting parallel job count to 25
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-94a161a6829d7307ffb7ff7288f8ebe83475cf37e65bd487c425081f84ac9252.scope
Error: OCI runtime error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: insufficient permissions
$ podman run --log-level=info --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base
INFO[0000] podman filtering at log level info
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist
INFO[0000] Setting parallel job count to 25
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-46c6aada867c650652a20e26ad81001dd592029b6a319a3f7b520afc309e7a18.scope
Error: OCI runtime error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: insufficient permissions