
Rootless podman 'Failed to initialize NVML: Insufficient Permissions' on OpenSUSE Tumbleweed

Open RlndVt opened this issue 3 years ago • 8 comments

1. Issue

$ podman run --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base nvidia-smi
Failed to initialize NVML: Insufficient Permissions

On OpenSUSE Tumbleweed fwiw.

2. Steps to reproduce the issue

$ nvidia-smi and $ sudo nvidia-smi work.

$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
#no-cgroups = true
#user = "root:video"
user = "root:root"
ldconfig = "@/sbin/ldconfig"
#ldconfig = "/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
#debug = "/tmp/nvidia-container-runtime.log"
$ sudo podman run --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base nvidia-smi
Tue Feb 15 15:54:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P400         Off  | 00000000:01:00.0 Off |                  N/A |
| 34%   23C    P8    N/A /  N/A |      2MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

After toggling no-cgroups = false to no-cgroups = true:

$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-0418f928fa7a07a3556432a296aa4ad39c33a716309117f20367f130c7a34b48.scope 
INFO[0000] Got Conmon PID as 12406  
$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-41c656a076287283f96001ffe442d4bb077993a46553167120c07d7b8c532861.scope 
INFO[0000] Got Conmon PID as 12581                      
Failed to initialize NVML: Insufficient Permissions

3. Information to attach (optional if deemed irrelevant)

  • [x] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
$ nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0215 15:58:47.072670 12695 nvc.c:376] initializing library context (version=1.8.0, build=05959222fe4ce312c121f30c9334157ecaaee260)
I0215 15:58:47.072790 12695 nvc.c:350] using root /
I0215 15:58:47.072823 12695 nvc.c:351] using ldcache /etc/ld.so.cache
I0215 15:58:47.072839 12695 nvc.c:352] using unprivileged user 1000:1000
I0215 15:58:47.072902 12695 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0215 15:58:47.073124 12695 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0215 15:58:47.074559 12696 nvc.c:273] failed to set inheritable capabilities
W0215 15:58:47.074655 12696 nvc.c:274] skipping kernel modules load due to failure
I0215 15:58:47.075261 12697 rpc.c:71] starting driver rpc service
I0215 15:58:47.081207 12699 rpc.c:71] starting nvcgo rpc service
I0215 15:58:47.081744 12695 nvc_info.c:759] requesting driver information with ''
I0215 15:58:47.082662 12695 nvc_info.c:172] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.470.103.01
I0215 15:58:47.082756 12695 nvc_info.c:172] selecting /usr/lib64/libnvoptix.so.470.103.01
I0215 15:58:47.082795 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-tls.so.470.103.01
I0215 15:58:47.082816 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-rtcore.so.470.103.01
I0215 15:58:47.082860 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.470.103.01
I0215 15:58:47.082879 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-opticalflow.so.470.103.01
I0215 15:58:47.082904 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-opencl.so.470.103.01
I0215 15:58:47.082925 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ngx.so.470.103.01
I0215 15:58:47.082946 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ml.so.470.103.01
I0215 15:58:47.082974 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ifr.so.470.103.01
I0215 15:58:47.082992 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glvkspirv.so.470.103.01
I0215 15:58:47.083011 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glsi.so.470.103.01
I0215 15:58:47.083031 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glcore.so.470.103.01
I0215 15:58:47.083052 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-fbc.so.470.103.01
I0215 15:58:47.083072 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-encode.so.470.103.01
I0215 15:58:47.083090 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-eglcore.so.470.103.01
I0215 15:58:47.083107 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-compiler.so.470.103.01
I0215 15:58:47.083125 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-cfg.so.470.103.01
I0215 15:58:47.083143 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-cbl.so.470.103.01
I0215 15:58:47.083161 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-allocator.so.470.103.01
I0215 15:58:47.083182 12695 nvc_info.c:172] selecting /usr/lib64/libnvcuvid.so.470.103.01
I0215 15:58:47.083260 12695 nvc_info.c:172] selecting /usr/lib64/libcuda.so.470.103.01
I0215 15:58:47.083306 12695 nvc_info.c:172] selecting /usr/lib64/libGLX_nvidia.so.470.103.01
I0215 15:58:47.083325 12695 nvc_info.c:172] selecting /usr/lib64/libGLESv2_nvidia.so.470.103.01
I0215 15:58:47.083344 12695 nvc_info.c:172] selecting /usr/lib64/libGLESv1_CM_nvidia.so.470.103.01
I0215 15:58:47.083361 12695 nvc_info.c:172] selecting /usr/lib64/libEGL_nvidia.so.470.103.01
I0215 15:58:47.083384 12695 nvc_info.c:172] selecting /usr/lib/vdpau/libvdpau_nvidia.so.470.103.01
I0215 15:58:47.083406 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-tls.so.470.103.01
I0215 15:58:47.083424 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ptxjitcompiler.so.470.103.01
I0215 15:58:47.083442 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-opticalflow.so.470.103.01
I0215 15:58:47.083459 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-opencl.so.470.103.01
I0215 15:58:47.083476 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ml.so.470.103.01
I0215 15:58:47.083495 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ifr.so.470.103.01
I0215 15:58:47.083513 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glvkspirv.so.470.103.01
I0215 15:58:47.083530 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glsi.so.470.103.01
I0215 15:58:47.083547 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glcore.so.470.103.01
I0215 15:58:47.083565 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-fbc.so.470.103.01
I0215 15:58:47.083582 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-encode.so.470.103.01
I0215 15:58:47.083599 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-eglcore.so.470.103.01
I0215 15:58:47.083617 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-compiler.so.470.103.01
I0215 15:58:47.083636 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-allocator.so.470.103.01
I0215 15:58:47.083655 12695 nvc_info.c:172] selecting /usr/lib/libnvcuvid.so.470.103.01
I0215 15:58:47.083680 12695 nvc_info.c:172] selecting /usr/lib/libcuda.so.470.103.01
I0215 15:58:47.083707 12695 nvc_info.c:172] selecting /usr/lib/libGLX_nvidia.so.470.103.01
I0215 15:58:47.083726 12695 nvc_info.c:172] selecting /usr/lib/libGLESv2_nvidia.so.470.103.01
I0215 15:58:47.083744 12695 nvc_info.c:172] selecting /usr/lib/libGLESv1_CM_nvidia.so.470.103.01
I0215 15:58:47.083763 12695 nvc_info.c:172] selecting /usr/lib/libEGL_nvidia.so.470.103.01
W0215 15:58:47.083773 12695 nvc_info.c:398] missing library libnvidia-nscq.so
W0215 15:58:47.083777 12695 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0215 15:58:47.083781 12695 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0215 15:58:47.083785 12695 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0215 15:58:47.083789 12695 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0215 15:58:47.083793 12695 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0215 15:58:47.083796 12695 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0215 15:58:47.083799 12695 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0215 15:58:47.083802 12695 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0215 15:58:47.083805 12695 nvc_info.c:402] missing compat32 library libnvoptix.so
W0215 15:58:47.083808 12695 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0215 15:58:47.083959 12695 nvc_info.c:298] selecting /usr/bin/nvidia-smi
I0215 15:58:47.083971 12695 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump
I0215 15:58:47.083982 12695 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced
I0215 15:58:47.083996 12695 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control
I0215 15:58:47.084006 12695 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server
W0215 15:58:47.084016 12695 nvc_info.c:424] missing binary nv-fabricmanager
I0215 15:58:47.084032 12695 nvc_info.c:342] listing firmware path /usr/lib/firmware/nvidia/470.103.01/gsp.bin
I0215 15:58:47.084045 12695 nvc_info.c:522] listing device /dev/nvidiactl
I0215 15:58:47.084048 12695 nvc_info.c:522] listing device /dev/nvidia-uvm
I0215 15:58:47.084052 12695 nvc_info.c:522] listing device /dev/nvidia-uvm-tools
I0215 15:58:47.084055 12695 nvc_info.c:522] listing device /dev/nvidia-modeset
W0215 15:58:47.084068 12695 nvc_info.c:348] missing ipc path /var/run/nvidia-persistenced/socket
W0215 15:58:47.084080 12695 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0215 15:58:47.084090 12695 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0215 15:58:47.084093 12695 nvc_info.c:815] requesting device information with ''
I0215 15:58:47.089567 12695 nvc_info.c:706] listing device /dev/nvidia0 (GPU-08283365-4b53-3311-bff5-d5c37f82021d at 00000000:01:00.0)
NVRM version:   470.103.01
CUDA version:   11.4

Device Index:   0
Device Minor:   0
Model:          Quadro P400
Brand:          Quadro
GPU UUID:       GPU-08283365-4b53-3311-bff5-d5c37f82021d
Bus Location:   00000000:01:00.0
Architecture:   6.1
I0215 15:58:47.089598 12695 nvc.c:430] shutting down library context
I0215 15:58:47.089625 12699 rpc.c:95] terminating nvcgo rpc service
I0215 15:58:47.089927 12695 rpc.c:135] nvcgo rpc service terminated successfully
I0215 15:58:47.090430 12697 rpc.c:95] terminating driver rpc service
I0215 15:58:47.090551 12695 rpc.c:135] driver rpc service terminated successfully
  • [x] Kernel version from uname -a
Linux satellite 5.16.8-1-default #1 SMP PREEMPT Thu Feb 10 11:31:59 UTC 2022 (5d1f5d2) x86_64 x86_64 x86_64 GNU/Linux
  • [x] Driver information from nvidia-smi -a
$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Tue Feb 15 17:01:29 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : Quadro P400
    Product Brand                         : Quadro
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1422521034591
    GPU UUID                              : GPU-08283365-4b53-3311-bff5-d5c37f82021d
    Minor Number                          : 0
    VBIOS Version                         : 86.07.8F.00.02
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : 900-5G178-1701-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G178.0500.00.02
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1CB310DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x11BE10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 34 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 2000 MiB
        Used                              : 2 MiB
        Free                              : 1998 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 4 MiB
        Free                              : 252 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit            
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit            
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit            
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
  • [x] Podman version from podman version
$  podman version
Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.13.15
Built:        Thu Dec  9 01:00:00 2021
OS/Arch:      linux/amd64
  • [x] NVIDIA container library version from nvidia-container-cli -V
$ nvidia-container-cli -V
cli-version: 1.8.0
lib-version: 1.8.0
build date: 2022-02-04T09:21+00:00
build revision: 05959222fe4ce312c121f30c9334157ecaaee260
build compiler: gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

RlndVt · Feb 15 '22 16:02

If you set no-cgroups=true then nvidia-docker will not set up the cgroups for any of your GPUs, and NVML will not be able to talk to them (unless you pass the device references yourself on the podman command line). Can you explain what you are trying to do by turning this flag on?
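
(For reference, passing the device references explicitly would look roughly like the sketch below. --device is a standard podman flag; the /dev/nvidia* node names are the usual ones and match the listing later in this thread, so treat this as an illustrative sketch rather than a verified fix.)

$ podman run --rm --security-opt=label=disable \
      --hooks-dir=/usr/share/containers/oci/hooks.d/ \
      --device /dev/nvidia0 --device /dev/nvidiactl \
      --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
      nvidia/cuda:11.0-base nvidia-smi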

klueska · Feb 15 '22 16:02

I was following the steps as laid out here:

To be able to run rootless containers with podman, we need the following configuration change to the NVIDIA runtime: sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml

Without toggling no-cgroups (i.e. with no-cgroups = false):

$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-dba911758792809762f61b4a5819b849b63a003176bf052674c5b9b533ea701e.scope 
Error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: OCI permission denied

Edit:

(unless you pass the device references yourself on the podman command line)

If this is the solution, how do I pass the device references?

RlndVt · Feb 15 '22 16:02

Ah right, with podman you can have no-cgroups=true and not explicitly pass the device list (because podman will infer the cgroups that need to be set up for the bind-mounted dev files that get passed in). With docker / containerd this is not the case, and I got confused.

Given the error you are seeing, it appears that you are attempting to run this on a system with cgroup v2 enabled. The fact that you set no-cgroups = true, though, means that you should not be going down this code path at all (unless of course there is a bug in the code that allows this).
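
(A quick, generic way to confirm which cgroup version a host is running, independent of this particular issue, is to check the filesystem type mounted at /sys/fs/cgroup:)

$ stat -fc %T /sys/fs/cgroup/    # prints cgroup2fs on a cgroup v2 host, tmpfs on a v1 hierarchy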

I will need to look into it further and get back to you.

klueska · Feb 16 '22 11:02

For the record, I have no preference for running with no-cgroups=true. If rootless can work without, that is fine by me.

RlndVt · Feb 16 '22 11:02

Can you give me a bit more info about your setup? I spun up an ubuntu21.10 image on AWS, installed podman on it, set no-cgroups=true and was able to get things working as expected:

$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
$ podman run --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/   docker.io/nvidia/cuda:11.0-base nvidia-smi
Wed Feb 16 12:24:10 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    36W / 300W |      0MiB / 16384MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I didn't try on OpenSUSE Tumbleweed, but I can't imagine what might be different there in terms of this.

Can you update to version 1.8.1 of the toolkit (released on Monday) and see if one of the bugs we fixed there was relevant to your issue?
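
(On openSUSE, with the libnvidia-container repository enabled as shown in the reply below, the update would be something along the lines of this sketch:)

$ sudo zypper refresh
$ sudo zypper update nvidia-container-toolkit    # should pull the 1.8.1-1 package from the libnvidia-container repo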

klueska · Feb 16 '22 12:02

Can you give me a bit more info about your setup?

Specifically, this is a system running the transactional-server role of Tumbleweed, i.e. with an immutable root. (This caused an issue, since fixed, where some mounts produced an error because they could not be mounted rw.) To install nvidia-container-toolkit I had to add the 15.1 repo (which I swear was called 15.x when I did) from here: https://nvidia.github.io/nvidia-docker/.

$ zypper lr -pr
#  | Alias                                 | Name                                  | Enabled | GPG Check | Refresh | Priority
---+---------------------------------------+---------------------------------------+---------+-----------+---------+---------
 1 | NVIDIA                                | NVIDIA                                | Yes     | (r ) Yes  | Yes     |   99
 2 | libnvidia-container                   | libnvidia-container                   | Yes     | (r ) Yes  | No      |   99
 3 | libnvidia-container-experimental      | libnvidia-container-experimental      | No      | ----      | ----    |   99
 4 | nvidia-container-runtime              | nvidia-container-runtime              | Yes     | (r ) Yes  | No      |   99
 5 | nvidia-container-runtime-experimental | nvidia-container-runtime-experimental | No      | ----      | ----    |   99
 6 | nvidia-docker                         | nvidia-docker                         | Yes     | (r ) Yes  | No      |   99
 7 | openSUSE-20211107-0                   | openSUSE-20211107-0                   | No      | ----      | ----    |   99
 8 | repo-debug                            | openSUSE-Tumbleweed-Debug             | No      | ----      | ----    |   99
 9 | repo-non-oss                          | openSUSE-Tumbleweed-Non-Oss           | Yes     | (r ) Yes  | Yes     |   99
10 | repo-oss                              | openSUSE-Tumbleweed-Oss               | Yes     | (r ) Yes  | Yes     |   99
11 | repo-source                           | openSUSE-Tumbleweed-Source            | No      | ----      | ----    |   99
12 | repo-update                           | openSUSE-Tumbleweed-Update            | Yes     | (r ) Yes  | Yes     |   99
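
(For completeness: the NVIDIA repositories listed above are normally added from the .repo files published under https://nvidia.github.io/nvidia-docker/, roughly as in the sketch below. The exact distribution path, opensuse-leap15.1 here, is an assumption based on the "15.1 repo" mentioned above.)

$ distribution=opensuse-leap15.1    # assumed path; pick the one matching your distribution
$ sudo zypper ar "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.repo"
$ sudo zypper refresh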

$ zypper se -s nvidia-container-toolkit
Loading repository data...
Reading installed packages...

S | Name                     | Type    | Version | Arch   | Repository
--+--------------------------+---------+---------+--------+-------------------------
i | nvidia-container-toolkit | package | 1.8.1-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.8.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.7.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.6.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.5.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.5.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.2-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.3.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.2.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.2.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.2-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.0.5-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.0.4-1 | x86_64 | nvidia-container-runtime

Can you update to version 1.8.1 of the toolkit (released on Monday) and see if one of the bugs we fixed there was relevant to your issue?

$ nvidia-container-cli -V
cli-version: 1.8.1
lib-version: 1.8.1
build date: 2022-02-14T12:05+00:00
build revision: abd4e14d8cb923e2a70b7dcfee55fbc16bffa353
build compiler: gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/   nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-f74ea1c8a0116a3bcb970f03eb97b400c65b80a77dfefb997f6faf776c3e1982.scope 
INFO[0004] Got Conmon PID as 27212                      
Failed to initialize NVML: Insufficient Permissions
INFO[0004] Container f74ea1c8a0116a3bcb970f03eb97b400c65b80a77dfefb997f6faf776c3e1982 was already removed, skipping --rm

RlndVt · Feb 16 '22 13:02

I will try and bring up a similar setup soon. Could you check if maybe this is relevant in the meantime: https://github.com/NVIDIA/nvidia-docker/issues/1547#issuecomment-1041565769

klueska · Feb 16 '22 15:02

I had seen that issue, and I got the best results (I think) while using user=root:root. I see that in your attempt you do not specify a user; I haven't tried that yet.

$ ls -l /dev/nvidia*
crw-rw---- 1 root video 195,   0 Feb 15 19:37 /dev/nvidia0
crw-rw---- 1 root video 195, 255 Feb 15 19:37 /dev/nvidiactl
crw-rw---- 1 root video 195, 254 Feb 15 19:37 /dev/nvidia-modeset
crw-rw---- 1 root video 238,   0 Feb 15 19:37 /dev/nvidia-uvm
crw-rw---- 1 root video 238,   1 Feb 15 19:37 /dev/nvidia-uvm-tools
$ groups
[me] users wheel video

With user=root:video, and with no user specified at all, I get the same result:

$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/   nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-94a161a6829d7307ffb7ff7288f8ebe83475cf37e65bd487c425081f84ac9252.scope 
Error: OCI runtime error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: insufficient permissions
$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/   nvidia/cuda:11.0-base
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-46c6aada867c650652a20e26ad81001dd592029b6a319a3f7b520afc309e7a18.scope 
Error: OCI runtime error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: insufficient permissions
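
(One way to narrow this down, purely as a diagnostic and not a recommended permanent setting: temporarily loosen the permissions on the device nodes and retry. If the error disappears, that would suggest the root:video 0660 ownership is not usable from inside the rootless user namespace. podman's --group-add keep-groups option can preserve supplementary groups such as video inside the container, but it relies on crun rather than runc.)

$ sudo chmod 666 /dev/nvidia*    # diagnostic only: make the NVIDIA device nodes world-accessible
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi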

RlndVt · Feb 16 '22 15:02