nvidia-docker
Failed to initialize NVML: Unknown Error after calling `systemctl daemon-reload`
1. Issue or feature description
Failed to initialize NVML: Unknown Error
This does not occur when the NVIDIA container is first created, but it starts happening after calling systemctl daemon-reload.
It works fine with kernel 4.19.91 and systemd 219, but fails with kernel 5.10.23 and systemd 239.
I tried to monitor it with bpftrace.
During container startup, I can see these events:
systemd, 1, c 195:* m, cri-containerd-d254b91e9b76d5e6a1b787f4fc6004f0f6318ac910360ee0
runc, 2, c 195:* m, cri-containerd-d254b91e9b76d5e6a1b787f4fc6004f0f6318ac910360ee0
runc, 1, c 195:0 rw, cri-containerd-ef460b0cb5dc0f9103ef2ed266863333eded42abedaf7960
runc, 1, c 195:254 rw, cri-containerd-ef460b0cb5dc0f9103ef2ed266863333eded42abedaf7960
runc, 1, c 195:255 rw, cri-containerd-ef460b0cb5dc0f9103ef2ed266863333eded42abedaf7960
And the devices.list in the container looks like this:
cat /sys/fs/cgroup/devices/devices.list
...
c 195:254 rw
c 195:0 rw
But after running systemctl daemon-reload, I only see this event:
systemd, 1, c 195:* m, cri-containerd-d254b91e9b76d5e6a1b787f4fc6004f0f6318ac910360ee0
And the devices.list in the container becomes:
cat /sys/fs/cgroup/devices/devices.list
...
c 195:* m
The GPU devices are no longer allowed rw access.
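The same before/after observation can also be made from the host without bpftrace; a rough sketch, assuming a container named test and the legacy (v1) devices controller:
# Rough sketch (no bpftrace needed): watch the container's v1 devices cgroup
# from the host. Assumes a container named "test" and the legacy devices
# controller; the exact path depends on the cgroup driver in use.
CID=$(docker inspect --format '{{.Id}}' test)
CGDIR=$(find /sys/fs/cgroup/devices -type d -name "*${CID}*" | head -n 1)
watch -n 1 "cat ${CGDIR}/devices.list"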
Currently I'm not able to use cgroup v2. Any suggestions? Thanks very much.
2. Steps to reproduce the issue
- Run container
docker run --env NVIDIA_VISIBLE_DEVICES=all --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --name test -itd nvidia/cuda:11.2.1-devel-ubuntu20.04 bash
- Check nvidia-smi
docker exec -it test nvidia-smi
Thu Jun 30 12:33:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:07.0 Off | 0 |
| N/A 31C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- Check the device cgroup
docker exec -it test cat /sys/fs/cgroup/devices/devices.list
c 136:* rwm
c 5:2 rwm
c 5:1 rwm
c 5:0 rwm
c 1:9 rwm
c 1:8 rwm
c 1:7 rwm
c 1:5 rwm
c 1:3 rwm
b *:* m
c *:* m
c 10:200 rwm
c 195:0 rwm
c 195:255 rwm
c 237:0 rwm
c 237:1 rw
- Reload systemd
systemctl daemon-reload
- Check nvidia-smi again
docker exec -it test nvidia-smi
Failed to initialize NVML: Unknown Error
- Check the device list again
docker exec -it test cat /sys/fs/cgroup/devices/devices.list
b 9:* m
b 253:* m
b 254:* m
b 259:* m
c 1:* m
c 4:* m
c 5:* m
c 7:* m
c 10:* m
c 13:* m
c 29:* m
c 128:* m
c 136:* rwm
c 162:* m
c 180:* m
c 188:* m
c 189:* m
c 195:* m
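The steps above can be condensed into one rough script. This is only a sketch, assuming a single GPU at /dev/nvidia0, the image used above, and the cgroup v1 devices controller:
#!/bin/sh
# Sketch of the reproduction steps above in one script. Assumes one GPU at
# /dev/nvidia0 and the cgroup v1 devices controller; adjust for your setup.
docker run --env NVIDIA_VISIBLE_DEVICES=all \
    --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 \
    --name test -itd nvidia/cuda:11.2.1-devel-ubuntu20.04 bash

docker exec test nvidia-smi                                  # works at first
docker exec test cat /sys/fs/cgroup/devices/devices.list     # rw rules for 195:* present
grep -i nvidia /proc/devices                                 # map the major numbers to drivers

systemctl daemon-reload                                      # triggers the regression

docker exec test nvidia-smi                                  # now: Failed to initialize NVML
docker exec test cat /sys/fs/cgroup/devices/devices.list     # only "c 195:* m" remains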
3. Information to attach (optional if deemed irrelevant)
- [x] Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
- [ ] Kernel version from
uname -a
- [ ] Any relevant kernel output lines from
dmesg
- [ ] Driver information from
nvidia-smi -a
- [x] Docker version from
docker version
- [ ] NVIDIA packages version from
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
- [ ] NVIDIA container library version from
nvidia-container-cli -V
- [ ] NVIDIA container library logs (see troubleshooting)
- [ ] Docker command, image and tag used
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0630 12:21:27.164651 124864 nvc.c:372] initializing library context (version=1.7.0, build=f37bb387ad05f6e501069d99e4135a97289faf1f)
I0630 12:21:27.164727 124864 nvc.c:346] using root /
I0630 12:21:27.164736 124864 nvc.c:347] using ldcache /etc/ld.so.cache
I0630 12:21:27.164742 124864 nvc.c:348] using unprivileged user 65534:65534
I0630 12:21:27.164767 124864 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0630 12:21:27.164915 124864 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0630 12:21:27.166200 124865 nvc.c:274] loading kernel module nvidia
I0630 12:21:27.166344 124865 nvc.c:278] running mknod for /dev/nvidiactl
I0630 12:21:27.166383 124865 nvc.c:282] running mknod for /dev/nvidia0
I0630 12:21:27.166400 124865 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I0630 12:21:27.171675 124865 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0630 12:21:27.171818 124865 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0630 12:21:27.173614 124865 nvc.c:292] loading kernel module nvidia_uvm
I0630 12:21:27.173661 124865 nvc.c:296] running mknod for /dev/nvidia-uvm
I0630 12:21:27.173750 124865 nvc.c:301] loading kernel module nvidia_modeset
I0630 12:21:27.173783 124865 nvc.c:305] running mknod for /dev/nvidia-modeset
I0630 12:21:27.174053 124866 driver.c:101] starting driver service
I0630 12:21:27.177048 124864 nvc_info.c:758] requesting driver information with ''
I0630 12:21:27.178411 124864 nvc_info.c:171] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.460.91.03
I0630 12:21:27.178525 124864 nvc_info.c:171] selecting /usr/lib64/libnvoptix.so.460.91.03
I0630 12:21:27.178580 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-tls.so.460.91.03
I0630 12:21:27.178625 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-rtcore.so.460.91.03
I0630 12:21:27.178676 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.460.91.03
I0630 12:21:27.178740 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-opticalflow.so.460.91.03
I0630 12:21:27.178802 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-opencl.so.460.91.03
I0630 12:21:27.178851 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-ngx.so.460.91.03
I0630 12:21:27.178914 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-ml.so.460.91.03
I0630 12:21:27.178979 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-ifr.so.460.91.03
I0630 12:21:27.179043 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-glvkspirv.so.460.91.03
I0630 12:21:27.179088 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-glsi.so.460.91.03
I0630 12:21:27.179136 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-glcore.so.460.91.03
I0630 12:21:27.179177 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-fbc.so.460.91.03
I0630 12:21:27.179236 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-encode.so.460.91.03
I0630 12:21:27.179311 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-eglcore.so.460.91.03
I0630 12:21:27.179352 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-compiler.so.460.91.03
I0630 12:21:27.179394 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-cfg.so.460.91.03
I0630 12:21:27.179460 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-cbl.so.460.91.03
I0630 12:21:27.179504 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-allocator.so.460.91.03
I0630 12:21:27.179570 124864 nvc_info.c:171] selecting /usr/lib64/libnvcuvid.so.460.91.03
I0630 12:21:27.179715 124864 nvc_info.c:171] selecting /usr/lib64/libcuda.so.460.91.03
I0630 12:21:27.179797 124864 nvc_info.c:171] selecting /usr/lib64/libGLX_nvidia.so.460.91.03
I0630 12:21:27.179846 124864 nvc_info.c:171] selecting /usr/lib64/libGLESv2_nvidia.so.460.91.03
I0630 12:21:27.179900 124864 nvc_info.c:171] selecting /usr/lib64/libGLESv1_CM_nvidia.so.460.91.03
I0630 12:21:27.179950 124864 nvc_info.c:171] selecting /usr/lib64/libEGL_nvidia.so.460.91.03
I0630 12:21:27.180005 124864 nvc_info.c:171] selecting /usr/lib/vdpau/libvdpau_nvidia.so.460.91.03
I0630 12:21:27.180062 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-tls.so.460.91.03
I0630 12:21:27.180106 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-ptxjitcompiler.so.460.91.03
I0630 12:21:27.180157 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-opticalflow.so.460.91.03
I0630 12:21:27.180220 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-opencl.so.460.91.03
I0630 12:21:27.180273 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-ml.so.460.91.03
I0630 12:21:27.180335 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-ifr.so.460.91.03
I0630 12:21:27.180390 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-glvkspirv.so.460.91.03
I0630 12:21:27.180441 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-glsi.so.460.91.03
I0630 12:21:27.180480 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-glcore.so.460.91.03
I0630 12:21:27.180520 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-fbc.so.460.91.03
I0630 12:21:27.180574 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-encode.so.460.91.03
I0630 12:21:27.180626 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-eglcore.so.460.91.03
I0630 12:21:27.180664 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-compiler.so.460.91.03
I0630 12:21:27.180703 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-allocator.so.460.91.03
I0630 12:21:27.180757 124864 nvc_info.c:171] selecting /usr/lib/libnvcuvid.so.460.91.03
I0630 12:21:27.180806 124864 nvc_info.c:171] selecting /usr/lib/libcuda.so.460.91.03
I0630 12:21:27.180860 124864 nvc_info.c:171] selecting /usr/lib/libGLX_nvidia.so.460.91.03
I0630 12:21:27.180908 124864 nvc_info.c:171] selecting /usr/lib/libGLESv2_nvidia.so.460.91.03
I0630 12:21:27.180954 124864 nvc_info.c:171] selecting /usr/lib/libGLESv1_CM_nvidia.so.460.91.03
I0630 12:21:27.181003 124864 nvc_info.c:171] selecting /usr/lib/libEGL_nvidia.so.460.91.03
W0630 12:21:27.181031 124864 nvc_info.c:397] missing library libnvidia-nscq.so
W0630 12:21:27.181040 124864 nvc_info.c:397] missing library libnvidia-fatbinaryloader.so
W0630 12:21:27.181048 124864 nvc_info.c:401] missing compat32 library libnvidia-cfg.so
W0630 12:21:27.181056 124864 nvc_info.c:401] missing compat32 library libnvidia-nscq.so
W0630 12:21:27.181065 124864 nvc_info.c:401] missing compat32 library libnvidia-fatbinaryloader.so
W0630 12:21:27.181074 124864 nvc_info.c:401] missing compat32 library libnvidia-ngx.so
W0630 12:21:27.181081 124864 nvc_info.c:401] missing compat32 library libnvidia-rtcore.so
W0630 12:21:27.181089 124864 nvc_info.c:401] missing compat32 library libnvoptix.so
W0630 12:21:27.181095 124864 nvc_info.c:401] missing compat32 library libnvidia-cbl.so
I0630 12:21:27.181378 124864 nvc_info.c:297] selecting /usr/bin/nvidia-smi
I0630 12:21:27.181406 124864 nvc_info.c:297] selecting /usr/bin/nvidia-debugdump
I0630 12:21:27.181434 124864 nvc_info.c:297] selecting /usr/bin/nvidia-persistenced
I0630 12:21:27.181470 124864 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-control
I0630 12:21:27.181494 124864 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-server
W0630 12:21:27.181542 124864 nvc_info.c:423] missing binary nv-fabricmanager
W0630 12:21:27.181579 124864 nvc_info.c:347] missing firmware path /lib/firmware/nvidia/460.91.03
I0630 12:21:27.181610 124864 nvc_info.c:520] listing device /dev/nvidiactl
I0630 12:21:27.181622 124864 nvc_info.c:520] listing device /dev/nvidia-uvm
I0630 12:21:27.181632 124864 nvc_info.c:520] listing device /dev/nvidia-uvm-tools
I0630 12:21:27.181640 124864 nvc_info.c:520] listing device /dev/nvidia-modeset
W0630 12:21:27.181670 124864 nvc_info.c:347] missing ipc path /var/run/nvidia-persistenced/socket
W0630 12:21:27.181700 124864 nvc_info.c:347] missing ipc path /var/run/nvidia-fabricmanager/socket
W0630 12:21:27.181721 124864 nvc_info.c:347] missing ipc path /tmp/nvidia-mps
I0630 12:21:27.181729 124864 nvc_info.c:814] requesting device information with ''
I0630 12:21:27.187620 124864 nvc_info.c:705] listing device /dev/nvidia0 (GPU-b9f5bdeb-9f56-89e8-48f1-5abdbbcffeb5 at 00000000:00:07.0)
NVRM version: 460.91.03
CUDA version: 11.2
Device Index: 0
Device Minor: 0
Model: Tesla T4
Brand: Unknown
GPU UUID: GPU-b9f5bdeb-9f56-89e8-48f1-5abdbbcffeb5
Bus Location: 00000000:00:07.0
Architecture: 7.5
I0630 12:21:27.187673 124864 nvc.c:423] shutting down library context
I0630 12:21:27.188169 124866 driver.c:163] terminating driver service
I0630 12:21:27.188563 124864 driver.c:203] driver service terminated successfully
docker info
Client:
Debug Mode: false
Server:
Containers: 21
Running: 18
Paused: 0
Stopped: 3
Images: 11
Server Version: 19.03.15
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc
Default Runtime: nvidia
Init Binary: docker-init
Security Options:
seccomp
Profile: default
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.39GiB
Name: iZ2zeixjfsr9m8l4nbfzo9Z
ID: ETCL:FYKN:TKAU:I44W:M6FP:EIXX:RXIE:CEWG:GBST:HNAV:CIG6:RLNA
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://pqbap4ya.mirror.aliyuncs.com/
Live Restore Enabled: true
I also encountered this problem, which has been occurring for some time.
@klueska Could you help take a look? Thanks.
I found these logs during the systemd reload:
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:0: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:254: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:255: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/237:0: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/237:1: No such file or directory
From the major and minor numbers of these devices, I can see they are the /dev/nvidia* devices. If I manually create the symlinks as follows, the problem disappears:
cd /dev/char
ln -s ../nvidia0 195:0
ln -s ../nvidiactl 195:255
ln -s ../nvidia-uvm 237:0
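To avoid hard-coding the major/minor numbers, the same workaround can be generalized into a small loop. This is only a sketch (bash) that recreates the /dev/char links for whatever /dev/nvidia* character devices exist on the host:
# Sketch (bash): recreate the /dev/char/<major>:<minor> links for every
# /dev/nvidia* character device. stat prints major/minor in hex, so convert
# with base-16 arithmetic. /dev is recreated at boot, so this has to be run
# again after each reboot or driver reload.
cd /dev/char
for dev in /dev/nvidia*; do
    [ -c "$dev" ] || continue                  # skip non-char entries such as /dev/nvidia-caps
    maj=$((16#$(stat -c '%t' "$dev")))
    min=$((16#$(stat -c '%T' "$dev")))
    ln -sf "../$(basename "$dev")" "${maj}:${min}"
done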
Furthermore, I found that runc converts device paths from /dev/nvidia* to /dev/char/*; the logic is here: https://github.com/opencontainers/runc/blob/release-1.0/libcontainer/cgroups/systemd/common.go#L177.
So I wonder whether the NVIDIA toolkit should provide something like udev rules that trigger the kernel or systemd to create the /dev/char/* -> /dev/nvidia* symlinks?
@elezar
Alternatively, is there a configuration file where we can explicitly set DeviceAllow to /dev/nvidia* so that systemd recognizes it?
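In the meantime, the device rules systemd tracks for a container can at least be inspected, and re-extended at runtime, via its transient scope unit. This is only a sketch, assuming the systemd cgroup driver and a container named test; whether the added rules survive the next reload is a separate question:
# Sketch (systemd cgroup driver assumed; unit name is an example for a
# container called "test"): inspect, and re-extend at runtime, the device
# rules on the container's transient scope unit. The added rules are a
# runtime change and may not survive the next daemon-reload.
UNIT="docker-$(docker inspect --format '{{.Id}}' test).scope"
systemctl show "$UNIT" -p DeviceAllow
systemctl set-property --runtime "$UNIT" \
    "DeviceAllow=/dev/nvidiactl rw" \
    "DeviceAllow=/dev/nvidia-uvm rw" \
    "DeviceAllow=/dev/nvidia0 rw"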
Hey, I have been experiencing this issue for a long time. I worked around it by adding --privileged to the containers that need the graphics card; hope this helps.
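For completeness, that workaround amounts to running the container in privileged mode, which bypasses the device cgroup entirely (a sketch based on the repro command above; note the security implications):
# Privileged mode skips the device cgroup restrictions, so daemon-reload can
# no longer revoke GPU access; it also exposes every host device to the container.
docker run --privileged --env NVIDIA_VISIBLE_DEVICES=all \
    --name test -itd nvidia/cuda:11.2.1-devel-ubuntu20.04 bash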
Thanks for the response. But I'm not able to use privileged mode because I'm running this in Kubernetes, and it would let users see all the GPUs.
I fixed this issue in our environment (CentOS 8, systemd 239) with cgroup v2, for both Docker and containerd nodes. I can share the steps for how we fixed it by upgrading from cgroup v1 to cgroup v2, if that's an option for you.
I'm using cgroups v2 myself, so I would be interested in hearing what you did @gengwg
Sure, here are the detailed steps for how I fixed it using cgroup v2. Let me know if it works in your environment:
https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
In that case, whatever trigger you're seeing apparently isn't the same as mine, since all your instructions do is switch from cgroups v1 to v2. I'm already on cgroups v2 here on Debian 11 (bullseye), and I know that just having cgroups v2 enabled doesn't fix anything for me.
# systemctl --version
systemd 247 (247.3-7+deb11u1)
# dpkg -l | grep libnvidia-container
ii libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
# runc --version
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d
spec: 1.0.2-dev
go: go1.18.8
libseccomp: 2.5.1
# containerd --version
containerd containerd.io 1.6.10 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
# uname -a
Linux athena 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux
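For anyone comparing environments, a quick way to confirm which cgroup hierarchy a node is actually running before trying the v1-to-v2 migration above:
# "cgroup2fs" means the unified (v2) hierarchy is mounted;
# "tmpfs" means the legacy v1 layout.
stat -fc %T /sys/fs/cgroup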
Yeah, I do see some people still reporting it on v2, for example this.
Time-wise, this issue started to appear after we upgraded from CentOS 7 to CentOS 8. All components in the pipeline (kernel, systemd, containerd, NVIDIA runtime, etc.) got upgraded, so I'm not totally sure which component (or possibly multiple components) caused it. In our case, moving from v1 to v2 seems to have fixed it for about a week so far; I will keep monitoring in case it comes back.