nvidia-docker
Failed to initialize NVML: Unknown Error after calling `systemctl daemon-reload`
1. Issue or feature description
Failed to initialize NVML: Unknown Error
This does not occur when the NVIDIA container is first created, but it starts happening after calling systemctl daemon-reload.
It works fine with kernel 4.19.91 and systemd 219, but fails with kernel 5.10.23 and systemd 239.
I tried to monitor it with bpftrace.
During container startup, I can see these events:
systemd, 1, c 195:* m, cri-containerd-d254b91e9b76d5e6a1b787f4fc6004f0f6318ac910360ee0
runc, 2, c 195:* m, cri-containerd-d254b91e9b76d5e6a1b787f4fc6004f0f6318ac910360ee0
runc, 1, c 195:0 rw, cri-containerd-ef460b0cb5dc0f9103ef2ed266863333eded42abedaf7960
runc, 1, c 195:254 rw, cri-containerd-ef460b0cb5dc0f9103ef2ed266863333eded42abedaf7960
runc, 1, c 195:255 rw, cri-containerd-ef460b0cb5dc0f9103ef2ed266863333eded42abedaf7960
And the devices.list in the container looks like this:
cat /sys/fs/cgroup/devices/devices.list
...
c 195:254 rw
c 195:0 rw
But after running systemctl daemon-reload, I only see this event:
systemd, 1, c 195:* m, cri-containerd-d254b91e9b76d5e6a1b787f4fc6004f0f6318ac910360ee0
And the devices.list in the container becomes:
cat /sys/fs/cgroup/devices/devices.list
...
c 195:* m
The GPU devices are no longer allowed rw access.
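The same before/after observation can also be made from the host without bpftrace; a rough sketch, assuming a container named test and the legacy (v1) devices controller:
# Rough sketch (no bpftrace needed): watch the container's v1 devices cgroup
# from the host. Assumes a container named "test" and the legacy devices
# controller; the exact path depends on the cgroup driver in use.
CID=$(docker inspect --format '{{.Id}}' test)
CGDIR=$(find /sys/fs/cgroup/devices -type d -name "*${CID}*" | head -n 1)
watch -n 1 "cat ${CGDIR}/devices.list"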
Currently I'm not able to use cgroup v2. Any suggestions? Thanks very much.
2. Steps to reproduce the issue
- Run container
docker run --env NVIDIA_VISIBLE_DEVICES=all --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --name test -itd nvidia/cuda:11.2.1-devel-ubuntu20.04 bash
- Check nvidia-smi
docker exec -it test nvidia-smi
Thu Jun 30 12:33:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:07.0 Off | 0 |
| N/A 31C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- Check the device cgroup
docker exec -it test cat /sys/fs/cgroup/devices/devices.list
c 136:* rwm
c 5:2 rwm
c 5:1 rwm
c 5:0 rwm
c 1:9 rwm
c 1:8 rwm
c 1:7 rwm
c 1:5 rwm
c 1:3 rwm
b *:* m
c *:* m
c 10:200 rwm
c 195:0 rwm
c 195:255 rwm
c 237:0 rwm
c 237:1 rw
- Reload systemd
systemctl daemon-reload
- Check nvidia-smi again
docker exec -it test nvidia-smi
Failed to initialize NVML: Unknown Error
- Check the device list again
docker exec -it test cat /sys/fs/cgroup/devices/devices.list
b 9:* m
b 253:* m
b 254:* m
b 259:* m
c 1:* m
c 4:* m
c 5:* m
c 7:* m
c 10:* m
c 13:* m
c 29:* m
c 128:* m
c 136:* rwm
c 162:* m
c 180:* m
c 188:* m
c 189:* m
c 195:* m
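The steps above can be condensed into one rough script. This is only a sketch, assuming a single GPU at /dev/nvidia0, the image used above, and the cgroup v1 devices controller:
#!/bin/sh
# Sketch of the reproduction steps above in one script. Assumes one GPU at
# /dev/nvidia0 and the cgroup v1 devices controller; adjust for your setup.
docker run --env NVIDIA_VISIBLE_DEVICES=all \
    --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 \
    --name test -itd nvidia/cuda:11.2.1-devel-ubuntu20.04 bash

docker exec test nvidia-smi                                  # works at first
docker exec test cat /sys/fs/cgroup/devices/devices.list     # rw rules for 195:* present
grep -i nvidia /proc/devices                                 # map the major numbers to drivers

systemctl daemon-reload                                      # triggers the regression

docker exec test nvidia-smi                                  # now: Failed to initialize NVML
docker exec test cat /sys/fs/cgroup/devices/devices.list     # only "c 195:* m" remains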
3. Information to attach (optional if deemed irrelevant)
- [x] Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
- [ ] Kernel version from
uname -a
- [ ] Any relevant kernel output lines from
dmesg
- [ ] Driver information from
nvidia-smi -a
- [x] Docker version from
docker version
- [ ] NVIDIA packages version from
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
- [ ] NVIDIA container library version from
nvidia-container-cli -V
- [ ] NVIDIA container library logs (see troubleshooting)
- [ ] Docker command, image and tag used
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0630 12:21:27.164651 124864 nvc.c:372] initializing library context (version=1.7.0, build=f37bb387ad05f6e501069d99e4135a97289faf1f)
I0630 12:21:27.164727 124864 nvc.c:346] using root /
I0630 12:21:27.164736 124864 nvc.c:347] using ldcache /etc/ld.so.cache
I0630 12:21:27.164742 124864 nvc.c:348] using unprivileged user 65534:65534
I0630 12:21:27.164767 124864 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0630 12:21:27.164915 124864 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0630 12:21:27.166200 124865 nvc.c:274] loading kernel module nvidia
I0630 12:21:27.166344 124865 nvc.c:278] running mknod for /dev/nvidiactl
I0630 12:21:27.166383 124865 nvc.c:282] running mknod for /dev/nvidia0
I0630 12:21:27.166400 124865 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I0630 12:21:27.171675 124865 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0630 12:21:27.171818 124865 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0630 12:21:27.173614 124865 nvc.c:292] loading kernel module nvidia_uvm
I0630 12:21:27.173661 124865 nvc.c:296] running mknod for /dev/nvidia-uvm
I0630 12:21:27.173750 124865 nvc.c:301] loading kernel module nvidia_modeset
I0630 12:21:27.173783 124865 nvc.c:305] running mknod for /dev/nvidia-modeset
I0630 12:21:27.174053 124866 driver.c:101] starting driver service
I0630 12:21:27.177048 124864 nvc_info.c:758] requesting driver information with ''
I0630 12:21:27.178411 124864 nvc_info.c:171] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.460.91.03
I0630 12:21:27.178525 124864 nvc_info.c:171] selecting /usr/lib64/libnvoptix.so.460.91.03
I0630 12:21:27.178580 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-tls.so.460.91.03
I0630 12:21:27.178625 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-rtcore.so.460.91.03
I0630 12:21:27.178676 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.460.91.03
I0630 12:21:27.178740 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-opticalflow.so.460.91.03
I0630 12:21:27.178802 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-opencl.so.460.91.03
I0630 12:21:27.178851 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-ngx.so.460.91.03
I0630 12:21:27.178914 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-ml.so.460.91.03
I0630 12:21:27.178979 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-ifr.so.460.91.03
I0630 12:21:27.179043 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-glvkspirv.so.460.91.03
I0630 12:21:27.179088 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-glsi.so.460.91.03
I0630 12:21:27.179136 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-glcore.so.460.91.03
I0630 12:21:27.179177 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-fbc.so.460.91.03
I0630 12:21:27.179236 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-encode.so.460.91.03
I0630 12:21:27.179311 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-eglcore.so.460.91.03
I0630 12:21:27.179352 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-compiler.so.460.91.03
I0630 12:21:27.179394 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-cfg.so.460.91.03
I0630 12:21:27.179460 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-cbl.so.460.91.03
I0630 12:21:27.179504 124864 nvc_info.c:171] selecting /usr/lib64/libnvidia-allocator.so.460.91.03
I0630 12:21:27.179570 124864 nvc_info.c:171] selecting /usr/lib64/libnvcuvid.so.460.91.03
I0630 12:21:27.179715 124864 nvc_info.c:171] selecting /usr/lib64/libcuda.so.460.91.03
I0630 12:21:27.179797 124864 nvc_info.c:171] selecting /usr/lib64/libGLX_nvidia.so.460.91.03
I0630 12:21:27.179846 124864 nvc_info.c:171] selecting /usr/lib64/libGLESv2_nvidia.so.460.91.03
I0630 12:21:27.179900 124864 nvc_info.c:171] selecting /usr/lib64/libGLESv1_CM_nvidia.so.460.91.03
I0630 12:21:27.179950 124864 nvc_info.c:171] selecting /usr/lib64/libEGL_nvidia.so.460.91.03
I0630 12:21:27.180005 124864 nvc_info.c:171] selecting /usr/lib/vdpau/libvdpau_nvidia.so.460.91.03
I0630 12:21:27.180062 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-tls.so.460.91.03
I0630 12:21:27.180106 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-ptxjitcompiler.so.460.91.03
I0630 12:21:27.180157 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-opticalflow.so.460.91.03
I0630 12:21:27.180220 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-opencl.so.460.91.03
I0630 12:21:27.180273 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-ml.so.460.91.03
I0630 12:21:27.180335 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-ifr.so.460.91.03
I0630 12:21:27.180390 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-glvkspirv.so.460.91.03
I0630 12:21:27.180441 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-glsi.so.460.91.03
I0630 12:21:27.180480 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-glcore.so.460.91.03
I0630 12:21:27.180520 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-fbc.so.460.91.03
I0630 12:21:27.180574 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-encode.so.460.91.03
I0630 12:21:27.180626 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-eglcore.so.460.91.03
I0630 12:21:27.180664 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-compiler.so.460.91.03
I0630 12:21:27.180703 124864 nvc_info.c:171] selecting /usr/lib/libnvidia-allocator.so.460.91.03
I0630 12:21:27.180757 124864 nvc_info.c:171] selecting /usr/lib/libnvcuvid.so.460.91.03
I0630 12:21:27.180806 124864 nvc_info.c:171] selecting /usr/lib/libcuda.so.460.91.03
I0630 12:21:27.180860 124864 nvc_info.c:171] selecting /usr/lib/libGLX_nvidia.so.460.91.03
I0630 12:21:27.180908 124864 nvc_info.c:171] selecting /usr/lib/libGLESv2_nvidia.so.460.91.03
I0630 12:21:27.180954 124864 nvc_info.c:171] selecting /usr/lib/libGLESv1_CM_nvidia.so.460.91.03
I0630 12:21:27.181003 124864 nvc_info.c:171] selecting /usr/lib/libEGL_nvidia.so.460.91.03
W0630 12:21:27.181031 124864 nvc_info.c:397] missing library libnvidia-nscq.so
W0630 12:21:27.181040 124864 nvc_info.c:397] missing library libnvidia-fatbinaryloader.so
W0630 12:21:27.181048 124864 nvc_info.c:401] missing compat32 library libnvidia-cfg.so
W0630 12:21:27.181056 124864 nvc_info.c:401] missing compat32 library libnvidia-nscq.so
W0630 12:21:27.181065 124864 nvc_info.c:401] missing compat32 library libnvidia-fatbinaryloader.so
W0630 12:21:27.181074 124864 nvc_info.c:401] missing compat32 library libnvidia-ngx.so
W0630 12:21:27.181081 124864 nvc_info.c:401] missing compat32 library libnvidia-rtcore.so
W0630 12:21:27.181089 124864 nvc_info.c:401] missing compat32 library libnvoptix.so
W0630 12:21:27.181095 124864 nvc_info.c:401] missing compat32 library libnvidia-cbl.so
I0630 12:21:27.181378 124864 nvc_info.c:297] selecting /usr/bin/nvidia-smi
I0630 12:21:27.181406 124864 nvc_info.c:297] selecting /usr/bin/nvidia-debugdump
I0630 12:21:27.181434 124864 nvc_info.c:297] selecting /usr/bin/nvidia-persistenced
I0630 12:21:27.181470 124864 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-control
I0630 12:21:27.181494 124864 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-server
W0630 12:21:27.181542 124864 nvc_info.c:423] missing binary nv-fabricmanager
W0630 12:21:27.181579 124864 nvc_info.c:347] missing firmware path /lib/firmware/nvidia/460.91.03
I0630 12:21:27.181610 124864 nvc_info.c:520] listing device /dev/nvidiactl
I0630 12:21:27.181622 124864 nvc_info.c:520] listing device /dev/nvidia-uvm
I0630 12:21:27.181632 124864 nvc_info.c:520] listing device /dev/nvidia-uvm-tools
I0630 12:21:27.181640 124864 nvc_info.c:520] listing device /dev/nvidia-modeset
W0630 12:21:27.181670 124864 nvc_info.c:347] missing ipc path /var/run/nvidia-persistenced/socket
W0630 12:21:27.181700 124864 nvc_info.c:347] missing ipc path /var/run/nvidia-fabricmanager/socket
W0630 12:21:27.181721 124864 nvc_info.c:347] missing ipc path /tmp/nvidia-mps
I0630 12:21:27.181729 124864 nvc_info.c:814] requesting device information with ''
I0630 12:21:27.187620 124864 nvc_info.c:705] listing device /dev/nvidia0 (GPU-b9f5bdeb-9f56-89e8-48f1-5abdbbcffeb5 at 00000000:00:07.0)
NVRM version: 460.91.03
CUDA version: 11.2
Device Index: 0
Device Minor: 0
Model: Tesla T4
Brand: Unknown
GPU UUID: GPU-b9f5bdeb-9f56-89e8-48f1-5abdbbcffeb5
Bus Location: 00000000:00:07.0
Architecture: 7.5
I0630 12:21:27.187673 124864 nvc.c:423] shutting down library context
I0630 12:21:27.188169 124866 driver.c:163] terminating driver service
I0630 12:21:27.188563 124864 driver.c:203] driver service terminated successfully
docker info
Client:
Debug Mode: false
Server:
Containers: 21
Running: 18
Paused: 0
Stopped: 3
Images: 11
Server Version: 19.03.15
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc
Default Runtime: nvidia
Init Binary: docker-init
Security Options:
seccomp
Profile: default
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.39GiB
Name: iZ2zeixjfsr9m8l4nbfzo9Z
ID: ETCL:FYKN:TKAU:I44W:M6FP:EIXX:RXIE:CEWG:GBST:HNAV:CIG6:RLNA
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://pqbap4ya.mirror.aliyuncs.com/
Live Restore Enabled: true
I also encountered this problem, which has been occurring for some time.
@klueska Could you help take a look? Thanks.
I found these logs during the systemd reload:
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:0: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:254: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:255: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/237:0: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/237:1: No such file or directory
From the major and minor numbers of these devices, I can see they are the /dev/nvidia* devices. If I manually create the symlinks as follows, the problem disappears:
cd /dev/char
ln -s ../nvidia0 195:0
ln -s ../nvidiactl 195:255
ln -s ../nvidia-uvm 237:0
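To avoid hard-coding the major/minor numbers, the same workaround can be generalized into a small loop. This is only a sketch (bash) that recreates the /dev/char links for whatever /dev/nvidia* character devices exist on the host:
# Sketch (bash): recreate the /dev/char/<major>:<minor> links for every
# /dev/nvidia* character device. stat prints major/minor in hex, so convert
# with base-16 arithmetic. /dev is recreated at boot, so this has to be run
# again after each reboot or driver reload.
cd /dev/char
for dev in /dev/nvidia*; do
    [ -c "$dev" ] || continue                  # skip non-char entries such as /dev/nvidia-caps
    maj=$((16#$(stat -c '%t' "$dev")))
    min=$((16#$(stat -c '%T' "$dev")))
    ln -sf "../$(basename "$dev")" "${maj}:${min}"
done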
Furthermore, I found that runc converts device paths from /dev/nvidia* to /dev/char/*; the logic is here: https://github.com/opencontainers/runc/blob/release-1.0/libcontainer/cgroups/systemd/common.go#L177.
So I wonder whether the NVIDIA toolkit should provide something like udev rules that trigger the kernel or systemd to create the /dev/char/* -> /dev/nvidia* symlinks?
@elezar
Alternatively, is there a configuration file where we can explicitly set DeviceAllow to /dev/nvidia* so that systemd recognizes it?
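In the meantime, the device rules systemd tracks for a container can at least be inspected, and re-extended at runtime, via its transient scope unit. This is only a sketch, assuming the systemd cgroup driver and a container named test; whether the added rules survive the next reload is a separate question:
# Sketch (systemd cgroup driver assumed; unit name is an example for a
# container called "test"): inspect, and re-extend at runtime, the device
# rules on the container's transient scope unit. The added rules are a
# runtime change and may not survive the next daemon-reload.
UNIT="docker-$(docker inspect --format '{{.Id}}' test).scope"
systemctl show "$UNIT" -p DeviceAllow
systemctl set-property --runtime "$UNIT" \
    "DeviceAllow=/dev/nvidiactl rw" \
    "DeviceAllow=/dev/nvidia-uvm rw" \
    "DeviceAllow=/dev/nvidia0 rw"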
Hey, I have been experiencing this issue for a long time. I worked around it by adding --privileged to the containers that need the graphics card; hope this helps.
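For completeness, that workaround amounts to running the container in privileged mode, which bypasses the device cgroup entirely (a sketch based on the repro command above; note the security implications):
# Privileged mode skips the device cgroup restrictions, so daemon-reload can
# no longer revoke GPU access; it also exposes every host device to the container.
docker run --privileged --env NVIDIA_VISIBLE_DEVICES=all \
    --name test -itd nvidia/cuda:11.2.1-devel-ubuntu20.04 bash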
Thanks for the response. But I'm not able to use privileged mode because I'm running this in Kubernetes, and it would let users see all the GPUs.
I fixed this issue in our environment (CentOS 8, systemd 239) with cgroup v2, for both Docker and containerd nodes. I can share the steps for how we fixed it by upgrading from cgroup v1 to cgroup v2, if that's an option for you.
I'm using cgroups v2 myself, so I would be interested in hearing what you did @gengwg
Sure, here are the detailed steps for how I fixed it using cgroup v2. Let me know if it works in your environment:
https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
In that case, whatever trigger you're seeing apparently isn't the same as mine, since all your instructions do is switch from cgroups v1 to v2. I'm already on cgroups v2 here on Debian 11 (bullseye), and I know that just having cgroups v2 enabled doesn't fix anything for me.
# systemctl --version
systemd 247 (247.3-7+deb11u1)
# dpkg -l | grep libnvidia-container
ii libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
# runc --version
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d
spec: 1.0.2-dev
go: go1.18.8
libseccomp: 2.5.1
# containerd --version
containerd containerd.io 1.6.10 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
# uname -a
Linux athena 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux
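For anyone comparing environments, a quick way to confirm which cgroup hierarchy a node is actually running before trying the v1-to-v2 migration above:
# "cgroup2fs" means the unified (v2) hierarchy is mounted;
# "tmpfs" means the legacy v1 layout.
stat -fc %T /sys/fs/cgroup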
Yeah, I do see some people still reporting it on v2, for example this.
Time-wise, this issue started to appear after we upgraded from CentOS 7 to CentOS 8. All components in the pipeline (kernel, systemd, containerd, NVIDIA runtime, etc.) got upgraded, so I'm not totally sure which component (or possibly multiple components) caused it. In our case, moving from v1 to v2 seems to have fixed it for about a week so far; I will keep monitoring in case it comes back.