nvidia-docker
"Failed to initialize NVML: Unknown Error" after random amount of time
1. Issue or feature description
After a random amount of time (it can be hours or days) the GPUs become unavailable inside all running containers, and nvidia-smi
returns "Failed to initialize NVML: Unknown Error".
Restarting all the containers fixes the issue and the GPUs become available again.
Outside the containers the GPUs keep working correctly.
I searched the open and closed issues but could not find a solution.
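While the root cause is open, the restart-the-containers workaround described above can be automated. This is only a sketch, not a fix: the error match, the container selection, and the polling interval are my assumptions, not part of the report.

```shell
#!/usr/bin/env bash
# nvml-watchdog.sh -- stop-gap sketch (not a fix): restart any running
# container in which NVML has died. The filter and interval are assumptions.
set -u

nvml_broken() {
  # True when the nvidia-smi output piped in shows the failure from this issue.
  grep -q "Failed to initialize NVML"
}

check_and_restart() {
  for c in $(docker ps --format '{{.Names}}'); do
    if docker exec "$c" nvidia-smi 2>&1 | nvml_broken; then
      echo "NVML dead in $c, restarting"
      docker restart "$c" >/dev/null
    fi
  done
}

# To run as a watchdog, poll periodically (interval is arbitrary):
#   while true; do check_and_restart; sleep 300; done
```

Restarting a container is disruptive to whatever runs inside it, so this is only worth it where losing the GPU is worse than a restart.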
2. Steps to reproduce the issue
All the containers are run with docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
3. Information to attach
- [X] Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0831 10:36:45.129762 2174149 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0831 10:36:45.129878 2174149 nvc.c:350] using root /
I0831 10:36:45.129892 2174149 nvc.c:351] using ldcache /etc/ld.so.cache
I0831 10:36:45.129906 2174149 nvc.c:352] using unprivileged user 1000:1000
I0831 10:36:45.129960 2174149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0831 10:36:45.130411 2174149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0831 10:36:45.132458 2174150 nvc.c:273] failed to set inheritable capabilities
W0831 10:36:45.132555 2174150 nvc.c:274] skipping kernel modules load due to failure
I0831 10:36:45.133242 2174151 rpc.c:71] starting driver rpc service
I0831 10:36:45.141625 2174152 rpc.c:71] starting nvcgo rpc service
I0831 10:36:45.144941 2174149 nvc_info.c:766] requesting driver information with ''
I0831 10:36:45.146226 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.515.48.07
I0831 10:36:45.146379 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.48.07
I0831 10:36:45.146563 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.48.07
I0831 10:36:45.146792 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
I0831 10:36:45.146986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.48.07
I0831 10:36:45.147178 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07
I0831 10:36:45.147375 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.48.07
I0831 10:36:45.147400 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07
I0831 10:36:45.147598 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.48.07
I0831 10:36:45.147777 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.48.07
I0831 10:36:45.147986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.48.07
I0831 10:36:45.148258 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.48.07
I0831 10:36:45.148506 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.48.07
I0831 10:36:45.148699 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.48.07
I0831 10:36:45.148915 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07
I0831 10:36:45.148942 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07
I0831 10:36:45.149219 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07
I0831 10:36:45.149467 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.515.48.07
I0831 10:36:45.149591 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07
I0831 10:36:45.149814 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.48.07
I0831 10:36:45.149996 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.48.07
I0831 10:36:45.150224 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
I0831 10:36:45.150437 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.48.07
I0831 10:36:45.150772 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.515.48.07
I0831 10:36:45.150978 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
I0831 10:36:45.151147 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.515.48.07
I0831 10:36:45.151335 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.515.48.07
I0831 10:36:45.151592 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.515.48.07
I0831 10:36:45.151786 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.515.48.07
I0831 10:36:45.151970 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.515.48.07
I0831 10:36:45.152225 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.515.48.07
I0831 10:36:45.152480 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.515.48.07
I0831 10:36:45.152791 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.515.48.07
I0831 10:36:45.152999 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.515.48.07
I0831 10:36:45.153254 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.515.48.07
I0831 10:36:45.153580 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.515.48.07
I0831 10:36:45.153853 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.515.48.07
I0831 10:36:45.154063 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.515.48.07
I0831 10:36:45.154259 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.515.48.07
I0831 10:36:45.154473 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
I0831 10:36:45.154696 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.515.48.07
W0831 10:36:45.154723 2174149 nvc_info.c:399] missing library libnvidia-nscq.so
W0831 10:36:45.154726 2174149 nvc_info.c:399] missing library libcudadebugger.so
W0831 10:36:45.154729 2174149 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0831 10:36:45.154731 2174149 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0831 10:36:45.154733 2174149 nvc_info.c:399] missing library libvdpau_nvidia.so
W0831 10:36:45.154735 2174149 nvc_info.c:399] missing library libnvidia-ifr.so
W0831 10:36:45.154737 2174149 nvc_info.c:399] missing library libnvidia-cbl.so
W0831 10:36:45.154739 2174149 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0831 10:36:45.154741 2174149 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0831 10:36:45.154743 2174149 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0831 10:36:45.154746 2174149 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0831 10:36:45.154748 2174149 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0831 10:36:45.154750 2174149 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0831 10:36:45.154752 2174149 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0831 10:36:45.154754 2174149 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0831 10:36:45.154756 2174149 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0831 10:36:45.154758 2174149 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0831 10:36:45.154760 2174149 nvc_info.c:403] missing compat32 library libnvoptix.so
W0831 10:36:45.154762 2174149 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0831 10:36:45.154919 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0831 10:36:45.154945 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0831 10:36:45.154954 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0831 10:36:45.154970 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0831 10:36:45.154980 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0831 10:36:45.155027 2174149 nvc_info.c:425] missing binary nv-fabricmanager
I0831 10:36:45.155044 2174149 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/515.48.07/gsp.bin
I0831 10:36:45.155058 2174149 nvc_info.c:529] listing device /dev/nvidiactl
I0831 10:36:45.155061 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm
I0831 10:36:45.155063 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0831 10:36:45.155065 2174149 nvc_info.c:529] listing device /dev/nvidia-modeset
I0831 10:36:45.155080 2174149 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0831 10:36:45.155092 2174149 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0831 10:36:45.155100 2174149 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0831 10:36:45.155102 2174149 nvc_info.c:822] requesting device information with ''
I0831 10:36:45.161039 2174149 nvc_info.c:713] listing device /dev/nvidia0 (GPU-13fd0930-06c3-5975-8720-72c72ee7a823 at 00000000:01:00.0)
I0831 10:36:45.166471 2174149 nvc_info.c:713] listing device /dev/nvidia1 (GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 at 00000000:02:00.0)
NVRM version: 515.48.07
CUDA version: 11.7
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 2080 Ti
Brand: GeForce
GPU UUID: GPU-13fd0930-06c3-5975-8720-72c72ee7a823
Bus Location: 00000000:01:00.0
Architecture: 7.5
Device Index: 1
Device Minor: 1
Model: NVIDIA GeForce RTX 2080 Ti
Brand: GeForce
GPU UUID: GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
Bus Location: 00000000:02:00.0
Architecture: 7.5
I0831 10:36:45.166493 2174149 nvc.c:434] shutting down library context
I0831 10:36:45.166540 2174152 rpc.c:95] terminating nvcgo rpc service
I0831 10:36:45.166751 2174149 rpc.c:135] nvcgo rpc service terminated successfully
I0831 10:36:45.167790 2174151 rpc.c:95] terminating driver rpc service
I0831 10:36:45.167907 2174149 rpc.c:135] driver rpc service terminated successfully
- [X] Kernel version from
uname -a
Linux wds-co-ml 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- [X] Driver information from
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Wed Aug 31 12:42:55 2022
Driver Version : 515.48.07
CUDA Version : 11.7
Attached GPUs : 2
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 2080 Ti
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-13fd0930-06c3-5975-8720-72c72ee7a823
Minor Number : 0
VBIOS Version : 90.02.0B.00.C7
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x150319DA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11264 MiB
Reserved : 244 MiB
Used : 1 MiB
Free : 11018 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 3 MiB
Free : 253 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 20.87 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
GPU 00000000:02:00.0
Product Name : NVIDIA GeForce RTX 2080 Ti
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
Minor Number : 1
VBIOS Version : 90.02.17.00.58
MultiGPU Board : No
Board ID : 0x200
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x150319DA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 35 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11264 MiB
Reserved : 244 MiB
Used : 1 MiB
Free : 11018 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 27 MiB
Free : 229 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 6.66 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
- [X] Docker version from
docker version
Client: Docker Engine - Community
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:02:46 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:00:51 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.6
GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
runc:
Version: 1.1.2
GitCommit: v1.1.2-0-ga916309
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- [X] NVIDIA packages version from
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
ii libnvidia-cfg1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-515 515.48.07-0ubuntu0.22.04.2 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA libcompute package
ii libnvidia-compute-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA libcompute package
ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA Video Decoding runtime libraries
ii libnvidia-egl-wayland1:amd64 1:1.1.9-1.1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-encode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVENC Video Encoding runtime library
ii libnvidia-extra-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii linux-modules-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43
ii linux-modules-nvidia-515-generic-hwe-22.04 5.15.0-43.46 amd64 Extra drivers for nvidia-515 for the generic-hwe-22.04 flavour
ii linux-objects-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43 (objects)
ii linux-signatures-nvidia-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel signatures for nvidia modules for version 5.15.0-43-generic
ii nvidia-compute-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA compute utilities
ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper
ii nvidia-driver-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-515 515.48.07-0ubuntu0.22.04.2 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA's Prime
ii nvidia-settings 510.47.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver support binaries
ii xserver-xorg-video-nvidia-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary Xorg driver
- [X] NVIDIA container library version from
nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [X] Docker command, image and tag used
docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
The nvidia-smi
output shows persistence mode as disabled. Does the behaviour still occur when it is enabled?
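For reference, "enabling it" would look like the following; this is only a sketch (it needs root, the setting lasts until reboot, and whether it helps with this bug is exactly the open question):

```shell
#!/usr/bin/env bash
# Sketch: enable persistence mode so the driver keeps the GPUs initialized
# even when no client is attached. Requires root; lasts until reboot.
set -euo pipefail

enable_persistence() {
  # -pm 1 turns persistence mode on for all GPUs.
  nvidia-smi -pm 1
}

show_persistence() {
  # Query the current setting without changing anything.
  nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
}
```

On newer drivers the nvidia-persistenced daemon is the preferred way to keep persistence mode on across reboots.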
Hey, I have the same problem.
2. Steps to reproduce the issue
docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
root@098b49afe624:/# nvidia-smi
Fri Sep 2 21:54:31 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02 Driver Version: 510.68.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
This works until you run systemctl daemon-reload
, either manually or automatically through the OS (which I assume happens, since it eventually fails on its own).
(on host):
systemctl daemon-reload
(inside same running container):
root@098b49afe624:/# nvidia-smi
Failed to initialize NVML: Unknown Error
Running the container again works fine until the next systemctl daemon-reload
.
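The reproduction above can be condensed into one script; this is only a sketch under the same assumptions (a host that exhibits the bug, and a hypothetical container name):

```shell
#!/usr/bin/env bash
# Sketch of the reproduction: start a GPU container, confirm NVML works,
# reload systemd units on the host, then re-check NVML in the SAME container.
# The container name is an assumption; the image is the one from this report.
set -euo pipefail

IMAGE=nvidia/cuda:11.4.2-base-ubuntu18.04
NAME=nvml-repro

repro() {
  docker run -d --rm --gpus all --name "$NAME" "$IMAGE" sleep infinity
  docker exec "$NAME" nvidia-smi -L        # should list the GPUs
  sudo systemctl daemon-reload             # the trigger
  docker exec "$NAME" nvidia-smi -L || echo "NVML lost after daemon-reload"
  docker rm -f "$NAME" >/dev/null
}
```

On an affected host the second nvidia-smi call is the one expected to fail with "Failed to initialize NVML: Unknown Error".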
3. Information to attach (optional if deemed irrelevant)
- [x] Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
I0902 21:40:53.603015 2836338 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0902 21:40:53.603083 2836338 nvc.c:350] using root /
I0902 21:40:53.603093 2836338 nvc.c:351] using ldcache /etc/ld.so.cache
I0902 21:40:53.603100 2836338 nvc.c:352] using unprivileged user 1000:1000
I0902 21:40:53.603133 2836338 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0902 21:40:53.603287 2836338 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0902 21:40:53.607634 2836339 nvc.c:273] failed to set inheritable capabilities
W0902 21:40:53.607692 2836339 nvc.c:274] skipping kernel modules load due to failure
I0902 21:40:53.608141 2836340 rpc.c:71] starting driver rpc service
I0902 21:40:53.620107 2836341 rpc.c:71] starting nvcgo rpc service
I0902 21:40:53.621514 2836338 nvc_info.c:766] requesting driver information with ''
I0902 21:40:53.623204 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02
I0902 21:40:53.623384 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02
I0902 21:40:53.623470 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02
I0902 21:40:53.623534 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02
I0902 21:40:53.623599 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 21:40:53.623686 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02
I0902 21:40:53.623774 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 21:40:53.623838 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02
I0902 21:40:53.623900 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 21:40:53.623987 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02
I0902 21:40:53.624046 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02
I0902 21:40:53.624105 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02
I0902 21:40:53.624167 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02
I0902 21:40:53.624270 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02
I0902 21:40:53.624362 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02
I0902 21:40:53.624430 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 21:40:53.624507 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02
I0902 21:40:53.624590 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 21:40:53.624684 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02
I0902 21:40:53.624959 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 21:40:53.625088 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02
I0902 21:40:53.625151 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02
I0902 21:40:53.625213 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02
I0902 21:40:53.625277 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02
W0902 21:40:53.625310 2836338 nvc_info.c:399] missing library libnvidia-nscq.so
W0902 21:40:53.625322 2836338 nvc_info.c:399] missing library libcudadebugger.so
W0902 21:40:53.625330 2836338 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0902 21:40:53.625340 2836338 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0902 21:40:53.625349 2836338 nvc_info.c:399] missing library libnvidia-ifr.so
W0902 21:40:53.625359 2836338 nvc_info.c:399] missing library libnvidia-cbl.so
W0902 21:40:53.625368 2836338 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W0902 21:40:53.625376 2836338 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0902 21:40:53.625386 2836338 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0902 21:40:53.625394 2836338 nvc_info.c:403] missing compat32 library libcuda.so
W0902 21:40:53.625404 2836338 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0902 21:40:53.625413 2836338 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0902 21:40:53.625422 2836338 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W0902 21:40:53.625432 2836338 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0902 21:40:53.625441 2836338 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0902 21:40:53.625450 2836338 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W0902 21:40:53.625459 2836338 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0902 21:40:53.625468 2836338 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0902 21:40:53.625477 2836338 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0902 21:40:53.625486 2836338 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0902 21:40:53.625495 2836338 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0902 21:40:53.625505 2836338 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0902 21:40:53.625514 2836338 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0902 21:40:53.625523 2836338 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W0902 21:40:53.625532 2836338 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W0902 21:40:53.625541 2836338 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W0902 21:40:53.625551 2836338 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W0902 21:40:53.625561 2836338 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0902 21:40:53.625570 2836338 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0902 21:40:53.625579 2836338 nvc_info.c:403] missing compat32 library libnvoptix.so
W0902 21:40:53.625588 2836338 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W0902 21:40:53.625598 2836338 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0902 21:40:53.625607 2836338 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0902 21:40:53.625616 2836338 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W0902 21:40:53.625625 2836338 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0902 21:40:53.625631 2836338 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0902 21:40:53.626022 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0902 21:40:53.626055 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0902 21:40:53.626088 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0902 21:40:53.626139 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0902 21:40:53.626172 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0902 21:40:53.626281 2836338 nvc_info.c:425] missing binary nv-fabricmanager
I0902 21:40:53.626333 2836338 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin
I0902 21:40:53.626375 2836338 nvc_info.c:529] listing device /dev/nvidiactl
I0902 21:40:53.626385 2836338 nvc_info.c:529] listing device /dev/nvidia-uvm
I0902 21:40:53.626395 2836338 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0902 21:40:53.626404 2836338 nvc_info.c:529] listing device /dev/nvidia-modeset
W0902 21:40:53.626447 2836338 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket
W0902 21:40:53.626483 2836338 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0902 21:40:53.626510 2836338 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0902 21:40:53.626521 2836338 nvc_info.c:822] requesting device information with ''
I0902 21:40:53.633742 2836338 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)
I0902 21:40:53.640730 2836338 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)
I0902 21:40:53.647954 2836338 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0)
I0902 21:40:53.655371 2836338 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0)
I0902 21:40:53.663009 2836338 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)
I0902 21:40:53.670891 2836338 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)
I0902 21:40:53.679015 2836338 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)
I0902 21:40:53.687078 2836338 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)
NVRM version: 510.68.02
CUDA version: 11.6
Device Index: 0
Device Minor: 0
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-9c416c82-d801-d28f-0867-dd438d4be914
Bus Location: 00000000:04:00.0
Architecture: 6.1
Device Index: 1
Device Minor: 1
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a
Bus Location: 00000000:05:00.0
Architecture: 6.1
Device Index: 2
Device Minor: 2
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe
Bus Location: 00000000:08:00.0
Architecture: 6.1
Device Index: 3
Device Minor: 3
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-1ab2485c-121c-77db-6719-0b616d1673f4
Bus Location: 00000000:09:00.0
Architecture: 6.1
Device Index: 4
Device Minor: 4
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c
Bus Location: 00000000:0b:00.0
Architecture: 6.1
Device Index: 5
Device Minor: 5
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-c16444fb-bedb-106d-c188-1f330773cf39
Bus Location: 00000000:84:00.0
Architecture: 6.1
Device Index: 6
Device Minor: 6
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0
Bus Location: 00000000:85:00.0
Architecture: 6.1
Device Index: 7
Device Minor: 7
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28
Bus Location: 00000000:89:00.0
Architecture: 6.1
I0902 21:40:53.687293 2836338 nvc.c:434] shutting down library context
I0902 21:40:53.687347 2836341 rpc.c:95] terminating nvcgo rpc service
I0902 21:40:53.687881 2836338 rpc.c:135] nvcgo rpc service terminated successfully
I0902 21:40:53.692819 2836340 rpc.c:95] terminating driver rpc service
I0902 21:40:53.693046 2836338 rpc.c:135] driver rpc service terminated successfully
- [x] Kernel version from `uname -a`
Linux node5-4 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- [x] Any relevant kernel output lines from `dmesg`
Nothing relevant in dmesg; the only relevant entry in journalctl is `Sep 02 21:17:56 node5-4 systemd[1]: Reloading.` once I run a `systemctl daemon-reload`.
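The daemon-reload trigger can be reproduced on an affected host with a short sketch. `gpu-test` is a hypothetical container name (not from this thread) assumed to have been started with `--gpus all`; the snippet is deliberately a no-op when docker or that container is absent:

```shell
#!/bin/sh
# "gpu-test" is a hypothetical container name, assumed started with
# `docker run --gpus all ...`. Skip everything if it does not exist.
CTR=gpu-test
if command -v docker >/dev/null 2>&1 && docker inspect "$CTR" >/dev/null 2>&1; then
    docker exec "$CTR" nvidia-smi   # works before the reload
    systemctl daemon-reload         # the trigger seen in journalctl above
    docker exec "$CTR" nvidia-smi   # on an affected host: "Failed to initialize NVML: Unknown Error"
fi
```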
- [x] Driver information from `nvidia-smi -a`
Fri Sep 2 21:22:32 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02 Driver Version: 510.68.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN X ... On | 00000000:04:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN X ... On | 00000000:05:00.0 Off | N/A |
| 23% 26C P8 9W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN X ... On | 00000000:08:00.0 Off | N/A |
| 23% 22C P8 7W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN X ... On | 00000000:09:00.0 Off | N/A |
| 23% 24C P8 8W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA TITAN X ... On | 00000000:0B:00.0 Off | N/A |
| 23% 26C P8 9W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA TITAN X ... On | 00000000:84:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA TITAN X ... On | 00000000:85:00.0 Off | N/A |
| 23% 22C P8 8W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA TITAN X ... On | 00000000:89:00.0 Off | N/A |
| 23% 23C P8 7W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- [x] Docker version from `docker version`
Client: Docker Engine - Community
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:02:46 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:00:51 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.4
GitCommit: 212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
runc:
Version: 1.1.1
GitCommit: v1.1.1-0-g52de29d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- [x] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=============================-============-============-=====================================================
ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 3.10.0-1 all NVIDIA container runtime
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook
- [x] NVIDIA container library version from `nvidia-container-cli -V`
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [x] NVIDIA container library logs (see troubleshooting)
I0902 22:11:39.880399 2840718 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0902 22:11:39.880483 2840718 nvc.c:350] using root /
I0902 22:11:39.880501 2840718 nvc.c:351] using ldcache /etc/ld.so.cache
I0902 22:11:39.880514 2840718 nvc.c:352] using unprivileged user 65534:65534
I0902 22:11:39.880559 2840718 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0902 22:11:39.880751 2840718 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0902 22:11:39.884769 2840724 nvc.c:278] loading kernel module nvidia
I0902 22:11:39.884931 2840724 nvc.c:282] running mknod for /dev/nvidiactl
I0902 22:11:39.884991 2840724 nvc.c:286] running mknod for /dev/nvidia0
I0902 22:11:39.885033 2840724 nvc.c:286] running mknod for /dev/nvidia1
I0902 22:11:39.885071 2840724 nvc.c:286] running mknod for /dev/nvidia2
I0902 22:11:39.885109 2840724 nvc.c:286] running mknod for /dev/nvidia3
I0902 22:11:39.885147 2840724 nvc.c:286] running mknod for /dev/nvidia4
I0902 22:11:39.885185 2840724 nvc.c:286] running mknod for /dev/nvidia5
I0902 22:11:39.885222 2840724 nvc.c:286] running mknod for /dev/nvidia6
I0902 22:11:39.885260 2840724 nvc.c:286] running mknod for /dev/nvidia7
I0902 22:11:39.885298 2840724 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0902 22:11:39.892775 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0902 22:11:39.892935 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0902 22:11:39.899624 2840724 nvc.c:296] loading kernel module nvidia_uvm
I0902 22:11:39.899673 2840724 nvc.c:300] running mknod for /dev/nvidia-uvm
I0902 22:11:39.899778 2840724 nvc.c:305] loading kernel module nvidia_modeset
I0902 22:11:39.899820 2840724 nvc.c:309] running mknod for /dev/nvidia-modeset
I0902 22:11:39.900186 2840725 rpc.c:71] starting driver rpc service
I0902 22:11:39.911718 2840726 rpc.c:71] starting nvcgo rpc service
I0902 22:11:39.912892 2840718 nvc_container.c:240] configuring container with 'compute utility supervised'
I0902 22:11:39.913283 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06
I0902 22:11:39.913368 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06
I0902 22:11:39.915116 2840718 nvc_container.c:262] setting pid to 2840712
I0902 22:11:39.915147 2840718 nvc_container.c:263] setting rootfs to /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged
I0902 22:11:39.915160 2840718 nvc_container.c:264] setting owner to 0:0
I0902 22:11:39.915171 2840718 nvc_container.c:265] setting bins directory to /usr/bin
I0902 22:11:39.915182 2840718 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0902 22:11:39.915193 2840718 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0902 22:11:39.915204 2840718 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0902 22:11:39.915215 2840718 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0902 22:11:39.915228 2840718 nvc_container.c:270] setting mount namespace to /proc/2840712/ns/mnt
I0902 22:11:39.915240 2840718 nvc_container.c:272] detected cgroupv2
I0902 22:11:39.915271 2840718 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/system.slice/docker-5fff6f80850791d3858cb511015581375d55ae42df5eb98262ceae31ed47a7d5.scope
I0902 22:11:39.915292 2840718 nvc_info.c:766] requesting driver information with ''
I0902 22:11:39.916901 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02
I0902 22:11:39.917076 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02
I0902 22:11:39.917165 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02
I0902 22:11:39.917236 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02
I0902 22:11:39.917318 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 22:11:39.917411 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02
I0902 22:11:39.917503 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 22:11:39.917574 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02
I0902 22:11:39.917639 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 22:11:39.917730 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02
I0902 22:11:39.917794 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02
I0902 22:11:39.917859 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02
I0902 22:11:39.917926 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02
I0902 22:11:39.918018 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02
I0902 22:11:39.918109 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02
I0902 22:11:39.918176 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 22:11:39.918243 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02
I0902 22:11:39.918335 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 22:11:39.918429 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02
I0902 22:11:39.918628 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 22:11:39.918758 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02
I0902 22:11:39.918827 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02
I0902 22:11:39.918896 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02
I0902 22:11:39.918968 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02
W0902 22:11:39.919005 2840718 nvc_info.c:399] missing library libnvidia-nscq.so
W0902 22:11:39.919022 2840718 nvc_info.c:399] missing library libcudadebugger.so
W0902 22:11:39.919035 2840718 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0902 22:11:39.919049 2840718 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0902 22:11:39.919061 2840718 nvc_info.c:399] missing library libnvidia-ifr.so
W0902 22:11:39.919074 2840718 nvc_info.c:399] missing library libnvidia-cbl.so
W0902 22:11:39.919088 2840718 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W0902 22:11:39.919107 2840718 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0902 22:11:39.919119 2840718 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0902 22:11:39.919131 2840718 nvc_info.c:403] missing compat32 library libcuda.so
W0902 22:11:39.919144 2840718 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0902 22:11:39.919156 2840718 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0902 22:11:39.919168 2840718 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W0902 22:11:39.919192 2840718 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0902 22:11:39.919206 2840718 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0902 22:11:39.919218 2840718 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W0902 22:11:39.919230 2840718 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0902 22:11:39.919242 2840718 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0902 22:11:39.919254 2840718 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0902 22:11:39.919266 2840718 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0902 22:11:39.919279 2840718 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0902 22:11:39.919291 2840718 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0902 22:11:39.919304 2840718 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0902 22:11:39.919317 2840718 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W0902 22:11:39.919329 2840718 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W0902 22:11:39.919341 2840718 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W0902 22:11:39.919353 2840718 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W0902 22:11:39.919365 2840718 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0902 22:11:39.919377 2840718 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0902 22:11:39.919388 2840718 nvc_info.c:403] missing compat32 library libnvoptix.so
W0902 22:11:39.919401 2840718 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W0902 22:11:39.919413 2840718 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0902 22:11:39.919426 2840718 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0902 22:11:39.919438 2840718 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W0902 22:11:39.919451 2840718 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0902 22:11:39.919463 2840718 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0902 22:11:39.919856 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0902 22:11:39.919895 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0902 22:11:39.919931 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0902 22:11:39.919985 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0902 22:11:39.920022 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0902 22:11:39.920096 2840718 nvc_info.c:425] missing binary nv-fabricmanager
I0902 22:11:39.920152 2840718 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin
I0902 22:11:39.920200 2840718 nvc_info.c:529] listing device /dev/nvidiactl
I0902 22:11:39.920215 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm
I0902 22:11:39.920228 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0902 22:11:39.920240 2840718 nvc_info.c:529] listing device /dev/nvidia-modeset
W0902 22:11:39.920281 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket
W0902 22:11:39.920324 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0902 22:11:39.920355 2840718 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0902 22:11:39.920371 2840718 nvc_info.c:822] requesting device information with ''
I0902 22:11:39.927586 2840718 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)
I0902 22:11:39.934626 2840718 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)
I0902 22:11:39.941796 2840718 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0)
I0902 22:11:39.949011 2840718 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0)
I0902 22:11:39.956304 2840718 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)
I0902 22:11:39.963862 2840718 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)
I0902 22:11:39.971543 2840718 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)
I0902 22:11:39.979406 2840718 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)
I0902 22:11:39.979522 2840718 nvc_mount.c:366] mounting tmpfs at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia
I0902 22:11:39.980084 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-smi
I0902 22:11:39.980181 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-debugdump
I0902 22:11:39.980273 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-persistenced
I0902 22:11:39.980360 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-control
I0902 22:11:39.980443 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-server
I0902 22:11:39.980696 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 22:11:39.980795 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02
I0902 22:11:39.980919 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 22:11:39.981004 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 22:11:39.981090 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 22:11:39.981182 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 22:11:39.981272 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 22:11:39.981314 2840718 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0902 22:11:39.981482 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.470.129.06
I0902 22:11:39.981569 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.129.06
I0902 22:11:39.981887 2840718 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/510.68.02/gsp.bin at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/lib/firmware/nvidia/510.68.02/gsp.bin with flags 0x7
I0902 22:11:39.981971 2840718 nvc_mount.c:230] mounting /dev/nvidiactl at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidiactl
I0902 22:11:39.982876 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm
I0902 22:11:39.983470 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm-tools
I0902 22:11:39.983976 2840718 nvc_mount.c:230] mounting /dev/nvidia0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia0
I0902 22:11:39.984099 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:04:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:04:00.0
I0902 22:11:39.984695 2840718 nvc_mount.c:230] mounting /dev/nvidia1 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia1
I0902 22:11:39.984812 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:05:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:05:00.0
I0902 22:11:39.985425 2840718 nvc_mount.c:230] mounting /dev/nvidia2 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia2
I0902 22:11:39.985541 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:08:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:08:00.0
I0902 22:11:39.986207 2840718 nvc_mount.c:230] mounting /dev/nvidia3 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia3
I0902 22:11:39.986322 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:09:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:09:00.0
I0902 22:11:39.986963 2840718 nvc_mount.c:230] mounting /dev/nvidia4 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia4
I0902 22:11:39.987076 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:0b:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:0b:00.0
I0902 22:11:39.987794 2840718 nvc_mount.c:230] mounting /dev/nvidia5 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia5
I0902 22:11:39.987907 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:84:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:84:00.0
I0902 22:11:39.988593 2840718 nvc_mount.c:230] mounting /dev/nvidia6 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia6
I0902 22:11:39.988707 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:85:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:85:00.0
I0902 22:11:39.989388 2840718 nvc_mount.c:230] mounting /dev/nvidia7 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia7
I0902 22:11:39.989515 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:89:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:89:00.0
I0902 22:11:39.990197 2840718 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged
I0902 22:11:40.012422 2840718 nvc.c:434] shutting down library context
I0902 22:11:40.012510 2840726 rpc.c:95] terminating nvcgo rpc service
I0902 22:11:40.013110 2840718 rpc.c:135] nvcgo rpc service terminated successfully
I0902 22:11:40.018693 2840725 rpc.c:95] terminating driver rpc service
I0902 22:11:40.018995 2840718 rpc.c:135] driver rpc service terminated successfully
- [x] Docker command, image and tag used
docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
nvidia-smi
Other open issues
NVIDIA/nvidia-container-toolkit#251, but that one is using cgroup v1. NVIDIA/nvidia-docker#1661, but there isn't any information posted there and it's on Ubuntu 20.04 instead of 22.04.
Important notes / workaround
With containerd.io v1.6.7 or v1.6.8, even with `no-cgroups = true` in /etc/nvidia-container-runtime/config.toml and the devices specified to `docker run`, I still get `Failed to initialize NVML: Unknown Error` after a `systemctl daemon-reload`.
Downgrading containerd.io to 1.6.6 works as long as you set `no-cgroups = true` in /etc/nvidia-container-runtime/config.toml and pass the devices to `docker run`, like:
docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
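For reference, the relevant fragment of /etc/nvidia-container-runtime/config.toml for this workaround looks roughly like this (a sketch showing only the one key being changed; all other keys stay at their defaults):

```toml
[nvidia-container-cli]
# Stop libnvidia-container from editing device cgroup rules itself; the
# device nodes must then be passed explicitly with --device as shown above.
no-cgroups = true
```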
@elezar Previously persistence mode was off, so this happens either way.
Also, on k8s-device-plugin/issues/289 @klueska said:
The only thing we've seen that fully resolves the issue is to upgrade to an "experimental" version of our NVIDIA container runtime that bypasses the need for libnvidia-container to change cgroup permissions out from underneath runC.
Was that merged, or is it something I should try?
@kevin-bockman the experimental mode is still a work in progress and we don't have a concrete timeline on when this will be available for testing. I will update the issue here as soon as I have more information.
The other option is to move to cgroupv2. Since devices are not an actual subsystem in cgroupv2, there is no chance for containerd to undo what libnvidia-container has done under the hood after a restart.
@klueska Sorry, with all of the information it wasn't really clear. The problem is that it's already on cgroupv2, AFAIK: I started from a fresh install of Ubuntu 22.04.1, and `docker info` says it is, at least.
The only way I could get this to work after a `systemctl daemon-reload` is downgrading containerd.io to 1.6.6 and specifying no-cgroups. The other interesting thing is that with containerd v1.6.7 or v1.6.8, even specifying no-cgroups still had the issue, so I'm wondering if there's more than one issue here. I know cgroup v2 has 'fixed' the issue for some people, or so they think (this can look like an intermittent issue if you don't know that the reload triggers it), but it hasn't fixed it for everyone: on a fresh install it still breaks after a daemon reload, or after just waiting for something to be triggered by the OS.
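The `docker info` output below reports `Cgroup Version: 2`; a quick way to double-check which cgroup hierarchy a host is actually running is to look at the filesystem type mounted at /sys/fs/cgroup:

```shell
# cgroup2fs -> unified cgroup v2 (what this Ubuntu 22.04 host runs)
# tmpfs     -> legacy cgroup v1 hierarchy
CGROUP_FS=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
echo "cgroup filesystem: $CGROUP_FS"
```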
$ docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
Server:
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 4
Server Version: 20.10.17
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
cgroupns
Kernel Version: 5.15.0-46-generic
Operating System: Ubuntu 22.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 94.36GiB
Name: node5-4
ID: PPB6:APYD:PKMA:BIOZ:2Y3H:LZUV:TPHD:SBZE:XRSL:NJCB:PWMX:ZVBY
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
@kevin-bockman I had a similar experience.
In my case,
docker run -it --device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia1:/dev/nvidia1 \
--device /dev/nvidia2:/dev/nvidia2 \
--device /dev/nvidia3:/dev/nvidia3 \
--name <container_name> <image_name>
(Replace/repeat nvidia0 with other/more devices as needed.)
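Instead of hard-coding nvidia0..nvidia3, the `--device` flags can be generated from whatever device nodes actually exist on the host. A minimal sketch (the image name is a placeholder; the `[ -e ]` guard also handles an unmatched `/dev/nvidia[0-9]*` glob):

```shell
#!/bin/sh
# Collect a --device flag for every NVIDIA device node present on this host.
DEVICE_FLAGS=""
for dev in /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools \
           /dev/nvidia-modeset /dev/nvidia[0-9]*; do
    [ -e "$dev" ] && DEVICE_FLAGS="$DEVICE_FLAGS --device $dev:$dev"
done
echo "docker run --gpus all$DEVICE_FLAGS --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 nvidia-smi"
```

The command is echoed rather than executed so it can be inspected first.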
This setting works on some machines and not on others. Eventually I found that the working machines have containerd.io version 1.4.6-1 (Ubuntu 18.04)!!! On an Ubuntu 20.04 machine, containerd.io version 1.5.2-1 makes it work.
I tried downgrading and upgrading containerd.io to check whether this strategy works. It works for me.
The above is not the answer after all...
It prevents the NVML error caused by docker resource updates, but the NVML error still occurs after a random amount of time.
Same issue. Ubuntu 22.04, docker-ce. I will just end up writing a cron job script to check for the error and restart the container.
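A minimal sketch of such a watchdog, suitable for a cron entry (this is an assumption, not a tested solution: it assumes the docker CLI is available and that a hypothetical CONTAINERS variable lists the container names to watch):

```shell
#!/usr/bin/env bash
# Hypothetical cron watchdog: restart any container whose nvidia-smi
# output shows the NVML initialization failure.
# CONTAINERS is a space-separated list of container names you maintain.

needs_restart() {
  # True (exit 0) if the given nvidia-smi output indicates the NVML failure.
  case "$1" in
    *"Failed to initialize NVML"*) return 0 ;;
    *) return 1 ;;
  esac
}

for c in ${CONTAINERS:-}; do
  out="$(docker exec "$c" nvidia-smi 2>&1 || true)"
  if needs_restart "$out"; then
    echo "NVML error in $c, restarting"
    docker restart "$c"
  fi
done
```

Note this only papers over the symptom; the container still loses GPU access until the restart completes.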
The solution proposed by @kevin-bockman has been working without any problem for more than a month now.
Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run, like docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
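For reference, the config change this workaround relies on is a one-line toggle in /etc/nvidia-container-runtime/config.toml (a sketch of just the relevant fragment; the rest of the file keeps its defaults):

```toml
[nvidia-container-cli]
# Stop libnvidia-container from managing the device cgroup itself, so it
# does not conflict with the rules systemd/runc set up for the container.
no-cgroups = true
```

With this set, the NVIDIA runtime no longer injects the device cgroup entries, which is why the /dev/nvidia* nodes must then be passed explicitly via --device.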
I am using docker-ce on Ubuntu 22, so I opted for this approach, working fine so far.
Same issue on an NVIDIA 3090, Ubuntu 22.04.1 LTS, Driver Version: 510.85.02, CUDA Version: 11.6.
Hello there.
I'm hitting the same issue here, but with containerd rather than docker.
Here's my configuration:
- GPUs:
  # lspci | grep -i nvidia
  00:04.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
- OS:
  # cat /etc/lsb-release
  DISTRIB_ID=Ubuntu
  DISTRIB_RELEASE=22.04
  DISTRIB_CODENAME=jammy
  DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
- containerd release:
  # containerd --version
  containerd containerd.io 1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
- nvidia-container-toolkit version:
  # nvidia-container-toolkit -version
  NVIDIA Container Runtime Hook version 1.11.0
  commit: d9de4a0
- runc version:
  # runc --version
  runc version 1.1.4
  commit: v1.1.4-0-g5fd4c4d
  spec: 1.0.2-dev
  go: go1.17.13
  libseccomp: 2.5.1
Note that the NVIDIA container toolkit has been installed with NVIDIA's GPU Operator on Kubernetes (v1.25.3).
I attached the containerd configuration file and the nvidia-container-runtime configuration file to my comment. containerd.txt nvidia-container-runtime.txt
How I reproduce this bug:
Running on my host the following command:
# nerdctl run -n k8s.io --runtime=/usr/local/nvidia/toolkit/nvidia-container-runtime --network=host --rm -ti --name ubuntu --gpus all -v /run/nvidia/driver/usr/bin:/tmp/nvidia-bin docker.io/library/ubuntu:latest bash
After some time, the nvidia-smi command exits with the error Failed to initialize NVML: Unknown Error.
Traces, logs, etc...
- Here are the devices listed in the state.json file:
  { "type": 99, "major": 195, "minor": 255, "permissions": "", "allow": false, "path": "/dev/nvidiactl", "file_mode": 438, "uid": 0, "gid": 0 },
  { "type": 99, "major": 234, "minor": 0, "permissions": "", "allow": false, "path": "/dev/nvidia-uvm", "file_mode": 438, "uid": 0, "gid": 0 },
  { "type": 99, "major": 234, "minor": 1, "permissions": "", "allow": false, "path": "/dev/nvidia-uvm-tools", "file_mode": 438, "uid": 0, "gid": 0 },
  { "type": 99, "major": 195, "minor": 254, "permissions": "", "allow": false, "path": "/dev/nvidia-modeset", "file_mode": 438, "uid": 0, "gid": 0 },
  { "type": 99, "major": 195, "minor": 0, "permissions": "", "allow": false, "path": "/dev/nvidia0", "file_mode": 438, "uid": 0, "gid": 0 }
Thank you very much for your help. 🙏
Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.
https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.
https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
@gengwg Can you check whether your solution survives a sudo systemctl daemon-reload on the host? In my case (cgroupv1), it directly breaks the pod; from the pod, nvidia-smi returns Failed to initialize NVML: Unknown Error.
Yes, that's actually the first thing I tested when I upgraded v1 --> v2. It's easy to test, because you don't need to wait a few hours/days.
to double check, i just tested it again right now.
Before:
$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)
Do the reload on that node itself:
# systemctl daemon-reload
After:
$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)
I will update the note to reflect this test too.
And I can also confirm that's what I saw on our cgroupv1 nodes too, i.e. sudo systemctl daemon-reload immediately breaks nvidia-smi.
Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.
https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
Hi, what's your cgroup driver for kubelet and containerd? We meet the same problem on cgroup v2: our cgroup driver is systemd, but if we switch the cgroup driver to cgroupfs, the problem disappears. I think it's the systemd cgroup driver that causes the problem.
Also, switching the cgroup driver of docker to cgroupfs solves the problem as well.
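For docker, the cgroup driver switch mentioned above lives in /etc/docker/daemon.json (a sketch; merge it with any keys you already have, restart dockerd afterwards, and note that kubelet's --cgroup-driver must be changed to match):

```json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```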
Important notes / workaround
containerd.io v1.6.7 or v1.6.8, even with no-cgroups = true in /etc/nvidia-container-runtime/config.toml and the devices specified to docker run, gives Failed to initialize NVML: Unknown Error after a systemctl daemon-reload.
Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run, like docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
I've also tried this. The reason containerd 1.6.7 doesn't work is that runc was updated to 1.1.3; in this version, runc ignores char devices that can't be os.Stat'ed, per this PR. Unfortunately, the GPU-related devices are exactly that kind of device, so it goes wrong.
@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.
I deployed two environments to help me make some comparisons:
- One environment is running kubernetes v1.25.3, with Nvidia's GPU operator
- One environment with only containerd & nvidia-container-toolkit
Interestingly, I never face this issue in the second environment; everything runs perfectly well.
The first environment, though, runs into this issue after some time.
That would probably mean that NVIDIA's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure I'm not missing anything.
I'll have a look at the cgroup driver as @panli889 mentioned.
Thanks again for your help
The cgroup drivers for kubelet, docker and containerd are all systemd. In fact, on cgroupv1 we used to use cgroupfs, but kubelet wouldn't start, complaining about a mismatch between the kubelet and docker cgroup drivers. After I changed the docker (and containerd) cgroup driver to systemd, kubelet was able to start.
# cat /etc/systemd/system/kubelet.service | grep -i cgroup
--runtime-cgroups=/systemd/system.slice \
--kubelet-cgroups=/systemd/system.slice \
--cgroup-driver=systemd \
We are in the middle of migrating from docker to containerd, so we have both docker and containerd nodes. This seems to have fixed it for BOTH.
Docker nodes:
# docker info | grep -i cgroup
WARNING: No swap limit support
Cgroup Driver: systemd
Cgroup Version: 2
cgroupns
Containerd nodes:
$ sudo crictl info | grep -i cgroup
"SystemdCgroup": true
"SystemdCgroup": true
"systemdCgroup": false,
"disableCgroup": false,
Here is our k8s version:
$ k version --short
Client Version: v1.21.3
Server Version: v1.22.9
@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.
I deployed two environments to help me making some comparisons:
- One environment is running kubernetes v1.25.3, with Nvidia's GPU operator
- One environment with only containerd & nvidia-container-toolkit
Interestingly, I never face this issue on the second environment, everything is running perfectly well.
The first environment though is running into this issue after some time.
That would probably means that Nvidia's container runtime isn't the faulty component here, but it needs more investigations on my side to be sure that I'm not missing anything.
I'll have a look at the cgroup driver as @panli889 mentioned.
Thanks again for your help
I think ours is similar to your 2nd env, i.e. containerd & nvidia-container-toolkit. we are on k8s v1.22.9.
# containerd --version
containerd containerd.io 1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
# dnf info nvidia-container-toolkit | grep Version
Version : 1.11.0
I posted the cgroup driver info above.
@gengwg thx for your reply!
cgroup driver for kubelet, docker and containerd are all systemd.
Hmm, that's interesting, it's quite different from my situation. Would you please share your systemd version?
I can share the problem we meet: if we create a pod with a GPU, a related systemd scope like cri-containerd-xxxxxx.scope is created at the same time, and it records the cgroup info. If we run systemctl status to check its status:
Warning: The unit file, source configuration file or drop-ins of cri-containerd-xxxxx.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
● cri-containerd-xxx.scope - libcontainer container xxxx
Loaded: loaded (/run/systemd/transient/cri-containerd-xxxx.scope; transient)
Transient: yes
Drop-In: /run/systemd/transient/cri-containerd-xxxxx.scope.d
└─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUWeight.conf, 50-CPUQuotaPeriodSec.conf, 50-CPUQuota.conf, 50-AllowedCPUs.conf
Active: active (running) since Fri 2022-11-25 12:13:33 +08; 1min 47s ago
IO: 404.0K read, 0B written
Tasks: 1
Memory: 528.0K
CPU: 2.562s
CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb6b36d39_ef5b_4eb9_850d_d710bbd06096.slice/cri-containerd-xxx.scope>
└─61265 sleep infinity
And if we check the content of the 50-DeviceAllow.conf file, we find no GPU device info in there. Then if we run systemctl daemon-reload, it regenerates the eBPF cgroup device program, and that blocks access to the GPU devices.
So would you please also take a look at the content of DeviceAllow.conf for one of the pods' systemd scopes and see what's in there?
Same issue with 2 x Nvidia 3090 Ti, Ubuntu 22.04.1 LTS, Driver Version: 510.85.02, CUDA Version: 11.6
I adopted the solution proposed by @kevin-bockman, downgrading containerd.io from 1.6.10 to 1.6.6. After running systemctl daemon-reload on the host machine, nvidia-smi within the container still works properly. I will check how long it lasts and keep you updated.
@panli889 I checked the scope unit with systemctl status, and this message popped up:
Warning: The unit file, source configuration file or drop-ins of cri-containerd-d35333ac42f1e08a33632fccd63028a28443f95f3c126860a8c9da20b6d27102.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
After running systemctl daemon-reload, I get the error in my container:
root@ubuntu:/# nvidia-smi
Failed to initialize NVML: Unknown Error
Here's the content of the 50-DeviceAllow.conf file:
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
There's indeed no reference to nvidia's devices here:
crw-rw-rw- 1 root root 195, 254 Nov 29 10:18 nvidia-modeset
crw-rw-rw- 1 root root 234, 0 Nov 29 10:18 nvidia-uvm
crw-rw-rw- 1 root root 234, 1 Nov 29 10:18 nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Nov 29 10:18 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 29 10:18 nvidiactl
nvidia-caps:
total 0
cr-------- 1 root root 237, 1 Nov 29 10:18 nvidia-cap1
cr--r--r-- 1 root root 237, 2 Nov 29 10:18 nvidia-cap2
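As a quick sanity check against listings like the one above, one can grep a scope's 50-DeviceAllow.conf for the NVIDIA character-device major (195 for /dev/nvidia0 and /dev/nvidiactl, per the device list). A hypothetical helper sketch (the transient-scope path is an assumption based on the Drop-In path shown earlier in the thread):

```shell
# Hypothetical helper: does a systemd 50-DeviceAllow.conf grant the
# NVIDIA char-device major 195 (/dev/nvidia0, /dev/nvidiactl, ...)?
has_nvidia_allow() {
  grep -Eq '^DeviceAllow=/dev/char/195:' "$1"
}

# Example: check every transient container scope on the host.
for f in /run/systemd/transient/*.scope.d/50-DeviceAllow.conf; do
  [ -e "$f" ] || continue
  if has_nvidia_allow "$f"; then
    echo "OK:      $f"
  else
    echo "MISSING: $f"
  fi
done
```

A scope reported as MISSING here would lose GPU access on the next daemon-reload, matching the behavior described above.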
@fradsj thanks for your reply; it seems to be the same problem as ours.
Here is how we solved it, hope it helps:
- Add --pass-device-specs=true to your k8s-device-plugin, as this comment says: https://github.com/NVIDIA/nvidia-docker/issues/966#issuecomment-610928514 . This param ensures GPU device specs are returned by the device plugin instead of just setting env vars on allocation, so the 50-DeviceAllow.conf will include the GPU device info.
- Ensure the runc version is below 1.1.3; as I mentioned above, runc 1.1.3 introduced a change that ignores the GPU devices passed to runc in step one. https://github.com/opencontainers/runc/issues/3671
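To check whether an installed runc falls in the affected range, a simple version comparison can be scripted (a sketch; `sort -V` does the version ordering, and the `runc --version` parsing is an assumption about its first output line):

```shell
# Hypothetical check: is a given runc version >= 1.1.3, i.e. affected by
# the change that ignores GPU char devices (opencontainers/runc#3671)?
runc_affected() {
  # sort -V orders versions; if 1.1.3 sorts first (or is equal),
  # the given version is >= 1.1.3 and therefore affected.
  [ "$(printf '%s\n%s\n' "1.1.3" "$1" | sort -V | head -n1)" = "1.1.3" ]
}

# Example usage against the installed binary (requires runc on PATH):
# ver="$(runc --version | awk 'NR==1 {print $3}')"
# runc_affected "$ver" && echo "runc $ver is affected; downgrade below 1.1.3"
```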
Hi,
Any official way to fix this error?
The official way is in the works.
It is based on using a new specification called CDI to do the GPU device injection, rather than relying on a runc hook to do the injection behind the back of containerd (which is a fundamental / architectural flaw of the existing nvidia-container-runtime, and is the underlying cause of all these problems).
Until versions of both (1) the nvidia-container-runtime and (2) the k8s-device-plugin are released with proper support for CDI, you will need to rely on one of the workarounds described here.
There is no "official" workaround as such, but the one described in https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1330466432 seems like the best from my perspective. It relies on the already documented use of --pass-device-specs=true in the k8s-device-plugin (which has been the workaround for years until now), combined with downgrading to a version of runc which doesn't trigger the GPUs to be ignored.
Hmm, that's interesting, it's quite different from my situation. Would you please share your systemd version?
I can share the problem we meet: if we create a pod with a GPU, a related systemd scope like cri-containerd-xxxxxx.scope is created at the same time, and it records the cgroup info. If we run systemctl status to check its status:
Warning: The unit file, source configuration file or drop-ins of cri-containerd-xxxxx.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
● cri-containerd-xxx.scope - libcontainer container xxxx
Loaded: loaded (/run/systemd/transient/cri-containerd-xxxx.scope; transient)
Transient: yes
Drop-In: /run/systemd/transient/cri-containerd-xxxxx.scope.d
└─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUWeight.conf, 50-CPUQuotaPeriodSec.conf, 50-CPUQuota.conf, 50-AllowedCPUs.conf
Active: active (running) since Fri 2022-11-25 12:13:33 +08; 1min 47s ago
IO: 404.0K read, 0B written
Tasks: 1
Memory: 528.0K
CPU: 2.562s
CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb6b36d39_ef5b_4eb9_850d_d710bbd06096.slice/cri-containerd-xxx.scope>
└─61265 sleep infinity
And if we check the content of the 50-DeviceAllow.conf file, we find no GPU device info in there. Then if we run systemctl daemon-reload, it regenerates the eBPF cgroup device program, and that blocks access to the GPU devices.
So would you please also take a look at the content of DeviceAllow.conf for one of the pods' systemd scopes and see what's in there?
@panli889 Sorry for the late reply, I was on vacation.
systemd version:
$ systemctl --version
systemd 239 (239-58.el8)
After spinning up a pod on a node:
$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-3836675c-e987-1f01-7ce7-12da20038909)
I don't see the systemd scope nor the DeviceAllow files.
$ find /etc/systemd/ | grep scope
$ sudo find /etc/ | grep -i DeviceAllow
Checked those on our env.
Here is how we solve it, hope it will help:
- Add
--pass-device-specs=true
to your k8s-device-plugin like this comment said Updating cpu-manager-policy=static causes NVML unknown error #966 (comment) . This param will ensure GPU devices are returned by the device plugin instead of just setting the env when allocating, then the50-DeviceAllow.conf
will include GPU device info.
We didn't use the --pass-device-specs=true option, but we do have allowPrivilegeEscalation: false. Looks like it's not the same thing.
$ k get ds nvidia-device-plugin-daemonset -n kube-system -o yaml
....
spec:
containers:
- args:
- --fail-on-init-error=false
image: xxxxx.com/k8s-device-plugin:v0.9.0
imagePullPolicy: IfNotPresent
name: nvidia-device-plugin-ctr
resources: {}
securityContext:
allowPrivilegeEscalation: false # <------
capabilities:
drop:
- ALL
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/kubelet/device-plugins
name: device-plugin
dnsPolicy: ClusterFirst
....
- Ensure the runc version is under 1.1.3, as I mentioned above, runc 1.1.3 introduced an PR, it will ignore the GPU devices passed to runc in step one. Nvidia GPU devices in systemd will be ignored after 1.1.3 opencontainers/runc#3671
Luckily we are right below 1.1.3. We pinned the version on the repo side through CentOS composes, so this should be safe as long as we do not advance the compose version.
$ runc --version
runc version 1.1.2
commit: v1.1.2-0-ga916309
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.5.2