nvidia-docker
Got `docker: Error response from daemon: OCI runtime create failed:` only while NVLink attached
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Also, before reporting a new issue, please make sure that:
- You have read the documentation and frequently asked questions carefully.
- You have searched for a similar issue and this is not a duplicate of an existing one.
- This issue is not related to NGC; otherwise, please use the devtalk forums instead.
- You have gone through the troubleshooting steps.
1. Issue or feature description
I get an error message when running `docker run --gpus all nvidia/cuda:10.1-runtime nvidia-smi`.
However, the same command succeeds if the NVLink bridge is physically removed.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: timed out\\\\n\\\"\"": unknown.
ERRO[0032] error waiting for container: context canceled
2. Steps to reproduce the issue
Run `docker run --gpus all nvidia/cuda:10.1-runtime nvidia-smi` while the NVLink bridge is attached. The command hangs for a while, and during that time the whole system is unresponsive.
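Since the hang makes the system unresponsive, one way to reproduce without losing the terminal is to wrap the run in `timeout` and then check the kernel log; a minimal sketch (the 60 s limit is arbitrary, not part of the original report):

```shell
#!/bin/sh
# Reproduce under a timeout so a hung prestart hook can't tie up the shell,
# then grab recent kernel messages. The 60 s limit is arbitrary.
IMAGE="nvidia/cuda:10.1-runtime"
if command -v docker >/dev/null 2>&1; then
    timeout 60 docker run --rm --gpus all "$IMAGE" nvidia-smi \
        || echo "container failed to start or timed out"
    dmesg 2>/dev/null | tail -n 50 || true
else
    echo "docker is not installed on this host"
fi
```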
3. Information to attach (optional if deemed irrelevant)
- [x] Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0912 02:27:54.664955 8010 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836)
I0912 02:27:54.665002 8010 nvc.c:255] using root /
I0912 02:27:54.665007 8010 nvc.c:256] using ldcache /etc/ld.so.cache
I0912 02:27:54.665011 8010 nvc.c:257] using unprivileged user 65534:65534
I0912 02:27:54.666601 8011 nvc.c:191] loading kernel module nvidia
I0912 02:27:54.666880 8011 nvc.c:203] loading kernel module nvidia_uvm
I0912 02:27:54.667030 8011 nvc.c:211] loading kernel module nvidia_modeset
I0912 02:27:54.667366 8012 driver.c:133] starting driver service
W0912 02:28:19.702479 8010 driver.c:220] terminating driver service (forced)
I0912 02:28:26.875583 8010 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: driver error: timed out
- [x] Kernel version from
uname -a
Linux ThreadRipperRTX 5.0.0-27-generic #28~18.04.1-Ubuntu SMP Thu Aug 22 03:00:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- [ ] Any relevant kernel output lines from
dmesg
- [x] Driver information from
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Wed Sep 11 22:29:19 2019
Driver Version : 418.87.00
CUDA Version : 10.1
Attached GPUs : 3
GPU 00000000:08:00.0
Product Name : GeForce RTX 2080 Ti
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-13b5329a-0931-95b3-50cf-6532e95475ed
Minor Number : 0
VBIOS Version : 90.02.17.00.5F
MultiGPU Board : No
Board ID : 0x800
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x08
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0410DE
Bus Id : 00000000:08:00.0
Sub System Id : 0x250319DA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 29 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 10989 MiB
Used : 1 MiB
Free : 10988 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 3 MiB
Free : 253 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 33 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 22.14 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 280.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:42:00.0
Product Name : GeForce RTX 2080 Ti
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-fb3f29fb-86f4-c8d3-1258-77518bc07ff8
Minor Number : 1
VBIOS Version : 90.02.17.00.5F
MultiGPU Board : No
Board ID : 0x4200
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x42
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0410DE
Bus Id : 00000000:42:00.0
Sub System Id : 0x250319DA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 30 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 10989 MiB
Used : 1 MiB
Free : 10988 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 3 MiB
Free : 253 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 35 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 8.52 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 280.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:43:00.0
Product Name : GeForce RTX 2070
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-2499258b-2519-1e61-06d0-f9aae805c21c
Minor Number : 2
VBIOS Version : 90.06.16.00.17
MultiGPU Board : No
Board ID : 0x4300
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x43
Device : 0x00
Domain : 0x0000
Device Id : 0x1F0710DE
Bus Id : 00000000:43:00.0
Sub System Id : 0x21723842
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 7951 MiB
Used : 482 MiB
Free : 7469 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 4 MiB
Free : 252 MiB
Compute Mode : Default
Utilization
Gpu : 1 %
Memory : 10 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 48 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 18.58 W
Power Limit : 185.00 W
Default Power Limit : 185.00 W
Enforced Power Limit : 185.00 W
Min Power Limit : 105.00 W
Max Power Limit : 240.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1518
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 341 MiB
Process ID : 2251
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 138 MiB
- [x] Docker version from
docker version
Client: Docker Engine - Community
Version: 19.03.2
API version: 1.40
Go version: go1.12.8
Git commit: 6a30dfc
Built: Thu Aug 29 05:29:11 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.2
API version: 1.40 (minimum version 1.12)
Go version: go1.12.8
Git commit: 6a30dfc
Built: Thu Aug 29 05:27:45 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.6
GitCommit: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc:
Version: 1.0.0-rc8
GitCommit: 425e105d5a03fabd737a126ad93d62a9eeede87f
docker-init:
Version: 0.18.0
GitCommit: fec3683
- [x] NVIDIA packages version from
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-================================-==========================-============-=========================================================
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-cfg1-418:amd64 418.87.00-0ubuntu1 amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any <none> <none> (no description available)
un libnvidia-common <none> <none> (no description available)
ii libnvidia-common-418 418.87.00-0ubuntu1 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-410:amd64 410.104-0ubuntu0~18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-compute-418:amd64 418.87.00-0ubuntu1 amd64 NVIDIA libcompute package
rc libnvidia-compute-430:amd64 430.40-0ubuntu0~gpu18.04.1 amd64 NVIDIA libcompute package
rc libnvidia-compute-435:amd64 435.21-0ubuntu0~18.04.2 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.0.5-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.5-1 amd64 NVIDIA container runtime library
un libnvidia-decode <none> <none> (no description available)
ii libnvidia-decode-418:amd64 418.87.00-0ubuntu1 amd64 NVIDIA Video Decoding runtime libraries
un libnvidia-encode <none> <none> (no description available)
ii libnvidia-encode-418:amd64 418.87.00-0ubuntu1 amd64 NVENC Video Encoding runtime library
un libnvidia-fbc1 <none> <none> (no description available)
ii libnvidia-fbc1-418:amd64 418.87.00-0ubuntu1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
un libnvidia-gl <none> <none> (no description available)
ii libnvidia-gl-418:amd64 418.87.00-0ubuntu1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un libnvidia-ifr1 <none> <none> (no description available)
ii libnvidia-ifr1-418:amd64 418.87.00-0ubuntu1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
un libnvidia-ml1 <none> <none> (no description available)
un nvidia-304 <none> <none> (no description available)
un nvidia-340 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-common <none> <none> (no description available)
ii nvidia-compute-utils-418 418.87.00-0ubuntu1 amd64 NVIDIA compute utilities
un nvidia-container-runtime <none> <none> (no description available)
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.0.5-1 amd64 NVIDIA container runtime hook
ii nvidia-dkms-418 418.87.00-0ubuntu1 amd64 NVIDIA DKMS package
un nvidia-dkms-kernel <none> <none> (no description available)
ii nvidia-driver-418 418.87.00-0ubuntu1 amd64 NVIDIA driver metapackage
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-kernel-common <none> <none> (no description available)
ii nvidia-kernel-common-418 418.87.00-0ubuntu1 amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
ii nvidia-kernel-source-418 418.87.00-0ubuntu1 amd64 NVIDIA kernel source package
un nvidia-legacy-340xx-vdpau-driver <none> <none> (no description available)
un nvidia-libopencl1-dev <none> <none> (no description available)
ii nvidia-modprobe 418.87.00-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
un nvidia-opencl-icd <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 418.87.00-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-utils <none> <none> (no description available)
ii nvidia-utils-418 418.87.00-0ubuntu1 amd64 NVIDIA driver support binaries
un nvidia-vdpau-driver <none> <none> (no description available)
ii xserver-xorg-video-nvidia-418 418.87.00-0ubuntu1 amd64 NVIDIA binary Xorg driver
- [x] NVIDIA container library version from
nvidia-container-cli -V
version: 1.0.5
build date: 2019-09-06T16:59+00:00
build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
build compiler: x86_64-linux-gnu-gcc-7 7.4.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [x] NVIDIA container library logs (see troubleshooting)
Nothing was logged.
- [x] Docker command, image and tag used
docker run --gpus all nvidia/cuda:10.1-runtime nvidia-smi
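Since nothing was logged, one way to get container-library logs is to enable debug output in `/etc/nvidia-container-runtime/config.toml` by uncommenting the `debug` entries. A sketch of the relevant excerpt; the exact default log paths vary by toolkit version, so verify against the config file shipped on your system:

```toml
# /etc/nvidia-container-runtime/config.toml (excerpt, paths may differ)
[nvidia-container-cli]
debug = "/var/log/nvidia-container-runtime-hook.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
```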
Just wondering if you ever solved this... I'm running into the same problem.
Thanks!
@yucolabjames Not at all. In the end I detached the NVLink bridge so I could keep things rolling. I also tried several troubleshooting steps from this forum; none of them worked either.
Reproduced the same behavior on Ubuntu 18.04 after an `apt update` on 10/2/2019. Removing the NVLink bridge worked as a workaround.
Reproduced the same behavior under Debian 10. Any further progress on this?
I haven't investigated this issue further; what works for me for now is switching to an Intel platform.
Can anybody else with that problem supply the output of nvidia-bug-report.sh (see https://github.com/NVIDIA/nvidia-docker/issues/1180)?
Hello!
Sorry for the lack of support on this, having the output of nvidia-bug-report.sh would be super helpful to debug this (likely) driver issue. Thanks!
The same question. Is there any solution to solve it?
@jiangxiaobin96 The NVLink devices are not currently mounted from the host into the container by default. You could try adding `--device /dev/nvidia-nvlink` (as well as any `/dev/nvidia-nvswitch*` devices that may exist) to your docker command line.
Also, as mentioned in the comments, the output of nvidia-bug-report.sh would be helpful.
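The suggested invocation can be sketched as follows, passing through whichever NVLink/NVSwitch device nodes happen to exist on the host (device paths are taken from the comment above; nodes that aren't present are simply skipped):

```shell
#!/bin/sh
# Build --device arguments for any NVLink/NVSwitch nodes present on the host.
# Device paths follow the suggestion above; absent nodes are skipped.
DEVICE_ARGS=""
for dev in /dev/nvidia-nvlink /dev/nvidia-nvswitch*; do
    if [ -e "$dev" ]; then
        DEVICE_ARGS="$DEVICE_ARGS --device $dev"
    fi
done
CMD="docker run --gpus all$DEVICE_ARGS nvidia/cuda:10.1-runtime nvidia-smi"
echo "$CMD"   # inspect the command first; run it with: eval "$CMD"
```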
In general it shouldn't be necessary to inject these into a container. As long as you have fabric manager running on the host and the fabric manager socket injected into the container (which libnvidia-container should do for you), things should work as expected with regard to nvswitches/nvlinks.
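A quick way to check the fabric manager precondition described above; the service name and socket directory below are assumptions based on typical driver packaging, so verify them against your installation:

```shell
#!/bin/sh
# Check that fabric manager is running and its socket directory exists.
# Service name and socket path are assumptions; verify on your system.
FM_SOCKET_DIR="/var/run/nvidia-fabricmanager"
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-active nvidia-fabricmanager 2>/dev/null \
        || echo "fabric manager service is not active"
else
    echo "systemctl not available; check the fabric manager process manually"
fi
if [ -d "$FM_SOCKET_DIR" ]; then
    echo "fabric manager socket dir present: $FM_SOCKET_DIR"
else
    echo "no fabric manager socket dir at $FM_SOCKET_DIR"
fi
```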