nvidia-docker

Error response from daemon: OCI runtime create failed: container_linux.go:370

Open HYL-Dave opened this issue 3 years ago • 16 comments


1. Issue or feature description

I got the following error: docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: detection error: open failed: /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.129.06: no such file or directory: unknown.

2. Steps to reproduce the issue

It occurs when I run docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

  • [ ] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0811 17:18:18.672543 8080 nvc.c:282] initializing library context (version=1.3.0, build=16315ebdf4b9728e899f615e208b50c41d7a5d15)
I0811 17:18:18.672559 8080 nvc.c:256] using root /
I0811 17:18:18.672561 8080 nvc.c:257] using ldcache /etc/ld.so.cache
I0811 17:18:18.672563 8080 nvc.c:258] using unprivileged user 1000:1000
I0811 17:18:18.672572 8080 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0811 17:18:18.672646 8080 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
W0811 17:18:18.693851 8081 nvc.c:187] failed to set inheritable capabilities
W0811 17:18:18.693872 8081 nvc.c:188] skipping kernel modules load due to failure
I0811 17:18:18.694059 8082 driver.c:101] starting driver service
I0811 17:18:18.695451 8080 nvc_info.c:680] requesting driver information with ''
I0811 17:18:18.696285 8080 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.141.03
nvidia-container-cli: detection error: open failed: /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.129.06: no such file or directory
I0811 17:18:18.696314 8080 nvc.c:337] shutting down library context
I0811 17:18:18.696520 8082 driver.c:156] terminating driver service
I0811 17:18:18.696761 8080 driver.c:196] driver service terminated successfully

  • [ ] Kernel version from uname -a Linux hyl-Precision-7540 5.4.0-122-generic #138~18.04.1-Ubuntu SMP Fri Jun 24 14:14:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • [ ] Any relevant kernel output lines from dmesg

  • [ ] Docker version from docker version

  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

  • [ ] NVIDIA container library version from nvidia-container-cli -V

version: 1.3.0
build date: 2020-09-16T12:32+00:00
build revision: 16315ebdf4b9728e899f615e208b50c41d7a5d15
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • [ ] NVIDIA container library logs (see troubleshooting)

  • [ ] Docker command, image and tag used

How can I fix it? Should I reinstall something?

HYL-Dave avatar Aug 11 '22 17:08 HYL-Dave

@HYL-Dave from the log it seems that you're using v1.3.0 of the NVIDIA Container CLI. Would it be possible to repeat the test with a more up to date version? (v1.10.0 is the latest stable release).

elezar avatar Aug 12 '22 07:08 elezar

@elezar Thanks! I'd like to try that. How can I update the NVIDIA Container CLI? Sorry, but I cannot find a way to update only the NVIDIA Container CLI.

HYL-Dave avatar Aug 13 '22 08:08 HYL-Dave

The name of the package is libnvidia-container-tools. That said, it's not recommended to upgrade just this low-level component. You should update the whole toolkit at once by upgrading the nvidia-container-toolkit package to v1.10.0.

klueska avatar Aug 13 '22 09:08 klueska
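
For readers following along, here is a minimal sketch (not part of the original replies) of how one might confirm the installed CLI version after upgrading the nvidia-container-toolkit package. The helper names version_ge and cli_version are hypothetical. The comparison assumes GNU sort with -V, and the optional cli- prefix on the version line is an assumption about newer releases; the v1.3.0 output earlier in the thread prints a plain version: line.

```shell
# Assumption: on a Debian/Ubuntu host with NVIDIA's apt repository already
# configured, the upgrade itself would typically be:
#   sudo apt-get update
#   sudo apt-get install --only-upgrade nvidia-container-toolkit

# version_ge A B: succeed if version A >= version B (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cli_version: read `nvidia-container-cli -V` output on stdin and print the
# value of its first "version:" (or, assumed, "cli-version:") line.
cli_version() {
    sed -n -E 's/^(cli-)?version:[[:space:]]*//p' | head -n 1
}
```

Usage: `version_ge "$(nvidia-container-cli -V | cli_version)" 1.10.0 && echo "new enough"`.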

@elezar @klueska Thank you! I have updated to 1.11, but the same error remains:

  • [ ] NVIDIA container library version from nvidia-container-cli -V

-- WARNING, the following logs are for debugging purposes only --

I0813 09:16:48.006521 9590 nvc.c:376] initializing library context (version=1.11.0~rc.2, build=ab4ac25ea4752ec8a01afef6c994754cf67a0796)
I0813 09:16:48.006544 9590 nvc.c:350] using root /
I0813 09:16:48.006546 9590 nvc.c:351] using ldcache /etc/ld.so.cache
I0813 09:16:48.006547 9590 nvc.c:352] using unprivileged user 1000:1000
I0813 09:16:48.006556 9590 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0813 09:16:48.006623 9590 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0813 09:16:48.027562 9591 nvc.c:273] failed to set inheritable capabilities
W0813 09:16:48.027589 9591 nvc.c:274] skipping kernel modules load due to failure
I0813 09:16:48.027759 9592 rpc.c:71] starting driver rpc service
I0813 09:16:48.029131 9593 rpc.c:71] starting nvcgo rpc service
I0813 09:16:48.029589 9590 nvc_info.c:766] requesting driver information with ''
I0813 09:16:48.030483 9590 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.141.03
E0813 09:16:48.030512 9590 nvc_info.c:358] error looking up libraries
nvidia-container-cli: detection error: open failed: /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.129.06: no such file or directory
I0813 09:16:48.030518 9590 nvc.c:434] shutting down library context
I0813 09:16:48.030577 9593 rpc.c:95] terminating nvcgo rpc service
I0813 09:16:48.030833 9590 rpc.c:135] nvcgo rpc service terminated successfully
I0813 09:16:48.031183 9592 rpc.c:95] terminating driver rpc service
I0813 09:16:48.031247 9590 rpc.c:135] driver rpc service terminated successfully

HYL-Dave avatar Aug 13 '22 09:08 HYL-Dave

Can you make sure you don't have two conflicting versions of this library on your host? The fact that libnvidia-tls.so.470.129.06 has a different version number than libnvoptix.so.470.141.03 is suspicious.

klueska avatar Aug 13 '22 09:08 klueska
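
A rough sketch of the version-mismatch check described above (again, not from the original replies). The function name driver_lib_versions is hypothetical, and restricting the search to libnvidia-* and libnvoptix files is an assumption about which libraries carry the driver version suffix.

```shell
# driver_lib_versions DIR: print each distinct driver version suffix
# (for example 470.141.03) found on NVIDIA library file names in DIR.
# Names with a bare soname suffix such as libnvoptix.so.1 are ignored.
driver_lib_versions() {
    find "$1" -maxdepth 1 \
        \( -name 'libnvidia-*.so.*' -o -name 'libnvoptix.so.*' \) 2>/dev/null \
        | sed -n -E 's/.*\.so\.([0-9]+\.[0-9]+(\.[0-9]+)?)$/\1/p' \
        | sort -u
}
```

Usage: `driver_lib_versions /usr/lib/x86_64-linux-gnu`. More than one line of output would suggest that libraries from two driver versions are installed side by side.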

@klueska Here is the relevant output:

ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-tls*
libnvidia-tls.so.470.141.03
ls /usr/lib/x86_64-linux-gnu/ | grep libnvoptix*
libnvoptix.so.1
libnvoptix.so.470.141.03

HYL-Dave avatar Aug 13 '22 10:08 HYL-Dave

Then I’m guessing this other version of the tls library is embedded in the container image. What container image are you using? One of the default ones from NVIDIA or one generated elsewhere? Note that you must not build an image with the nvidia-container-runtime set as the default runtime for docker. If you do that, it will embed the nvidia driver files inside the image for whatever driver version you have. Moreover, their sizes will be 0 (because they aren’t actually part of the image, but they were in there at build time, so a trace of them is left over as a 0-byte file).

klueska avatar Aug 13 '22 17:08 klueska
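
One way to look for the leftover 0-byte driver files mentioned above, as a sketch rather than anything taken from the original replies. The function name empty_driver_files and the two file-name patterns are assumptions chosen for illustration.

```shell
# empty_driver_files ROOT: list zero-byte regular files under ROOT whose
# names look like NVIDIA driver libraries.
empty_driver_files() {
    find "$1" -type f -size 0 \
        \( -name 'libnvidia-*.so*' -o -name 'libcuda.so*' \) 2>/dev/null \
        | sort
}
```

Inside a container started without --gpus, the equivalent one-off check would be something like `docker run --rm nvidia/cuda:11.0.3-base-ubuntu20.04 find / -type f -size 0 -name 'libnvidia-*'`. Any output would point to driver files baked into the image at build time.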

@klueska Here is some additional information:

docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: open failed: /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.129.06: no such file or directory: unknown.

HYL-Dave avatar Aug 14 '22 11:08 HYL-Dave

Can you run the container without --gpus and see if there are any nvidia driver libs in it?

klueska avatar Aug 14 '22 11:08 klueska

@klueska Yes, it works without --gpus

HYL-Dave avatar Aug 14 '22 12:08 HYL-Dave

Yes. I assumed it would. I just want to know if there are any nvidia libs present in the container image (or more specifically the problematic tls one), so you will need to search for them after starting the container without --gpus.

klueska avatar Aug 14 '22 12:08 klueska

@klueska I get no results from ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia inside the container.

HYL-Dave avatar Aug 14 '22 12:08 HYL-Dave

Somehow / somewhere your system is presenting a file called libnvidia-tls.so.470.129.06 to the nvidia container stack. Until we locate this file, I won't be able to help you further. It's either on your host system somewhere (try checking outside of /usr/lib/x86_64-linux-gnu/) or it's in the container image somewhere (also try checking outside of /usr/lib/x86_64-linux-gnu/).

klueska avatar Aug 15 '22 10:08 klueska

@klueska I found no file named libnvidia-tls.so.470.129.06, even when searching from the root path. Should I rebuild the whole environment and OS from scratch?

HYL-Dave avatar Aug 15 '22 12:08 HYL-Dave

Do you have any files with the extension 470.129.06 on them? There's no way that libnvidia-container would have just invented this version number. It has to come from some file on your system.

klueska avatar Aug 15 '22 14:08 klueska
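
A small sketch of the filesystem search being asked for here (not from the original replies). The function name files_with_version is hypothetical. Because find is run without -L, symlinks are matched by their own names, so a dangling link that still carries the old version would show up too.

```shell
# files_with_version ROOT VERSION: list paths under ROOT (staying on the
# same filesystem) whose names contain VERSION, dangling symlinks included.
files_with_version() {
    find "$1" -xdev -name "*$2*" 2>/dev/null | sort
}
```

Usage, from a root shell on the host or inside the container: `files_with_version / 470.129.06`.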

@klueska So weird! I cannot find any files with the extension 470.129.06.

HYL-Dave avatar Aug 16 '22 01:08 HYL-Dave

@HYL-Dave try removing the ld cache (probably /etc/ld.so.cache) and rebuilding it, then restart docker and run the container again.

allflame avatar Oct 03 '22 19:10 allflame
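
A hedged sketch of how the stale-cache theory above could be checked before rebuilding anything (not part of the original replies). The logs earlier in the thread show nvidia-container-cli reading /etc/ld.so.cache, so a cache entry that still points at a removed library is one plausible source of the missing 470.129.06 path, though the thread does not confirm it. The function name stale_ldcache_entries is hypothetical, and the systemctl command in the comments assumes a systemd host.

```shell
# stale_ldcache_entries: read `ldconfig -p` output on stdin and print every
# cached path whose target file no longer exists on disk.
stale_ldcache_entries() {
    sed -n 's/.* => //p' | while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}

# If stale entries show up, the rebuild suggested above would typically be:
#   sudo rm /etc/ld.so.cache
#   sudo ldconfig
#   sudo systemctl restart docker   # assumption: Docker is managed by systemd
```

Usage: `ldconfig -p | stale_ldcache_entries`. A line ending in libnvidia-tls.so.470.129.06 would indicate that the cache still references the old driver library.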