🐛 [Bug] Can't build Docker image on Windows.
Bug Description
I'm completely new to Docker but, after trying unsuccessfully to install Torch-TensorRT and its dependencies, I wanted to try this approach. However, when I follow the instructions I run into a series of problems/bugs, as described below:
To Reproduce
Steps to reproduce the behavior:
After installing Docker, run the following commands in a command prompt from a local directory:
- docker pull nvcr.io/nvidia/pytorch:21.12-py3
- git clone https://github.com/NVIDIA/Torch-TensorRT.git
- cd Torch-TensorRT
- docker build --build-arg BASE=21.12 -f docker/Dockerfile -t torch_tensorrt:latest .
[+] Building 1.4s (15/25)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2.46kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 1.05kB 0.0s
=> [internal] load metadata for nvcr.io/nvidia/pytorch:21.12-py3 0.0s
=> CACHED [base 1/1] FROM nvcr.io/nvidia/pytorch:21.12-py3 0.0s
=> [internal] load build context 0.5s
=> => transferring context: 26.61MB 0.4s
=> CACHED [torch-tensorrt-builder-base 1/5] RUN rm -rf /opt/torch-tensorrt /usr/bin/bazel 0.0s
=> CACHED [torch-tensorrt-builder-base 2/5] RUN [[ "amd64" == "amd64" ]] && ARCH="x86_64" || ARCH="amd64" && wget -q https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-linux-x86_64 -O /usr/bin/bazel && chmo 0.0s
=> CACHED [torch-tensorrt-builder-base 3/5] RUN touch /usr/lib/$HOSTTYPE-linux-gnu/libnvinfer_static.a 0.0s
=> CACHED [torch-tensorrt-builder-base 4/5] RUN rm -rf /usr/local/cuda/lib* /usr/local/cuda/include && ln -sf /usr/local/cuda/targets/$HOSTTYPE-linux/lib /usr/local/cuda/lib64 && ln -sf /usr/local/cuda/targets/$HOSTTYPE-linux 0.0s
=> CACHED [torch-tensorrt-builder-base 5/5] RUN apt-get update && apt-get install -y --no-install-recommends locales ninja-build && rm -rf /var/lib/apt/lists/* && locale-gen en_US.UTF-8 0.0s
=> [torch-tensorrt-builder 1/4] COPY . /workspace/torch_tensorrt/src 0.2s
=> [torch-tensorrt 1/11] COPY . /workspace/torch_tensorrt 0.1s
=> [torch-tensorrt-builder 2/4] WORKDIR /workspace/torch_tensorrt/src 0.0s
=> [torch-tensorrt-builder 3/4] RUN cp ./docker/WORKSPACE.docker WORKSPACE 0.3s
=> ERROR [torch-tensorrt-builder 4/4] RUN ./docker/dist-build.sh 0.3s
------
> [torch-tensorrt-builder 4/4] RUN ./docker/dist-build.sh:
#15 0.272 /bin/bash: ./docker/dist-build.sh: /bin/bash^M: bad interpreter: No such file or directory
------
executor failed running [/bin/sh -c ./docker/dist-build.sh]: exit code: 126
To solve this issue I followed the suggestion here and ran:
sed -i -e 's/\r$//' scriptname.sh
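If other scripts in the repo have the same Windows (CRLF) line endings, the same fix can be applied to all of them at once. A sketch, run from the repository root and assuming GNU sed (as in Git Bash or WSL):

```shell
# Strip trailing carriage returns from every shell script in the checkout,
# so /bin/bash no longer sees "/bin/bash^M" as the interpreter.
find . -name '*.sh' -exec sed -i -e 's/\r$//' {} +
```

Setting `git config core.autocrlf false` (or `input`) before cloning avoids the problem entirely, since Git then stops converting LF to CRLF on checkout.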
Then I retried with:
docker build --build-arg BASE=21.12 -f docker/Dockerfile -t torch_tensorrt:latest .
And this time the error was:
[+] Building 118.9s (15/25)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 32B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 35B 0.0s
=> [internal] load metadata for nvcr.io/nvidia/pytorch:21.12-py3 0.0s
=> CACHED [base 1/1] FROM nvcr.io/nvidia/pytorch:21.12-py3 0.0s
=> [internal] load build context 0.2s
=> => transferring context: 48.70kB 0.2s
=> CACHED [torch-tensorrt-builder-base 1/5] RUN rm -rf /opt/torch-tensorrt /usr/bin/bazel 0.0s
=> CACHED [torch-tensorrt-builder-base 2/5] RUN [[ "amd64" == "amd64" ]] && ARCH="x86_64" || ARCH="amd64" && wget -q https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-linux-x86_64 -O /usr/bin/bazel && chmo 0.0s
=> CACHED [torch-tensorrt-builder-base 3/5] RUN touch /usr/lib/$HOSTTYPE-linux-gnu/libnvinfer_static.a 0.0s
=> CACHED [torch-tensorrt-builder-base 4/5] RUN rm -rf /usr/local/cuda/lib* /usr/local/cuda/include && ln -sf /usr/local/cuda/targets/$HOSTTYPE-linux/lib /usr/local/cuda/lib64 && ln -sf /usr/local/cuda/targets/$HOSTTYPE-linux 0.0s
=> CACHED [torch-tensorrt-builder-base 5/5] RUN apt-get update && apt-get install -y --no-install-recommends locales ninja-build && rm -rf /var/lib/apt/lists/* && locale-gen en_US.UTF-8 0.0s
=> [torch-tensorrt-builder 1/4] COPY . /workspace/torch_tensorrt/src 0.1s
=> [torch-tensorrt 1/11] COPY . /workspace/torch_tensorrt 0.1s
=> [torch-tensorrt-builder 2/4] WORKDIR /workspace/torch_tensorrt/src 0.0s
=> [torch-tensorrt-builder 3/4] RUN cp ./docker/WORKSPACE.docker WORKSPACE 0.2s
=> ERROR [torch-tensorrt-builder 4/4] RUN ./docker/dist-build.sh 118.0s
------
> [torch-tensorrt-builder 4/4] RUN ./docker/dist-build.sh:
#15 2.846 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
#15 2.846 running bdist_wheel
#15 2.888 Extracting Bazel installation...
#15 5.161 Starting local Bazel server and connecting to it...
#15 6.413 Loading:
#15 6.416 Loading: 0 packages loaded
#15 7.420 Loading: 0 packages loaded
#15 8.415 Analyzing: target //:libtorchtrt (1 packages loaded, 0 targets configured)
#15 9.421 Analyzing: target //:libtorchtrt (35 packages loaded, 75 targets configured)
#15 10.08 INFO: Analyzed target //:libtorchtrt (43 packages loaded, 2967 targets configured).
#15 10.08 INFO: Found 1 target...
#15 10.14 [0 / 117] [Prepa] Writing file cpp/lib/libtorchtrt.so-2.params
#15 11.14 [160 / 465] [Prepa] action 'SolibSymlink _solib_k8/_U@cuda_S_S_Ccublas___Ulib64/libcublas.so' ... (2 actions, 0 running)
#15 12.43 [629 / 731] [Prepa] action 'SolibSymlink _solib_k8/_U@libtorch_S_S_Ctorch___Ulib/libtorch_cpu.so' ... (4 actions, 3 running)
#15 13.44 [631 / 731] Compiling core/util/trt_util.cpp; 1s processwrapper-sandbox ... (5 actions running)
#15 14.52 [631 / 731] Compiling core/util/trt_util.cpp; 2s processwrapper-sandbox ... (5 actions running)
#15 17.73 [631 / 731] Compiling core/util/trt_util.cpp; 5s processwrapper-sandbox ... (6 actions, 5 running)
#15 19.67 [632 / 731] Compiling core/util/trt_util.cpp; 7s processwrapper-sandbox ... (6 actions, 5 running)
#15 22.03 [633 / 731] Compiling core/util/trt_util.cpp; 10s processwrapper-sandbox ... (6 actions, 5 running)
#15 25.15 [634 / 731] Compiling core/util/trt_util.cpp; 13s processwrapper-sandbox ... (6 actions, 5 running)
#15 29.72 [637 / 731] Compiling core/util/trt_util.cpp; 17s processwrapper-sandbox ... (6 actions, 5 running)
#15 50.81 [637 / 731] Compiling core/util/trt_util.cpp; 36s processwrapper-sandbox ... (6 actions, 5 running)
#15 73.30 [637 / 731] Compiling core/util/trt_util.cpp; 59s processwrapper-sandbox ... (6 actions, 5 running)
#15 83.28 [637 / 731] Compiling core/util/trt_util.cpp; 70s processwrapper-sandbox ... (6 actions, 5 running)
#15 104.4 [637 / 731] Compiling core/util/trt_util.cpp; 91s processwrapper-sandbox ... (6 actions, 5 running)
#15 113.1 ERROR: /workspace/torch_tensorrt/src/core/plugins/BUILD:10:11: Compiling core/plugins/impl/interpolate_plugin.cpp failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 62 argument(s) skipped)
#15 113.1
#15 113.1 Use --sandbox_debug to see verbose messages from the sandbox
#15 113.1 gcc: fatal error: Killed signal terminated program cc1plus
#15 113.1 compilation terminated.
#15 114.0 Target //:libtorchtrt failed to build
#15 114.0 Use --verbose_failures to see the command lines of failed build steps.
#15 114.1 INFO: Elapsed time: 111.118s, Critical Path: 101.93s
#15 114.1 INFO: 643 processes: 637 internal, 6 processwrapper-sandbox.
#15 114.1 FAILED: Build did NOT complete successfully
#15 114.2 FAILED: Build did NOT complete successfully
#15 114.3 using CXX11 ABI build
#15 114.3 building libtorchtrt
------
executor failed running [/bin/sh -c ./docker/dist-build.sh]: exit code: 1
What am I doing wrong? It may be something completely trivial, since I have no experience with Docker.
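For what it's worth, the `gcc: fatal error: Killed signal terminated program cc1plus` line in the log above is the typical signature of the compiler being killed for running out of memory. On Docker Desktop for Windows the build runs inside a WSL2 VM with a default memory cap, so one workaround worth trying (the sizes below are illustrative, not values from this thread) is to raise the cap in `%UserProfile%\.wslconfig` on the Windows host:

```shell
# Write a .wslconfig raising the WSL2 VM's memory and swap limits.
# Under Git Bash, USERPROFILE is the Windows home directory; fall back
# to $HOME so the snippet also runs on plain Linux for illustration.
: "${USERPROFILE:=$HOME}"
cat > "$USERPROFILE/.wslconfig" <<'EOF'
[wsl2]
memory=12GB
swap=16GB
EOF
```

After writing the file, run `wsl --shutdown` from a Windows prompt and restart Docker Desktop so the new limits take effect.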
Expected behavior
No errors.
Environment
- Torch-TensorRT Version (e.g. 1.0.0): 1.0.0 (latest)
- PyTorch Version (e.g. 1.0): 1.10
- CPU Architecture: AMD64
- OS: Windows 10
- How you installed PyTorch: pip & LibTorch
- Python version: 3.9.9
- CUDA version: 10.2
- GPU models and configuration: GeForce RTX 2060
I don't think we have tried this use case before. Is this building a Linux container on Windows? Would you be using this with NVIDIA Docker or something similar? I think the first step is to figure out exactly what is failing. It seems like you are using a more modern base container (NGC 21.12) which uses a post-1.10 release build of PyTorch, and we are aware of a few breaking changes in PyTorch in this container. I suggest you start by building the release/ngc/21.12 branch and see if this fixes these issues.
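Concretely, the suggestion above amounts to the following (branch name taken from the comment; the build command is the same one used earlier):

```shell
# From inside the cloned Torch-TensorRT directory: switch to the branch
# matched against the NGC 21.12 base image, then rebuild with the same
# command as before.
git checkout release/ngc/21.12
docker build --build-arg BASE=21.12 -f docker/Dockerfile -t torch_tensorrt:latest .
```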
cc: @andi4191
Hi @narendasan, thanks for the response. Yes, exactly: I'd like to build the Docker image and optimize a PyTorch model on a Windows 10 notebook with either a laptop NVIDIA GeForce RTX 3080 or a laptop NVIDIA GeForce RTX 2060.
I tried to directly use the NGC PyTorch container (see PR #755).
On my PC with the RTX 3080 I installed these drivers (497.29), as suggested by the download guide; same for my PC with the RTX 2060. If I then run the command:
docker run --gpus all -it --rm -v \my\local\directory:/some/container/directory nvcr.io/nvidia/pytorch:21.12-py3
It gives me the following error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.
I think this is because I didn't install the NVIDIA Container Toolkit (I'm not sure it's even possible to install it on a Windows machine, since the instructions are only for Linux systems). In fact, if I run the same command without the --gpus all option, the image runs without problems.
Finally, I tried to run the image on a desktop Linux host with a different GPU (GeForce GTX 1080 Ti) and different drivers (495, if I remember correctly) and it worked perfectly. However, if I then try to use the optimized PyTorch model on a Windows machine, it simply can't load the model (I get an unspecified error when I load it through torch::jit::load(MODEL_PATH);), so I suppose I have to optimize the model on the same hardware where I will use it. Am I right?
Thanks
This issue has not seen activity for 90 days. Remove the stale label or add a comment, or this will be closed in 10 days.
Hi @andreabonvini,
I think the issue is with the version of the code base checked out here. Please check out the release/ngc/21.12 branch if you want to use the NGC 21.12 container as the BASE container. Can you please try this and let us know? Please remove any patches before trying.
This issue has not seen activity for 90 days. Remove the stale label or add a comment, or this will be closed in 10 days.
Please see #1058 for building on Windows. If this is not sufficient, please file a new issue with the problem!