
🐛 [Bug] Docker build fails

Open Biaocsu opened this issue 3 years ago • 6 comments

Bug Description

Sending build context to Docker daemon  29.57MB
Step 1/34 : ARG BASE=22.01
Step 2/34 : ARG BASE_IMG=nvcr.io/nvidia/pytorch:${BASE}-py3
Step 3/34 : FROM ${BASE_IMG} as base
 ---> 3f290bb26216
Step 4/34 : FROM base as torch-tensorrt-builder-base
 ---> 3f290bb26216
Step 5/34 : RUN rm -rf /opt/pytorch/torch_tensorrt /usr/bin/bazel
 ---> Using cache
 ---> 6a5e47ade331
Step 6/34 : ARG ARCH="x86_64"
 ---> Using cache
 ---> 94acf4d3746c
Step 7/34 : ARG TARGETARCH="amd64"
 ---> Using cache
 ---> 6e159a1f5635
Step 8/34 : ARG BAZEL_VERSION=4.2.1
 ---> Using cache
 ---> 981ce2761707
Step 9/34 : RUN [[ "$TARGETARCH" == "amd64" ]] && ARCH="x86_64" || ARCH="${TARGETARCH}"  && wget -q https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-linux-${ARCH} -O /usr/bin/bazel  && chmod a+x /usr/bin/bazel
 ---> Using cache
 ---> 3387313ebde0
Step 10/34 : RUN touch /usr/lib/$HOSTTYPE-linux-gnu/libnvinfer_static.a
 ---> Using cache
 ---> 2be1d77d6422
Step 11/34 : RUN rm -rf /usr/local/cuda/lib* /usr/local/cuda/include   && ln -sf /usr/local/cuda/targets/$HOSTTYPE-linux/lib /usr/local/cuda/lib64   && ln -sf /usr/local/cuda/targets/$HOSTTYPE-linux/include /usr/local/cuda/include
 ---> Using cache
 ---> 97185e01fe2f
Step 12/34 : RUN apt-get update && apt-get install -y --no-install-recommends locales ninja-build && rm -rf /var/lib/apt/lists/* && locale-gen en_US.UTF-8
 ---> Using cache
 ---> 85a8c0dd0f06
Step 13/34 : FROM torch-tensorrt-builder-base as torch-tensorrt-builder
 ---> 85a8c0dd0f06
Step 14/34 : RUN rm -rf /opt/pytorch/torch_tensorrt
 ---> Using cache
 ---> 05475178cb1e
Step 15/34 : COPY . /workspace/torch_tensorrt/src
 ---> Using cache
 ---> 5c2b45977e72
Step 16/34 : WORKDIR /workspace/torch_tensorrt/src
 ---> Using cache
 ---> 080d93900192
Step 17/34 : RUN cp ./docker/WORKSPACE.docker WORKSPACE
 ---> Using cache
 ---> f2a6f1878089
Step 18/34 : RUN ./docker/dist-build.sh
 ---> Running in fb5b50c7fca3
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
running bdist_wheel
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
Loading:
Loading: 0 packages loaded
Analyzing: target //:libtorchtrt (1 packages loaded, 0 targets configured)
INFO: Analyzed target //:libtorchtrt (43 packages loaded, 3021 targets configured).
INFO: Found 1 target...
[0 / 81] [Prepa] Writing file cpp/lib/libtorchtrt.so-2.params ... (2 actions, 0 running)
ERROR: /workspace/torch_tensorrt/src/core/lowering/passes/BUILD:10:11: Compiling core/lowering/passes/linear_to_addmm.cpp failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 61 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
core/lowering/passes/linear_to_addmm.cpp: In function 'void torch_tensorrt::core::lowering::passes::replaceLinearWithBiasNonePattern(std::shared_ptr<torch::jit::Graph>)':
core/lowering/passes/linear_to_addmm.cpp:37:93: error: 'struct torch::jit::Function' has no member named 'graph'
   37 |         std::shared_ptr<torch::jit::Graph> d_graph = decompose_funcs.get_function("linear").graph();
      |                                                                                             ^~~~~
Target //:libtorchtrt failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 166.072s, Critical Path: 17.92s
INFO: 1243 processes: 1196 internal, 47 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
using CXX11 ABI build
building libtorchtrt
The command '/bin/sh -c ./docker/dist-build.sh' returned a non-zero code: 1

To Reproduce

Steps to reproduce the behavior:

  1. git clone https://github.com/NVIDIA/Torch-TensorRT
  2. cd Torch-TensorRT
  3. docker build --build-arg BASE=22.01 -f docker/Dockerfile -t torch_tensorrt:latest .

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • Torch-TensorRT Version (e.g. 1.0.0):
  • PyTorch Version (e.g. 1.0):
  • CPU Architecture:
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, libtorch, source):
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version:
  • CUDA version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Biaocsu avatar Feb 16 '22 01:02 Biaocsu

@Biaocsu: Which branch are you using? Please use release/ngc/22.01 branch if you want to build the source inside the NGC container.

andi4191 avatar Feb 16 '22 02:02 andi4191

@Biaocsu: Which branch are you using? Please use release/ngc/22.01 branch if you want to build the source inside the NGC container.

I used the latest master branch. OK, I will use the release/ngc/22.01 branch.

Biaocsu avatar Feb 16 '22 02:02 Biaocsu

Experiencing the same issue here. Checked out release/ngc/22.01 and it's working. Thanks.

chophilip21 avatar Feb 17 '22 23:02 chophilip21

@andi4191 Can this issue be solved? It has been open for a long time. I need to use the master branch because it has fixes for many issues that the other branches do not, so this would be a real problem for me otherwise. Please help verify and fix this.

Biaocsu avatar Feb 28 '22 02:02 Biaocsu

Hi @chophilip21 @Biaocsu,

It seems the error is related to an API change picked up for the NGC 22.02 release; the update is here: https://github.com/NVIDIA/Torch-TensorRT/commit/449723ef014fc839cbed5fda0824419a902bd403#
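
For reference, the failing line calls graph() directly on the torch::jit::Function returned by get_function("linear"), and newer libtorch releases no longer expose graph() on the base Function class. Below is a minimal sketch of the kind of change involved, assuming the replacement helper is torch::jit::toGraphFunction and that decompose_funcs is the torch::jit::CompilationUnit used in that lowering pass; the helper name here is hypothetical and the exact code in the linked commit may differ:

#include <memory>
#include <torch/csrc/jit/api/compilation_unit.h>  // torch::jit::CompilationUnit
#include <torch/csrc/jit/api/function_impl.h>     // torch::jit::toGraphFunction
#include <torch/csrc/jit/ir/ir.h>                 // torch::jit::Graph

// Hypothetical helper for illustration only.
// Old call (fails to compile against newer libtorch, as in the log above):
//   decompose_funcs.get_function("linear").graph();
// New call: resolve the Function to a GraphFunction first, then take its graph.
std::shared_ptr<torch::jit::Graph> getLinearDecompositionGraph(
    torch::jit::CompilationUnit& decompose_funcs) {
  torch::jit::Function& fn = decompose_funcs.get_function("linear");
  return torch::jit::toGraphFunction(fn).graph();
}

As far as I can tell, this tracks the upstream libtorch change where graph() moved off the base Function class onto GraphFunction, which is why sources still calling Function::graph() fail to compile against the newer libtorch shipped in these containers.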

However, I am able to build the release/ngc/22.01 branch with BASE=22.01. Can you please check that you have the correct version checked out?

To summarize, I haven't been able to reproduce this issue.

andi4191 avatar May 09 '22 17:05 andi4191

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] avatar Aug 08 '22 00:08 github-actions[bot]