mmcv [Docker: Cannot Build TensorRT Plugin]- No CUDA runtime is found

Open • timothylimyl opened this issue 2 years ago • 9 comments

Hi, I have searched the issues, and there seems to be no solution for building mmcv with the TensorRT plugin in Docker.

Dockerfile:

FROM nvcr.io/nvidia/tensorrt:20.12-py3

# ... (installation of all other dependencies)

WORKDIR /src/mmcv
RUN MMCV_WITH_OPS=1 MMCV_WITH_TRT=1 pip install -e .

Without TRT, it works: RUN MMCV_WITH_OPS=1 pip install -e .

The installation also works when I run it inside a running container, which really confuses me.

My current workflow is to build the image without mmcv, run the installation inside a container, and commit the result as a new image. This works, but it would be better to build everything directly during the Docker image build.

Any potential solutions?

Error:

#30 2.823     No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
#30 2.823     Compiling mmcv._ext without CUDA
#30 2.823     running develop
#30 2.823     running egg_info
#30 2.823     creating mmcv_full.egg-info
#30 2.823     writing mmcv_full.egg-info/PKG-INFO
#30 2.823     writing dependency_links to mmcv_full.egg-info/dependency_links.txt
#30 2.823     writing requirements to mmcv_full.egg-info/requires.txt
#30 2.823     writing top-level names to mmcv_full.egg-info/top_level.txt
#30 2.823     writing manifest file 'mmcv_full.egg-info/SOURCES.txt'
#30 2.823     reading manifest file 'mmcv_full.egg-info/SOURCES.txt'
#30 2.823     reading manifest template 'MANIFEST.in'
#30 2.823     writing manifest file 'mmcv_full.egg-info/SOURCES.txt'
#30 2.823     running build_ext
#30 2.823     building 'mmcv._ext_trt' extension
#30 2.823     creating /src/mmcv/build
#30 2.823     creating /src/mmcv/build/temp.linux-x86_64-3.8
#30 2.823     creating /src/mmcv/build/temp.linux-x86_64-3.8/mmcv
#30 2.823     creating /src/mmcv/build/temp.linux-x86_64-3.8/mmcv/ops
#30 2.823     creating /src/mmcv/build/temp.linux-x86_64-3.8/mmcv/ops/csrc
#30 2.823     creating /src/mmcv/build/temp.linux-x86_64-3.8/mmcv/ops/csrc/tensorrt
#30 2.823     creating /src/mmcv/build/temp.linux-x86_64-3.8/mmcv/ops/csrc/tensorrt/plugins
#30 2.823     Traceback (most recent call last):
#30 2.823       File "<string>", line 2, in <module>
#30 2.823       File "<pip-setuptools-caller>", line 34, in <module>
#30 2.823       File "/src/mmcv/setup.py", line 375, in <module>
#30 2.823         setup(
#30 2.823       File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 144, in setup
#30 2.823         return distutils.core.setup(**attrs)
#30 2.823       File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
#30 2.823         dist.run_commands()
#30 2.823       File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
#30 2.823         self.run_command(cmd)
#30 2.823       File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#30 2.823         cmd_obj.run()
#30 2.823       File "/usr/lib/python3/dist-packages/setuptools/command/develop.py", line 38, in run
#30 2.823         self.install_for_development()
#30 2.823       File "/usr/lib/python3/dist-packages/setuptools/command/develop.py", line 140, in install_for_development
#30 2.823         self.run_command('build_ext')
#30 2.823       File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
#30 2.823         self.distribution.run_command(command)
#30 2.823       File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#30 2.823         cmd_obj.run()
#30 2.823       File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 87, in run
#30 2.823         _build_ext.run(self)
#30 2.823       File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
#30 2.823         self.build_extensions()
#30 2.823       File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 708, in build_extensions
#30 2.823         build_ext.build_extensions(self)
#30 2.823       File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
#30 2.823         self._build_extensions_serial()
#30 2.823       File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
#30 2.823         self.build_extension(ext)
#30 2.823       File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 208, in build_extension
#30 2.823         _build_ext.build_extension(self, ext)
#30 2.823       File "/usr/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
#30 2.823         objects = self.compiler.compile(sources,
#30 2.823       File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 524, in unix_wrap_ninja_compile
#30 2.823         cuda_post_cflags = unix_cuda_flags(cuda_post_cflags)
#30 2.823       File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 423, in unix_cuda_flags
#30 2.823         cflags + _get_cuda_arch_flags(cflags))
#30 2.823       File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1561, in _get_cuda_arch_flags
#30 2.823         arch_list[-1] += '+PTX'
#30 2.823     IndexError: list index out of range
#30 2.823     [end of output]
#30 2.823 
#30 2.823 note: This error originates from a subprocess, and is likely not a problem with pip.

Feb 09 '22 09:02 timothylimyl

Hi @timothylimyl, it seems that no CUDA runtime is found while building MMCV with TensorRT. You could refer to mmdeploy and its Dockerfile; that may help.
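The traceback ends in PyTorch's `_get_cuda_arch_flags`, at `arch_list[-1] += '+PTX'`, which fails with `IndexError: list index out of range`. Below is a simplified model of that step, under the assumption (matching PyTorch 1.8's `torch/utils/cpp_extension.py` as far as I can tell; verify against your installed copy) that when `TORCH_CUDA_ARCH_LIST` is unset, the architecture list is built from the GPUs visible at build time. Since `docker build` normally exposes no GPU, that list would be empty. `model_arch_list` is a hypothetical illustration, not PyTorch's actual code.

```python
from typing import List, Optional, Sequence, Tuple


def model_arch_list(env_value: Optional[str],
                    visible_capabilities: Sequence[Tuple[int, int]]) -> List[str]:
    """Simplified model (hypothetical, not PyTorch's implementation) of how
    the CUDA architectures for an extension build are chosen."""
    if env_value:
        # An explicit TORCH_CUDA_ARCH_LIST such as "7.5;8.6" is used as given.
        return env_value.replace(" ", ";").split(";")
    # Otherwise: one entry per distinct compute capability of the visible GPUs.
    arch_list = sorted({f"{major}.{minor}" for major, minor in visible_capabilities})
    # Same statement as in the traceback: raises IndexError if no GPU was visible.
    arch_list[-1] += "+PTX"
    return arch_list


# Running the container with one visible GPU of capability 8.6.
print(model_arch_list(None, [(8, 6)]))   # ['8.6+PTX']
# Explicit list in the environment: no GPU needed.
print(model_arch_list("7.5;8.6", []))    # ['7.5', '8.6']
# `docker build`: variable unset and no visible GPU.
try:
    model_arch_list(None, [])
except IndexError as exc:
    print("IndexError:", exc)            # IndexError: list index out of range
```

If this model is right, declaring the architectures in the Dockerfile, e.g. `ENV TORCH_CUDA_ARCH_LIST="7.5;8.6"` (placeholder values; use the compute capabilities of your target GPUs), should keep the list from being empty even when no GPU is visible during the build.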

Feb 24 '22 02:02 AllentDan

Hi @AllentDan, thanks. I will take a look to see whether it solves my problem and report back.

Mar 12 '22 01:03 timothylimyl

Hi @timothylimyl, is there any progress?

Apr 04 '22 02:04 zhouzaida

Hi @zhouzaida @AllentDan, sorry for the late reply.

Looking at mmdeploy, I cannot tell what is causing the issue. I am also using the official NVIDIA container. Here is my Dockerfile:

FROM nvcr.io/nvidia/tensorrt:20.12-py3

ENV DEBIAN_FRONTEND=noninteractive
SHELL ["/bin/bash","-c"]

ENV FORCE_CUDA="1"
ENV mmcvVersion="v1.4.4"

# Install python libraries (includes torch==1.8.1+cu111, torchvision==0.9.1+cu111)
RUN mkdir /src
WORKDIR /src
ADD ./requirements.txt /src/
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# Install other Linux dependencies

RUN apt-get update && apt-get install -y python3-opencv \
    git \
    cmake \
    build-essential \
    curl \
    wget \
    gnupg2 \
    lsb-release \
    ca-certificates 

RUN git clone https://github.com/open-mmlab/mmcv.git && \
    cd mmcv && \
    git checkout $mmcvVersion && \
    echo "export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}" >> ~/.bashrc &&\
    echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" >> ~/.bashrc &&\
    source ~/.bashrc &&\
    MMCV_WITH_OPS=1 MMCV_WITH_TRT=1 pip install -e .

If I change the last line to pip install -e ., it works. The issue seems to be that, while building the image, Docker cannot find the CUDA path (which is why I tried adding CUDA to PATH). However, if I build the image and then run the container, I can run MMCV_WITH_OPS=1 MMCV_WITH_TRT=1 pip install -e . inside it and everything works.

I ran an experiment with some PyTorch code and can confirm that PyTorch is unable to find CUDA while building the image. However, everything works when running the container (PyTorch and MMCV can find CUDA/TensorRT).

Edit: I used RUN python3 -c "import torch;print(f'CUDA IS FOUND:{torch.cuda.is_available()}')" to check this during the image build.

I will start looking into integrating MMDeploy into the pipeline, but my project has many other dependencies that may not play well with the official MMDeploy Dockerfile.

Apr 05 '22 02:04 timothylimyl

It seems that the environment variables need to be specified in the Dockerfile. Please try the methods from mmdet issue 281 and the environment variables set in the mmdeploy Dockerfile.

Apr 06 '22 06:04 AllentDan

Hi @AllentDan, my previous Dockerfile already sets those environment variables: ENV FORCE_CUDA="1" and ENV DEBIAN_FRONTEND=noninteractive

Edit: adding my requirements.txt for completeness:

--find-links https://download.pytorch.org/whl/torch_stable.html
torch==1.8.1+cu111 
torchvision==0.9.1+cu111
numpy==1.21.4
onnx==1.10.2
pycuda==2021.1

Apr 18 '22 06:04 timothylimyl

After digging into the Docker image, I found that TENSORRT_DIR has no targets or lib directory. Please try:

sed -i '143,145d' setup.py && sed -i '142 i\ \ \ \ \ \ \ \ tensorrt_lib_path = "/usr/lib/x86_64-linux-gnu/"' setup.py

to change tensorrt_lib_path before building mmcv with TensorRT.
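The finding above suggests that setup.py looks for the TensorRT libraries under `TENSORRT_DIR/targets/<arch>/lib`, while this image keeps them in `/usr/lib/x86_64-linux-gnu/`. As an alternative to patching fixed line numbers with sed, here is a minimal sketch of a lookup that accepts either layout. The `targets/*/lib` pattern is inferred from the comment rather than copied from mmcv, and `find_tensorrt_lib` is a hypothetical helper.

```python
import glob
import os


def find_tensorrt_lib(tensorrt_dir: str,
                      fallback: str = "/usr/lib/x86_64-linux-gnu/") -> str:
    """Hypothetical helper: return TENSORRT_DIR/targets/<arch>/lib if that
    layout exists, otherwise the system directory used in the sed fix."""
    candidates = sorted(glob.glob(os.path.join(tensorrt_dir, "targets", "*", "lib")))
    if candidates:
        return candidates[0]
    # Indexing an empty glob result with [0] would raise IndexError instead.
    return fallback


if __name__ == "__main__":
    # Prints the directory that would be used for the current TENSORRT_DIR.
    print(find_tensorrt_lib(os.environ.get("TENSORRT_DIR", "")))
```

Line-number-based sed edits depend on the exact mmcv version being patched; a pattern-based lookup like this does not, at the cost of editing setup.py by hand.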

Apr 18 '22 09:04 AllentDan

I am not sure what this sed error means:

 => ERROR [8/8] RUN git clone https://github.com/open-mmlab/mmcv.git &&    1.8s
------                                                                          
 > [8/8] RUN git clone https://github.com/open-mmlab/mmcv.git &&     

cd mmcv &&    

git checkout v1.4.4 && 

 sed -i '143,145d' setup.py && sed -i '142 i\ \ \ \ \ \ \ \ tensorrt_lib_path = "/usr/lib/x86_64-linux-gnu/"' setup.py     

MMCV_WITH_OPS=1 MMCV_WITH_TRT=1 pip install -e .:

#12 0.246 Cloning into 'mmcv'...
#12 1.740 sed: -e expression #1, char 1: unknown command: `.'
------
executor failed running [/bin/bash -c git clone https://github.com/open-mmlab/mmcv.git &&     cd mmcv &&     sed -i '143,145d' setup.py && sed -i '142 i\ \ \ \ \ \ \ \ tensorrt_lib_path = "/usr/lib/x86_64-linux-gnu/"' setup.py     MMCV_WITH_OPS=1 MMCV_WITH_TRT=1 pip install -e .]: exit code: 1

I think the main error is this: #12 1.740 sed: -e expression #1, char 1: unknown command: `.'
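The RUN line in the log above has no `&&` between the second `sed ... setup.py` and `MMCV_WITH_OPS=1 MMCV_WITH_TRT=1 pip install -e .`, so the whole pip command appears to become extra arguments to sed. GNU sed accepts options after operands, reads `-e .` as a script consisting of `.`, and then treats the quoted insert command and the remaining words as input file names; `.` is not a sed command, hence the error. This sketch tokenizes the command with Python's `shlex` (POSIX quoting rules) to show which arguments each command receives; `split_on_and` is a hypothetical helper that only understands `&&`.

```python
import shlex
from typing import List

# The failing RUN command from the log: nothing separates `setup.py` from
# `MMCV_WITH_OPS=1 ...` after the second sed call.
BROKEN = r"""git clone https://github.com/open-mmlab/mmcv.git && cd mmcv && sed -i '143,145d' setup.py && sed -i '142 i\ \ \ \ \ \ \ \ tensorrt_lib_path = "/usr/lib/x86_64-linux-gnu/"' setup.py MMCV_WITH_OPS=1 MMCV_WITH_TRT=1 pip install -e ."""


def split_on_and(command: str) -> List[List[str]]:
    """Tokenize with POSIX shell quoting rules and split into the argv lists
    joined by '&&'. Other operators (';', '|', ...) are not handled."""
    segments: List[List[str]] = [[]]
    for token in shlex.split(command):
        if token == "&&":
            segments.append([])
        else:
            segments[-1].append(token)
    return segments


last = split_on_and(BROKEN)[-1]
print(last[0])                       # sed
print(last[last.index("-e") + 1])    # .

# With `&&` restored, pip runs as its own command and sed gets no `-e`.
FIXED = BROKEN.replace("setup.py MMCV_WITH_OPS", "setup.py && MMCV_WITH_OPS")
print(split_on_and(FIXED)[-1][:3])   # ['MMCV_WITH_OPS=1', 'MMCV_WITH_TRT=1', 'pip']
```

In the Dockerfile, the corresponding fix is to end the sed line with `&& \` before the pip line.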

Apr 22 '22 03:04 timothylimyl

I mean:

RUN sed -i '144,145d' setup.py && sed -i '142 i\ \ \ \ \ \ \ \ tensorrt_lib_path = "/usr/lib/x86_64-linux-gnu/"' setup.py

Also make sure RUN python -c "import torch;print(torch.cuda.is_available())" prints True during the Docker build. If CUDA is not available while building the image, please try the methods here.
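The one-line check above can be extended into a small diagnostic for the build log. A minimal sketch, assuming it is run as a Python step inside the Dockerfile; `cuda_build_report` and `require_cuda` are hypothetical names. It only reads what PyTorch and the environment expose (`torch.cuda.is_available()`, `torch.cuda.device_count()`, and `TORCH_CUDA_ARCH_LIST`, the latter included on the assumption that it matters here given the `_get_cuda_arch_flags` frame in the traceback), and it still runs if PyTorch is not installed.

```python
import os
from typing import Any, Dict


def cuda_build_report() -> Dict[str, Any]:
    """Collect what PyTorch can see in the current build step."""
    report: Dict[str, Any] = {
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
        "TORCH_CUDA_ARCH_LIST": os.environ.get("TORCH_CUDA_ARCH_LIST"),
    }
    try:
        import torch
    except ImportError:
        # Without PyTorch, report everything as unavailable.
        return report
    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


def require_cuda(report: Dict[str, Any]) -> None:
    """Raise so the RUN step fails early instead of inside mmcv's setup.py."""
    if not report["cuda_available"]:
        raise RuntimeError(f"CUDA is not visible to PyTorch in this build step: {report}")


report = cuda_build_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Appending `require_cuda(report)` turns the check into a hard failure: the uncaught exception gives a non-zero exit code, which aborts `docker build` at that step.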

Apr 22 '22 12:04 AllentDan
