
[BUG]: Build from source failed

Open fwd4 opened this issue 2 years ago • 2 comments

🐛 Describe the bug

I want to build the entire project from source, but it fails at pip install . — the error seems to be related to the PyTorch headers.

I'm using the NVIDIA docker image nvcr.io/nvidia/pytorch:23.07-py3, so it should be easy to reproduce:

  • git clone
  • cd
  • pip install .
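Spelled out, the three steps look like this (the repository URL is taken from the build log further down this thread; no GPU is needed to hit the compile error):

```shell
# Start the NGC PyTorch container mentioned above
docker run -it --rm nvcr.io/nvidia/pytorch:23.07-py3 bash

# Then, inside the container:
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
pip install .
```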

[screenshot of the build error]

Environment

No response

fwd4 avatar Sep 28 '23 08:09 fwd4

I met the same error. You need to add #include <thrust/transform_reduce.h> in the cuda_utils.cu file.
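For example, the top of the file would gain one line (a sketch — the exact location of cuda_utils.cu varies across ColossalAI versions):

```cpp
// cuda_utils.cu — add the missing Thrust header near the top of the file.
// Recent CUDA toolkits (such as the one in the 23.07 NGC image) ship
// Thrust 2.x, which no longer pulls this header in transitively, so
// thrust::transform_reduce must be included explicitly.
#include <thrust/transform_reduce.h>
```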

imgaojun avatar Sep 28 '23 15:09 imgaojun

amd00@MZ32-00:~/yk_repo/ColossalAI$ sudo docker build -t colossalai ./docker
[+] Building 12.6s (10/12)                                               docker:default
 => [internal] load build definition from Dockerfile                               0.0s
 => => transferring dockerfile: 1.91kB                                             0.0s
 => [internal] load .dockerignore                                                  0.0s
 => => transferring context: 2B                                                    0.0s
 => [internal] load metadata for docker.io/hpcaitech/cuda-conda:11.3               4.7s
 => [1/9] FROM docker.io/hpcaitech/cuda-conda:11.3@sha256:8354717606e7be53824ff663ab3d4d0f99473f92896de00131d1e6a9a3bbd21d  0.0s
 => CACHED [2/9] RUN mkdir ~/.ssh && printf "Host * \n ForwardAgent yes\nHost *\n StrictHostKeyChecking no" > ~/.ssh/config && ssh-ke  0.0s
 => CACHED [3/9] RUN apt-get update && apt-get install -y infiniband-diags perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-  0.0s
 => CACHED [4/9] RUN conda install -y pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch  0.0s
 => CACHED [5/9] RUN apt-get update && apt-get install -y --no-install-recommends ninja-build && apt-get clean && rm -rf /var/lib/apt/l  0.0s
 => CACHED [6/9] RUN git clone https://github.com/NVIDIA/apex && cd apex && git checkout 91fcaa && pip install packaging && pip ins  0.0s
 => ERROR [7/9] RUN git clone -b main https://github.com/hpcaitech/ColossalAI.git && cd ./ColossalAI && CUDA_EXT=1 pip install -v --no-cach  7.8s

[7/9] RUN git clone -b main https://github.com/hpcaitech/ColossalAI.git && cd ./ColossalAI && CUDA_EXT=1 pip install -v --no-cache-dir .:
0.264 Cloning into 'ColossalAI'...
4.112 Using pip 21.2.4 from /opt/conda/lib/python3.9/site-packages/pip (python 3.9)
4.194 Processing /workspace/ColossalAI
4.194 DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
4.194 pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
4.440 Running command python setup.py egg_info
5.440 /tmp/pip-req-build-qn46yarw/op_builder/utils.py:163: UserWarning:
5.440 [extension] PyTorch did not find available GPUs on this system.
5.440 If your intention is to cross-compile, this is not an error.
5.440 By default, Colossal-AI will cross-compile for
5.440 1. Pascal (compute capabilities 6.0, 6.1, 6.2),
5.440 2. Volta (compute capability 7.0),
5.440 3. Turing (compute capability 7.5),
5.440 4. Ampere (compute capability 8.0, 8.6) if the CUDA version is >= 11.0
5.440
5.440 If you wish to cross-compile for a single specific architecture,
5.440 export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
5.440   warnings.warn(
5.461 Traceback (most recent call last):
5.461   File "<string>", line 1, in <module>
5.461   File "/tmp/pip-req-build-qn46yarw/setup.py", line 136, in <module>
5.461     ext_modules.append(builder_cls().builder())
5.461   File "/tmp/pip-req-build-qn46yarw/op_builder/builder.py", line 234, in builder
5.461     "nvcc": self.strip_empty_entries(self.nvcc_flags()),
5.461   File "/tmp/pip-req-build-qn46yarw/op_builder/fused_optim.py", line 36, in nvcc_flags
5.461     extra_cuda_flags.extend(get_cuda_cc_flag())
5.461   File "/tmp/pip-req-build-qn46yarw/op_builder/utils.py", line 207, in get_cuda_cc_flag
5.461     max_arch = "".join(str(i) for i in torch.cuda.get_device_capability())
5.461   File "/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 345, in get_device_capability
5.462     prop = get_device_properties(device)
5.462   File "/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 359, in get_device_properties
5.462     _lazy_init()  # will define _get_device_properties
5.462   File "/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
5.462     torch._C._cuda_init()
5.462 RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
5.462 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
5.723 WARNING: Discarding file:///workspace/ColossalAI. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
5.723 ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
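The traceback shows why step [7/9] fails: with CUDA_EXT=1, setup.py calls torch.cuda.get_device_capability(), but docker build runs with no GPU or driver visible. Following the UserWarning earlier in the log, exporting TORCH_CUDA_ARCH_LIST before the install may let the build cross-compile instead of probing a GPU — a sketch of the Dockerfile step, where the "8.0" architecture is an assumption (match it to your target GPU):

```dockerfile
# Cross-compile for a fixed compute capability instead of probing a GPU
RUN git clone -b main https://github.com/hpcaitech/ColossalAI.git && \
    cd ./ColossalAI && \
    TORCH_CUDA_ARCH_LIST="8.0" CUDA_EXT=1 pip install -v --no-cache-dir .
```

If the build still insists on probing the device, the fallback is to build on a host where the NVIDIA driver is visible during docker build (e.g. with the NVIDIA container runtime set as the default runtime).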

SeekPoint avatar Dec 07 '23 12:12 SeekPoint