ColossalAI
[BUG]: PyTorch is not found while CUDA_EXT=1. You need to install PyTorch first in order to build CUDA extensions
🐛 Describe the bug
I failed to install from source with "CUDA_EXT=1 pip install .", and the error message is:
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
Traceback (most recent call last):
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 351, in <module>
main()
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 333, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-1q9jna27/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-1q9jna27/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 320, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-1q9jna27/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 484, in run_setup
super(_BuildMetaLegacyBackend,
File "/tmp/pip-build-env-1q9jna27/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 335, in run_setup
exec(code, locals())
File "<string>", line 121, in <module>
File "<string>", line 38, in environment_check_for_cuda_extension_build
ModuleNotFoundError: [extension] PyTorch is not found while CUDA_EXT=1. You need to install PyTorch first in order to build CUDA extensions
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
However, I am certain that PyTorch is already installed. PyTorch version: 1.10.1, CUDA toolkit version: 10.2.
PS: I have already tried different versions of PyTorch (1.8, 1.10, 1.12), but I still get the same problem.
Environment
Python: 3.8.13, PyTorch version: 1.10.1, CUDA toolkit version: 10.2
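(A quick sanity check, not from the thread, that some users run in this situation to rule out an interpreter mix-up, e.g. pip belonging to a different interpreter than the active conda environment:)

```python
# Run in the same environment where the install fails: confirm which
# interpreter is active and whether torch is importable from it.
import importlib.util
import sys

print("interpreter:", sys.executable)
found = importlib.util.find_spec("torch") is not None
print("torch found:", found)
```

If this prints "torch found: True" yet the build still reports PyTorch missing, the problem is likely in the build process itself rather than the environment.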
Did you install PyTorch with Conda?
Yes, I installed it with "conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch".
Ok, that is strange. Let me try to reproduce this. Meanwhile, can you provide the output of pip list and try again with CUDA_EXT=1 python -m pip install colossalai?
I do suspect that this is caused by PyTorch 1.10; perhaps you may want to try torch 1.11 and above as well.
pip list results:
fsspec 2023.3.0
greenlet 2.0.2
huggingface-hub 0.12.1
identify 2.5.18
idna 3.4
invoke 2.0.0
langchain 0.0.101
loralib 0.1.1
markdown-it-py 2.2.0
marshmallow 3.19.0
marshmallow-enum 1.5.1
mdurl 0.1.2
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
multidict 6.0.4
multiprocess 0.70.14
mypy-extensions 1.0.0
ninja 1.11.1
nodeenv 1.7.0
numpy 1.22.3
packaging 23.0
pandas 1.5.3
paramiko 3.0.0
Pillow 9.0.1
pip 22.3.1
platformdirs 3.1.0
pre-commit 3.1.1
psutil 5.9.4
pyarrow 11.0.0
pycparser 2.21
pydantic 1.10.5
Pygments 2.14.0
PyNaCl 1.5.0
pyOpenSSL 23.0.0
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2022.7.1
PyYAML 6.0
regex 2022.10.31
requests 2.28.1
responses 0.18.0
rich 13.3.2
setuptools 65.6.3
six 1.16.0
SQLAlchemy 1.4.46
tenacity 8.2.2
tokenizers 0.13.2
torch 1.10.1
torchaudio 0.10.1
torchvision 0.11.2
tqdm 4.65.0
transformers 4.26.1
typing_extensions 4.4.0
typing-inspect 0.8.0
urllib3 1.26.14
virtualenv 20.20.0
wheel 0.38.4
xxhash 3.2.0
yarl 1.8.2
conda list results:
# Name Version Build Channel
_libgcc_mutex 0.1 main
aiohttp 3.8.4 <pip>
aiosignal 1.3.1 <pip>
async-timeout 4.0.2 <pip>
attrs 22.2.0 <pip>
bcrypt 4.0.1 <pip>
blas 1.0 mkl
brotlipy 0.7.0 py38h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.01.10 h06a4308_0
certifi 2022.12.7 py38h06a4308_0
cffi 1.15.0 py38hd667e15_1
cfgv 3.3.1 <pip>
charset-normalizer 2.0.4 pyhd3eb1b0_0
chatgpt 0.1.0 <pip>
click 8.1.3 <pip>
colossalai 0.2.5 <pip>
contexttimer 0.3.3 <pip>
cryptography 39.0.1 py38h9ce1e76_0
cudatoolkit 10.2.89 hfd86e86_1
dataclasses-json 0.5.7 <pip>
datasets 2.10.1 <pip>
dill 0.3.6 <pip>
distlib 0.3.6 <pip>
fabric 3.0.0 <pip>
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.9.0 <pip>
flit-core 3.6.0 pyhd3eb1b0_0
freetype 2.11.0 h70c0345_0
frozenlist 1.3.3 <pip>
fsspec 2023.3.0 <pip>
giflib 5.2.1 h7b6447c_0
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
greenlet 2.0.2 <pip>
huggingface-hub 0.12.1 <pip>
identify 2.5.18 <pip>
idna 3.4 py38h06a4308_0
intel-openmp 2021.4.0 h06a4308_3561
invoke 2.0.0 <pip>
jpeg 9b h024ee3a_2
lame 3.100 h7b6447c_0
langchain 0.0.101 <pip>
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libiconv 1.16 h7f8727e_2
libidn2 2.3.2 h7f8727e_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtasn1 4.16.0 h27cfd23_0
libtiff 4.2.0 h85742a9_0
libunistring 0.9.10 h27cfd23_0
libuv 1.40.0 h7b6447c_0
libwebp 1.2.0 h89dd481_0
libwebp-base 1.2.0 h27cfd23_0
loralib 0.1.1 <pip>
lz4-c 1.9.3 h295c915_1
markdown-it-py 2.2.0 <pip>
marshmallow 3.19.0 <pip>
marshmallow-enum 1.5.1 <pip>
mdurl 0.1.2 <pip>
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py38h7f8727e_0
mkl_fft 1.3.1 py38hd3c417c_0
mkl_random 1.2.2 py38h51133e4_0
multidict 6.0.4 <pip>
multiprocess 0.70.14 <pip>
mypy-extensions 1.0.0 <pip>
ncurses 6.3 h7f8727e_2
nettle 3.7.3 hbbd107a_1
ninja 1.11.1 <pip>
ninja 1.10.2 h06a4308_5
ninja-base 1.10.2 hd09550d_5
nodeenv 1.7.0 <pip>
numpy 1.22.3 py38he7a7128_0
numpy-base 1.22.3 py38hf524024_0
openh264 2.1.1 h4ff587b_0
openssl 1.1.1t h7f8727e_0
packaging 23.0 <pip>
pandas 1.5.3 <pip>
paramiko 3.0.0 <pip>
pillow 9.0.1 py38h22f2fdc_0
pip 22.3.1 py38h06a4308_0
platformdirs 3.1.0 <pip>
pre-commit 3.1.1 <pip>
psutil 5.9.4 <pip>
pyarrow 11.0.0 <pip>
pycparser 2.21 pyhd3eb1b0_0
pydantic 1.10.5 <pip>
Pygments 2.14.0 <pip>
PyNaCl 1.5.0 <pip>
pyopenssl 23.0.0 py38h06a4308_0
pysocks 1.7.1 py38h06a4308_0
python 3.8.13 h12debd9_0
python-dateutil 2.8.2 <pip>
pytorch 1.10.1 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2022.7.1 <pip>
PyYAML 6.0 <pip>
readline 8.1.2 h7f8727e_1
regex 2022.10.31 <pip>
requests 2.28.1 py38h06a4308_0
responses 0.18.0 <pip>
rich 13.3.2 <pip>
setuptools 65.6.3 py38h06a4308_0
six 1.16.0 pyhd3eb1b0_1
SQLAlchemy 1.4.46 <pip>
sqlite 3.38.5 hc218d9a_0
tenacity 8.2.2 <pip>
tk 8.6.12 h1ccaba5_0
tokenizers 0.13.2 <pip>
torchaudio 0.10.1 py38_cu102 pytorch
torchvision 0.11.2 py38_cu102 pytorch
tqdm 4.65.0 <pip>
transformers 4.26.1 <pip>
typing-inspect 0.8.0 <pip>
typing_extensions 4.4.0 py38h06a4308_0
urllib3 1.26.14 py38h06a4308_0
virtualenv 20.20.0 <pip>
wheel 0.38.4 py38h06a4308_0
xxhash 3.2.0 <pip>
xz 5.2.5 h7f8727e_1
yarl 1.8.2 <pip>
zlib 1.2.12 h7f8727e_2
zstd 1.4.9 haebb681_0
Thanks, let me get back to you once I reach a conclusion.
CUDA_EXT=1 python -m pip install colossalai reports no error, but I still can't import colossalai:
>>> import colossalai
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/__init__.py", line 1, in <module>
from .initialize import (
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/initialize.py", line 18, in <module>
from colossalai.amp import AMP_TYPE, convert_to_amp
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/amp/__init__.py", line 9, in <module>
from .torch_amp import convert_to_torch_amp
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/amp/torch_amp/__init__.py", line 9, in <module>
from .torch_amp import TorchAMPLoss, TorchAMPModel, TorchAMPOptimizer
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/amp/torch_amp/torch_amp.py", line 10, in <module>
from colossalai.nn.optimizer import ColossalaiOptimizer
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/__init__.py", line 1, in <module>
from ._ops import *
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/_ops/__init__.py", line 1, in <module>
from .addmm import colo_addmm
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/_ops/addmm.py", line 5, in <module>
from ._utils import GeneralTensor, Number, convert_to_colo_tensor
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/_ops/_utils.py", line 8, in <module>
from colossalai.nn.layer.utils import divide
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/__init__.py", line 1, in <module>
from .colossalai_layer import *
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/__init__.py", line 2, in <module>
from .dropout import Dropout
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/dropout.py", line 5, in <module>
from ..parallel_1d import *
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/parallel_1d/__init__.py", line 1, in <module>
from .layers import (Classifier1D, Dropout1D, Embedding1D, LayerNorm1D, Linear1D, Linear1D_Col, Linear1D_Row,
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/parallel_1d/layers.py", line 17, in <module>
from colossalai.kernel import LayerNorm
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/kernel/__init__.py", line 1, in <module>
from .cuda_native import FusedScaleMaskSoftmax, LayerNorm, MultiHeadAttention
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/kernel/cuda_native/__init__.py", line 1, in <module>
from .layer_norm import MixedFusedLayerNorm as LayerNorm
File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 12, in <module>
from colossalai.kernel.op_builder.layernorm import LayerNormBuilder
ModuleNotFoundError: No module named 'colossalai.kernel.op_builder'
I found that the root cause of the pytorch-not-found failure is related to the introduction of pyproject.toml in #2977. It seems that there is some conflict between pyproject.toml and setup.py. I am trying to fix this bug.
As for CUDA_EXT=1 python -m pip install colossalai, it works on my platform, so I am not sure whether there is an environment issue that we overlooked. May I know which operating system you are running on? Specifically, it would be appreciated if you could provide the output of cat /etc/os-release. Then I can get a Docker image of your OS to try to reproduce this bug. :)
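(For context: the failure mode in the original traceback is consistent with PEP 517 build isolation. When a pyproject.toml is present, pip builds the package in a fresh, isolated environment containing only the declared build requirements, so a torch installed in the user's conda environment is invisible to setup.py. A minimal sketch of that kind of guard; the names are illustrative, not the actual ColossalAI code:)

```python
# Hypothetical sketch of a setup.py-style guard that runs when CUDA_EXT=1.
# Under PEP 517 build isolation, this check executes in a throwaway build
# environment, so it can fail even though torch is installed for the user.
import importlib.util
import os

def environment_check_for_cuda_extension_build():
    if os.environ.get("CUDA_EXT") == "1":
        if importlib.util.find_spec("torch") is None:
            raise ModuleNotFoundError(
                "[extension] PyTorch is not found while CUDA_EXT=1. "
                "You need to install PyTorch first in order to build CUDA extensions"
            )
```

A common workaround while a packaging fix lands is to disable build isolation so the build can see the already-installed torch, e.g. CUDA_EXT=1 pip install . --no-build-isolation.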
I have set up a PR to fix this issue. Please see #3022 .
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
I can successfully install now, but there is still an error, ModuleNotFoundError: No module named 'colossalai.kernel.op_builder', when I try to import colossalai.
I have tried different versions of PyTorch (1.9, 1.10, 1.11, 1.12) and CUDA (10.2 and 11.0), but the error always exists.
I also found a similar problem in previous issues such as #2771, but still can't fix it.
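(A hedged diagnostic, not from the thread: an error like this often means the installed wheel simply does not ship the subpackage the import is failing on. One can check which level of the package hierarchy is actually present in site-packages:)

```python
# Check whether the installed package actually ships each level of the
# failing import path. A wheel that omits colossalai.kernel.op_builder
# would produce exactly this kind of ModuleNotFoundError.
import importlib.util

results = {}
for name in ("colossalai", "colossalai.kernel", "colossalai.kernel.op_builder"):
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # find_spec raises if a parent package is itself missing
        spec = None
    results[name] = spec is not None
    print(name, "->", "present" if spec else "MISSING")
```

If only the op_builder entry is MISSING, the likely cause is the package list in the build configuration rather than the user's environment.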
Hi @c-box, we have updated a lot since then. Please check the latest code: https://github.com/hpcaitech/ColossalAI#Installation. If you have further questions, please open a new issue and provide details, because everyone's issue details may be different. This issue was closed due to inactivity. Thanks.