
[BUG]: PyTorch is not found while CUDA_EXT=1. You need to install PyTorch first in order to build CUDA extensions

Open c-box opened this issue 1 year ago • 12 comments

🐛 Describe the bug

Installing from source with "CUDA_EXT=1 pip install ." fails with the following error message:

Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      Traceback (most recent call last):
        File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 351, in <module>
          main()
        File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 333, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-1q9jna27/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-1q9jna27/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 320, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-1q9jna27/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 484, in run_setup
          super(_BuildMetaLegacyBackend,
        File "/tmp/pip-build-env-1q9jna27/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 335, in run_setup
          exec(code, locals())
        File "<string>", line 121, in <module>
        File "<string>", line 38, in environment_check_for_cuda_extension_build
      ModuleNotFoundError: [extension] PyTorch is not found while CUDA_EXT=1. You need to install PyTorch first in order to build CUDA extensions
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

However, I am sure that PyTorch is already installed (PyTorch version: 1.10.1, CUDA toolkit version: 10.2).

PS: I have already tried different versions of PyTorch (1.8, 1.10, 1.12), but I still get the same problem.
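A quick way to double-check a claim like this is to ask the interpreter itself whether it can locate torch. The sketch below is only an illustration (the helper name describe_module_visibility is made up here); it reports which Python executable is running and whether a module can be found on its import path, without importing it.

```python
import importlib.util
import sys


def describe_module_visibility(module_name="torch"):
    """Report whether `module_name` can be located by the running interpreter."""
    # find_spec on a top-level name searches sys.path without importing the module.
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "module": module_name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in describe_module_visibility().items():
        print(f"{key}: {value}")
```

Running it with the same python that backs "python -m pip" shows whether torch is visible to that interpreter. It says nothing about any temporary build environment that pip may create, which is a separate environment.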

Environment

Python: 3.8.13
PyTorch version: 1.10.1
CUDA toolkit version: 10.2

c-box avatar Mar 06 '23 09:03 c-box

Did you install PyTorch with Conda?

FrankLeeeee avatar Mar 06 '23 09:03 FrankLeeeee

> Did you install PyTorch with Conda?

Yes, I installed it with "conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch"

c-box avatar Mar 06 '23 09:03 c-box

> > Did you install PyTorch with Conda?
>
> Yes, I installed it with "conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch"

OK, that is strange. Let me try to reproduce this. Meanwhile, can you provide the output of pip list and try again with CUDA_EXT=1 python -m pip install colossalai?

FrankLeeeee avatar Mar 06 '23 09:03 FrankLeeeee

I suspect that this is caused by PyTorch 1.10, so you may want to try torch 1.11 and above as well.

FrankLeeeee avatar Mar 06 '23 09:03 FrankLeeeee

> OK, that is strange. Let me try to reproduce this. Meanwhile, can you provide the output of pip list and try again with CUDA_EXT=1 python -m pip install colossalai?

pip list results:

fsspec             2023.3.0
greenlet           2.0.2
huggingface-hub    0.12.1
identify           2.5.18
idna               3.4
invoke             2.0.0
langchain          0.0.101
loralib            0.1.1
markdown-it-py     2.2.0
marshmallow        3.19.0
marshmallow-enum   1.5.1
mdurl              0.1.2
mkl-fft            1.3.1
mkl-random         1.2.2
mkl-service        2.4.0
multidict          6.0.4
multiprocess       0.70.14
mypy-extensions    1.0.0
ninja              1.11.1
nodeenv            1.7.0
numpy              1.22.3
packaging          23.0
pandas             1.5.3
paramiko           3.0.0
Pillow             9.0.1
pip                22.3.1
platformdirs       3.1.0
pre-commit         3.1.1
psutil             5.9.4
pyarrow            11.0.0
pycparser          2.21
pydantic           1.10.5
Pygments           2.14.0
PyNaCl             1.5.0
pyOpenSSL          23.0.0
PySocks            1.7.1
python-dateutil    2.8.2
pytz               2022.7.1
PyYAML             6.0
regex              2022.10.31
requests           2.28.1
responses          0.18.0
rich               13.3.2
setuptools         65.6.3
six                1.16.0
SQLAlchemy         1.4.46
tenacity           8.2.2
tokenizers         0.13.2
torch              1.10.1
torchaudio         0.10.1
torchvision        0.11.2
tqdm               4.65.0
transformers       4.26.1
typing_extensions  4.4.0
typing-inspect     0.8.0
urllib3            1.26.14
virtualenv         20.20.0
wheel              0.38.4
xxhash             3.2.0
yarl               1.8.2

conda list results:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
aiohttp                   3.8.4                     <pip>
aiosignal                 1.3.1                     <pip>
async-timeout             4.0.2                     <pip>
attrs                     22.2.0                    <pip>
bcrypt                    4.0.1                     <pip>
blas                      1.0                         mkl
brotlipy                  0.7.0           py38h27cfd23_1003
bzip2                     1.0.8                h7b6447c_0
ca-certificates           2023.01.10           h06a4308_0
certifi                   2022.12.7        py38h06a4308_0
cffi                      1.15.0           py38hd667e15_1
cfgv                      3.3.1                     <pip>
charset-normalizer        2.0.4              pyhd3eb1b0_0
chatgpt                   0.1.0                     <pip>
click                     8.1.3                     <pip>
colossalai                0.2.5                     <pip>
contexttimer              0.3.3                     <pip>
cryptography              39.0.1           py38h9ce1e76_0
cudatoolkit               10.2.89              hfd86e86_1
dataclasses-json          0.5.7                     <pip>
datasets                  2.10.1                    <pip>
dill                      0.3.6                     <pip>
distlib                   0.3.6                     <pip>
fabric                    3.0.0                     <pip>
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.9.0                     <pip>
flit-core                 3.6.0              pyhd3eb1b0_0
freetype                  2.11.0               h70c0345_0
frozenlist                1.3.3                     <pip>
fsspec                    2023.3.0                  <pip>
giflib                    5.2.1                h7b6447c_0
gmp                       6.2.1                h295c915_3
gnutls                    3.6.15               he1e5248_0
greenlet                  2.0.2                     <pip>
huggingface-hub           0.12.1                    <pip>
identify                  2.5.18                    <pip>
idna                      3.4              py38h06a4308_0
intel-openmp              2021.4.0          h06a4308_3561
invoke                    2.0.0                     <pip>
jpeg                      9b                   h024ee3a_2
lame                      3.100                h7b6447c_0
langchain                 0.0.101                   <pip>
lcms2                     2.12                 h3be6417_0
ld_impl_linux-64          2.38                 h1181459_1
libffi                    3.3                  he6710b0_2
libgcc-ng                 9.1.0                hdf63c60_0
libiconv                  1.16                 h7f8727e_2
libidn2                   2.3.2                h7f8727e_0
libpng                    1.6.37               hbc83047_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtasn1                  4.16.0               h27cfd23_0
libtiff                   4.2.0                h85742a9_0
libunistring              0.9.10               h27cfd23_0
libuv                     1.40.0               h7b6447c_0
libwebp                   1.2.0                h89dd481_0
libwebp-base              1.2.0                h27cfd23_0
loralib                   0.1.1                     <pip>
lz4-c                     1.9.3                h295c915_1
markdown-it-py            2.2.0                     <pip>
marshmallow               3.19.0                    <pip>
marshmallow-enum          1.5.1                     <pip>
mdurl                     0.1.2                     <pip>
mkl                       2021.4.0           h06a4308_640
mkl-service               2.4.0            py38h7f8727e_0
mkl_fft                   1.3.1            py38hd3c417c_0
mkl_random                1.2.2            py38h51133e4_0
multidict                 6.0.4                     <pip>
multiprocess              0.70.14                   <pip>
mypy-extensions           1.0.0                     <pip>
ncurses                   6.3                  h7f8727e_2
nettle                    3.7.3                hbbd107a_1
ninja                     1.11.1                    <pip>
ninja                     1.10.2               h06a4308_5
ninja-base                1.10.2               hd09550d_5
nodeenv                   1.7.0                     <pip>
numpy                     1.22.3           py38he7a7128_0
numpy-base                1.22.3           py38hf524024_0
openh264                  2.1.1                h4ff587b_0
openssl                   1.1.1t               h7f8727e_0
packaging                 23.0                      <pip>
pandas                    1.5.3                     <pip>
paramiko                  3.0.0                     <pip>
pillow                    9.0.1            py38h22f2fdc_0
pip                       22.3.1           py38h06a4308_0
platformdirs              3.1.0                     <pip>
pre-commit                3.1.1                     <pip>
psutil                    5.9.4                     <pip>
pyarrow                   11.0.0                    <pip>
pycparser                 2.21               pyhd3eb1b0_0
pydantic                  1.10.5                    <pip>
Pygments                  2.14.0                    <pip>
PyNaCl                    1.5.0                     <pip>
pyopenssl                 23.0.0           py38h06a4308_0
pysocks                   1.7.1            py38h06a4308_0
python                    3.8.13               h12debd9_0
python-dateutil           2.8.2                     <pip>
pytorch                   1.10.1          py3.8_cuda10.2_cudnn7.6.5_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2022.7.1                  <pip>
PyYAML                    6.0                       <pip>
readline                  8.1.2                h7f8727e_1
regex                     2022.10.31                <pip>
requests                  2.28.1           py38h06a4308_0
responses                 0.18.0                    <pip>
rich                      13.3.2                    <pip>
setuptools                65.6.3           py38h06a4308_0
six                       1.16.0             pyhd3eb1b0_1
SQLAlchemy                1.4.46                    <pip>
sqlite                    3.38.5               hc218d9a_0
tenacity                  8.2.2                     <pip>
tk                        8.6.12               h1ccaba5_0
tokenizers                0.13.2                    <pip>
torchaudio                0.10.1               py38_cu102    pytorch
torchvision               0.11.2               py38_cu102    pytorch
tqdm                      4.65.0                    <pip>
transformers              4.26.1                    <pip>
typing-inspect            0.8.0                     <pip>
typing_extensions         4.4.0            py38h06a4308_0
urllib3                   1.26.14          py38h06a4308_0
virtualenv                20.20.0                   <pip>
wheel                     0.38.4           py38h06a4308_0
xxhash                    3.2.0                     <pip>
xz                        5.2.5                h7f8727e_1
yarl                      1.8.2                     <pip>
zlib                      1.2.12               h7f8727e_2
zstd                      1.4.9                haebb681_0

c-box avatar Mar 06 '23 09:03 c-box

Thanks, let me get back to you once I reach a conclusion.

FrankLeeeee avatar Mar 06 '23 09:03 FrankLeeeee

> OK, that is strange. Let me try to reproduce this. Meanwhile, can you provide the output of pip list and try again with CUDA_EXT=1 python -m pip install colossalai?

CUDA_EXT=1 python -m pip install colossalai reports no error, but I still can't import colossalai:

>>> import colossalai
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/__init__.py", line 1, in <module>
    from .initialize import (
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/initialize.py", line 18, in <module>
    from colossalai.amp import AMP_TYPE, convert_to_amp
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/amp/__init__.py", line 9, in <module>
    from .torch_amp import convert_to_torch_amp
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/amp/torch_amp/__init__.py", line 9, in <module>
    from .torch_amp import TorchAMPLoss, TorchAMPModel, TorchAMPOptimizer
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/amp/torch_amp/torch_amp.py", line 10, in <module>
    from colossalai.nn.optimizer import ColossalaiOptimizer
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/__init__.py", line 1, in <module>
    from ._ops import *
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/_ops/__init__.py", line 1, in <module>
    from .addmm import colo_addmm
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/_ops/addmm.py", line 5, in <module>
    from ._utils import GeneralTensor, Number, convert_to_colo_tensor
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/_ops/_utils.py", line 8, in <module>
    from colossalai.nn.layer.utils import divide
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/__init__.py", line 1, in <module>
    from .colossalai_layer import *
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/__init__.py", line 2, in <module>
    from .dropout import Dropout
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/dropout.py", line 5, in <module>
    from ..parallel_1d import *
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/parallel_1d/__init__.py", line 1, in <module>
    from .layers import (Classifier1D, Dropout1D, Embedding1D, LayerNorm1D, Linear1D, Linear1D_Col, Linear1D_Row,
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/nn/layer/parallel_1d/layers.py", line 17, in <module>
    from colossalai.kernel import LayerNorm
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/kernel/__init__.py", line 1, in <module>
    from .cuda_native import FusedScaleMaskSoftmax, LayerNorm, MultiHeadAttention
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/kernel/cuda_native/__init__.py", line 1, in <module>
    from .layer_norm import MixedFusedLayerNorm as LayerNorm
  File "/data00/cbx/anaconda3/envs/colossal/lib/python3.8/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 12, in <module>
    from colossalai.kernel.op_builder.layernorm import LayerNormBuilder
ModuleNotFoundError: No module named 'colossalai.kernel.op_builder'

c-box avatar Mar 06 '23 09:03 c-box

I found that the root cause of the PyTorch-not-found failure is related to the introduction of pyproject.toml in #2977. It seems that there is a conflict between pyproject.toml and setup.py. I am working on a fix for this bug.
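For context, the traceback in the original report runs setup.py through setuptools located under /tmp/pip-build-env-1q9jna27/overlay, which is the temporary environment pip creates for isolated builds when a pyproject.toml is present. A likely explanation, stated here as an assumption rather than a confirmed diagnosis, is that torch is not installed in that temporary environment, so a check inside setup.py cannot see it even though torch exists in the conda environment. A minimal sketch of such a check (check_torch_for_cuda_ext is a hypothetical name, and the CUDA_EXT == "1" test is assumed, not ColossalAI's actual code):

```python
import importlib.util
import os


def check_torch_for_cuda_ext(environ=None, find_spec=importlib.util.find_spec):
    """Return True if CUDA extensions are requested and torch is locatable.

    Returns False when CUDA_EXT is not set to "1" (nothing to check).
    Raises ModuleNotFoundError when CUDA_EXT=1 but torch cannot be found
    by the interpreter that is executing the build.
    """
    environ = os.environ if environ is None else environ
    if environ.get("CUDA_EXT") != "1":
        return False
    # In an isolated build environment only the declared build requirements
    # are installed, so this lookup fails unless torch is one of them.
    if find_spec("torch") is None:
        raise ModuleNotFoundError(
            "PyTorch is not found while CUDA_EXT=1. "
            "You need to install PyTorch first in order to build CUDA extensions"
        )
    return True
```

The find_spec parameter exists only so the behaviour can be exercised without torch installed; a real setup.py would more likely attempt a plain import.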

FrankLeeeee avatar Mar 06 '23 15:03 FrankLeeeee

CUDA_EXT=1 python -m pip install colossalai works on my platform, so I am not sure whether there is an environment issue that we have overlooked. May I know which operating system you are running? Specifically, it would be appreciated if you could provide the output of cat /etc/os-release. Then I can use a Docker image of your OS to try to reproduce this bug. :)

FrankLeeeee avatar Mar 06 '23 15:03 FrankLeeeee

> I found that the root cause of the PyTorch-not-found failure is related to the introduction of pyproject.toml in #2977. It seems that there is a conflict between pyproject.toml and setup.py. I am working on a fix for this bug.

I have opened a PR to fix this issue. Please see #3022.
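For anyone building from a checkout that does not yet include a fix, one workaround commonly used for this class of problem is pip's --no-build-isolation flag, which makes pip build in the current environment instead of a temporary one. Whether it resolves this particular report is an assumption; the sketch below only assembles the command and environment (build_install_command is a hypothetical helper) and leaves the actual run commented out.

```python
import os
import sys


def build_install_command(source_dir="."):
    """Build a pip command and environment for a non-isolated source install."""
    # "python -m pip" guarantees pip runs under the same interpreter as this script.
    cmd = [sys.executable, "-m", "pip", "install", "--no-build-isolation", source_dir]
    # Copy the current environment and request the CUDA extension build.
    env = dict(os.environ, CUDA_EXT="1")
    return cmd, env


cmd, env = build_install_command(".")
print(" ".join(cmd))
# To actually run the install, uncomment the two lines below:
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```

With build isolation disabled, pip does not install build requirements for you, so setuptools, wheel, and torch must already be present in the active environment.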

FrankLeeeee avatar Mar 06 '23 16:03 FrankLeeeee

> CUDA_EXT=1 python -m pip install colossalai works on my platform, so I am not sure whether there is an environment issue that we have overlooked. May I know which operating system you are running? Specifically, it would be appreciated if you could provide the output of cat /etc/os-release. Then I can use a Docker image of your OS to try to reproduce this bug. :)

PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

c-box avatar Mar 07 '23 02:03 c-box

> > I found that the root cause of the PyTorch-not-found failure is related to the introduction of pyproject.toml in #2977. It seems that there is a conflict between pyproject.toml and setup.py. I am working on a fix for this bug.
>
> I have opened a PR to fix this issue. Please see #3022.

I can install successfully now, but importing colossalai still fails with ModuleNotFoundError: No module named 'colossalai.kernel.op_builder'.

I have tried different versions of PyTorch (1.9, 1.10, 1.11, 1.12) and CUDA (10.2 and 11.0); the error occurs in every case.

I also found similar problems in previous issues such as #2771, but I still can't fix it.
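One thing that may be worth checking here is whether the op_builder directory is physically present in the installed copy of the package. This is an assumption about where to look, not a confirmed cause. The sketch below (subpackage_on_disk is a hypothetical helper) locates the top-level package without importing it, so it still works when "import colossalai" itself fails.

```python
import importlib.util
from pathlib import Path


def subpackage_on_disk(top_level, *parts):
    """Return True if <top_level>/<parts...> exists as a directory or .py file.

    Only the top-level package is located (not imported), so a broken
    package __init__ does not prevent the check from running.
    """
    spec = importlib.util.find_spec(top_level)
    if spec is None or not spec.submodule_search_locations:
        return False
    for root in spec.submodule_search_locations:
        candidate = Path(root).joinpath(*parts)
        if candidate.is_dir() or candidate.with_suffix(".py").is_file():
            return True
    return False


if __name__ == "__main__":
    # Prints False if the directory is missing or colossalai is not installed.
    print(subpackage_on_disk("colossalai", "kernel", "op_builder"))
```

If this prints False while colossalai/kernel exists, the installed files do not contain that subpackage, which would point at packaging or a stale install rather than at the PyTorch or CUDA version.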

c-box avatar Mar 07 '23 07:03 c-box

Hi @c-box, we have updated a lot. Please check the latest code (https://github.com/hpcaitech/ColossalAI#Installation). If you have further questions, please open a new issue and provide details, because everyone's issue details may be different. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 27 '23 10:04 binmakeswell