Version problems with the bmtrain package

Open Lufffya opened this issue 1 year ago • 9 comments

Environment: Ubuntu Server 22.04, conda, Python 3.10.0, NVIDIA driver 12.1.0

Attempt 1: pip install -r requirements.txt

Error:

Collecting torch<2.0.0,>=1.10
  Using cached torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
Collecting bmtrain>=0.2.1
  Using cached bmtrain-0.2.2.tar.gz (58 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-uftha_ch/bmtrain_6f467e6c12814fa9a32dccab67860fad/setup.py", line 2, in <module>
          import torch
      ModuleNotFoundError: No module named 'torch'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Attempt 2: (1) manually installed PyTorch 2.0 built for CUDA 11.8; torch.cuda.is_available() returns True. (2) manually installed every package in requirements.txt except torch, for example: pip install bmtrain>=0.2.1

This error appears:

  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [102 lines of output]
      running bdist_wheel
      /root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
        warnings.warn(msg.format('we could not find ninja.'))
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      creating build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/checkpointing.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/init.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/__init__.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/block_layer.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/debug.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/param_init.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/global_var.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/synchronize.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/parameter.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/utils.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/pipe_layer.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/store.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/layer.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      copying bmtrain/wrapper.py -> build/lib.linux-x86_64-cpython-310/bmtrain
      creating build/lib.linux-x86_64-cpython-310/bmtrain/loss
      copying bmtrain/loss/__init__.py -> build/lib.linux-x86_64-cpython-310/bmtrain/loss
      copying bmtrain/loss/cross_entropy.py -> build/lib.linux-x86_64-cpython-310/bmtrain/loss
      creating build/lib.linux-x86_64-cpython-310/bmtrain/optim
      copying bmtrain/optim/__init__.py -> build/lib.linux-x86_64-cpython-310/bmtrain/optim
      copying bmtrain/optim/optim_manager.py -> build/lib.linux-x86_64-cpython-310/bmtrain/optim
      copying bmtrain/optim/adam.py -> build/lib.linux-x86_64-cpython-310/bmtrain/optim
      copying bmtrain/optim/adam_offload.py -> build/lib.linux-x86_64-cpython-310/bmtrain/optim
      creating build/lib.linux-x86_64-cpython-310/bmtrain/benchmark
      copying bmtrain/benchmark/shape.py -> build/lib.linux-x86_64-cpython-310/bmtrain/benchmark
      copying bmtrain/benchmark/__init__.py -> build/lib.linux-x86_64-cpython-310/bmtrain/benchmark
      copying bmtrain/benchmark/send_recv.py -> build/lib.linux-x86_64-cpython-310/bmtrain/benchmark
      copying bmtrain/benchmark/utils.py -> build/lib.linux-x86_64-cpython-310/bmtrain/benchmark
      copying bmtrain/benchmark/all_gather.py -> build/lib.linux-x86_64-cpython-310/bmtrain/benchmark
      copying bmtrain/benchmark/reduce_scatter.py -> build/lib.linux-x86_64-cpython-310/bmtrain/benchmark
      creating build/lib.linux-x86_64-cpython-310/bmtrain/lr_scheduler
      copying bmtrain/lr_scheduler/linear.py -> build/lib.linux-x86_64-cpython-310/bmtrain/lr_scheduler
      copying bmtrain/lr_scheduler/__init__.py -> build/lib.linux-x86_64-cpython-310/bmtrain/lr_scheduler
      copying bmtrain/lr_scheduler/cosine.py -> build/lib.linux-x86_64-cpython-310/bmtrain/lr_scheduler
      copying bmtrain/lr_scheduler/noam.py -> build/lib.linux-x86_64-cpython-310/bmtrain/lr_scheduler
      copying bmtrain/lr_scheduler/warmup.py -> build/lib.linux-x86_64-cpython-310/bmtrain/lr_scheduler
      copying bmtrain/lr_scheduler/no_decay.py -> build/lib.linux-x86_64-cpython-310/bmtrain/lr_scheduler
      copying bmtrain/lr_scheduler/exponential.py -> build/lib.linux-x86_64-cpython-310/bmtrain/lr_scheduler
      creating build/lib.linux-x86_64-cpython-310/bmtrain/distributed
      copying bmtrain/distributed/__init__.py -> build/lib.linux-x86_64-cpython-310/bmtrain/distributed
      copying bmtrain/distributed/ops.py -> build/lib.linux-x86_64-cpython-310/bmtrain/distributed
      creating build/lib.linux-x86_64-cpython-310/bmtrain/inspect
      copying bmtrain/inspect/model.py -> build/lib.linux-x86_64-cpython-310/bmtrain/inspect
      copying bmtrain/inspect/__init__.py -> build/lib.linux-x86_64-cpython-310/bmtrain/inspect
      copying bmtrain/inspect/tensor.py -> build/lib.linux-x86_64-cpython-310/bmtrain/inspect
      copying bmtrain/inspect/format.py -> build/lib.linux-x86_64-cpython-310/bmtrain/inspect
      creating build/lib.linux-x86_64-cpython-310/bmtrain/nccl
      copying bmtrain/nccl/__init__.py -> build/lib.linux-x86_64-cpython-310/bmtrain/nccl
      copying bmtrain/nccl/enums.py -> build/lib.linux-x86_64-cpython-310/bmtrain/nccl
      running build_ext
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-b8c4nobc/bmtrain_56a2d1d4a744452aa5d4e776460000db/setup.py", line 74, in <module>
          setup(
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/__init__.py", line 107, in setup
          return distutils.core.setup(**attrs)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/dist.py", line 1244, in run_command
          super().run_command(command)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 325, in run
          self.run_command("build")
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
          self.distribution.run_command(command)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/dist.py", line 1244, in run_command
          super().run_command(command)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 131, in run
          self.run_command(cmd_name)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
          self.distribution.run_command(command)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/dist.py", line 1244, in run_command
          super().run_command(command)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
          _build_ext.run(self)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
          self.build_extensions()
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 499, in build_extensions
          _check_cuda_version(compiler_name, compiler_version)
        File "/root/anaconda3/envs/CPM-Bee/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 387, in _check_cuda_version
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.8). Please make sure to use the same CUDA versions.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bmtrain

The gist is that the versions are incompatible, so how is bmtrain supposed to be installed?

Lufffya avatar Jun 08 '23 03:06 Lufffya
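
The first traceback above fails at `import torch` on line 2 of bmtrain's setup.py, which suggests torch has to be importable before pip can even prepare bmtrain's metadata. pip prepares metadata for every requirement before installing any of them, so listing torch in the same requirements.txt would not help. A minimal sketch of a pre-install check follows; `missing_build_deps` is a hypothetical helper name, not part of pip or bmtrain, and the idea that torch is the only build-time import is an assumption.

```python
import importlib.util


def missing_build_deps(modules):
    """Return the module names from `modules` that cannot be imported here."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: torch is the only module bmtrain's setup.py needs at build time.
    absent = missing_build_deps(["torch"])
    if absent:
        print("Install these first, then retry bmtrain:", ", ".join(absent))
    else:
        print("torch is importable; setup.py can get past its import.")
```

If the check reports torch as missing, installing torch on its own and then installing the remaining requirements is the order the second attempt already follows.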

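Attempt 3: pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

Still the same error as above. It seems to check the GPU driver version; I have 12.1.0 installed. Switching to 11.7 might fix it, but that is too much trouble, so I gave up.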

Lufffya avatar Jun 08 '23 06:06 Lufffya

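First, this happens because, when Torch compiles a CUDA extension, it requires the installed CUDA Toolkit version to match the CUDA version that PyTorch itself was compiled with. It has nothing to do with the driver. Second, is installing CUDA really that hard?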

LLMChild avatar Jun 08 '23 09:06 LLMChild
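
The point above can be checked before building: compare the CUDA version reported by the local toolkit's `nvcc --version` with `torch.version.cuda`. This is a minimal sketch, assuming the usual `release X.Y` wording in nvcc's output and assuming that a differing major version is what triggers the RuntimeError (as with 12.1 against 11.8 in the log). `parse_nvcc_release` and `cuda_major_matches` are hypothetical helper names.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_major_matches(nvcc_output, torch_cuda):
    """True when the toolkit and the torch build share a CUDA major version."""
    toolkit = parse_nvcc_release(nvcc_output)
    if toolkit is None or not torch_cuda:
        # No toolkit found, or a CPU-only torch build (torch.version.cuda is None).
        return False
    torch_major = int(torch_cuda.split(".")[0])
    return toolkit[0] == torch_major


if __name__ == "__main__":
    # Sample text in the shape nvcc prints; the versions mirror the log above.
    sample = "Cuda compilation tools, release 12.1, V12.1.105"
    print(cuda_major_matches(sample, "11.8"))  # toolkit 12.1 against torch 11.8
    print(cuda_major_matches(sample, "12.1"))  # toolkit 12.1 against torch 12.1
```

In a real environment the first argument would come from running `nvcc --version` and the second from `torch.version.cuda`.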

I'm on Windows with CUDA 11.7.

python setup.py install
running install
D:\ProgramData\anaconda3\envs\cpm\lib\site-packages\setuptools\_distutils\cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` directly.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
    ********************************************************************************

!!
  self.initialize_options()
D:\ProgramData\anaconda3\envs\cpm\lib\site-packages\setuptools\_distutils\cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` and ``easy_install``.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://github.com/pypa/setuptools/issues/917 for details.
    ********************************************************************************

!!
  self.initialize_options()
running bdist_egg
running egg_info
writing bmtrain.egg-info\PKG-INFO
writing dependency_links to bmtrain.egg-info\dependency_links.txt
writing requirements to bmtrain.egg-info\requires.txt
writing top-level names to bmtrain.egg-info\top_level.txt
reading manifest file 'bmtrain.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'bmtrain.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
running build_ext
error: [WinError 2] The system cannot find the file specified.

Installing with build fails too: python -m build -n -x -w

  • Building wheel...
running bdist_wheel
running build
running build_py
running build_ext
error: [WinError 2] The system cannot find the file specified.

skyantao avatar Jun 08 '23 11:06 skyantao

Same question: with cuda=12.0, python=3.7 and torch=1.13, pip install bmtrain just errors out and will not install.

xgsong avatar Jun 08 '23 16:06 xgsong

Same error here, it will not install.

xiaoguaishoubaobao avatar Jun 09 '23 04:06 xiaoguaishoubaobao

Same error here, it will not install.

https://github.com/OpenBMB/CPM-Bee/issues/59#issuecomment-1583979663

xiaoguaishoubaobao avatar Jun 09 '23 05:06 xiaoguaishoubaobao

This is a conda environment that I use:

conda create --prefix $(pwd)/.conda_env pytorch==1.13.1 pytorch-cuda=11.6 libcusolver-dev -c pytorch -c nvidia

menghuu avatar Jun 11 '23 08:06 menghuu

bdist

How about installing bdist first and then installing bmtrain?

guillaumexu avatar Jun 28 '23 03:06 guillaumexu

Windows 10 environment, pip install bmtrain. Error output:

.......
nccl.obj : error LNK2001: unresolved external symbol ncclCommInitRank
nccl.obj : error LNK2001: unresolved external symbol ncclReduce
nccl.obj : error LNK2001: unresolved external symbol ncclRecv
nccl.obj : error LNK2001: unresolved external symbol ncclGroupEnd
nccl.obj : error LNK2001: unresolved external symbol ncclSend
nccl.obj : error LNK2001: unresolved external symbol ncclCommCount
nccl.obj : error LNK2001: unresolved external symbol ncclGetUniqueId
nccl.obj : error LNK2001: unresolved external symbol ncclCommDestroy
nccl.obj : error LNK2001: unresolved external symbol ncclBroadcast
nccl.obj : error LNK2001: unresolved external symbol ncclGroupStart
nccl.obj : error LNK2001: unresolved external symbol ncclCommUserRank
nccl.obj : error LNK2001: unresolved external symbol ncclReduceScatter
nccl.obj : error LNK2001: unresolved external symbol ncclAllGather
nccl.obj : error LNK2001: unresolved external symbol ncclAllReduce
nccl.obj : error LNK2001: unresolved external symbol ncclGetErrorString
build\lib.win-amd64-cpython-38\bmtrain\nccl_C.cp38-win_amd64.pyd : fatal error LNK1120: 15 unresolved externals
error: command 'D:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.36.32532\bin\HostX86\x64\link.exe' failed with exit code 1120
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bmtrain
Running setup.py clean for bmtrain
Failed to build bmtrain
ERROR: Could not build wheels for bmtrain, which is required to install pyproject.toml-based projects

Installing bmtrain keeps failing with this error. I have tried installing both VS2019 and VS2022. How should it be installed? cuda==11.7, torch==1.13

GetUsernametsy avatar Jul 07 '23 04:07 GetUsernametsy
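
Every unresolved symbol in the linker output above is an NCCL function, such as ncclAllReduce and ncclSend, which suggests that no NCCL library was available to link against. To my knowledge NVIDIA distributes NCCL for Linux only, so a source build on Windows may fail at this step whichever Visual Studio version is installed. A minimal sketch of a guard one could run first follows; `can_link_nccl` is a hypothetical name and the Linux-only rule is the assumption just stated.

```python
import platform


def can_link_nccl(system=None):
    """Guess whether an NCCL-dependent extension can be linked on this OS."""
    # Assumption: NCCL binaries are only available for Linux.
    system = system or platform.system()
    return system == "Linux"


if __name__ == "__main__":
    if not can_link_nccl():
        print("NCCL is probably unavailable here; consider building under Linux or WSL.")
```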