[Urgent!!!] Installing GroupedGEMM, the extra dependency recommended for training MoE models, fails

Open BruceYu-Bit opened this issue 3 months ago • 16 comments

I followed the official documentation to install GroupedGEMM, but it fails to build. The error is as follows:

Running setup.py clean for grouped_gemm
Running command python setup.py clean
/root/anaconda3/envs/xtuner/lib/python3.10/site-packages/setuptools/dist.py:759: SetuptoolsDeprecationWarning: License classifiers are deprecated.
!!

      ********************************************************************************
      Please consider removing the following classifiers in favor of a SPDX license expression:

      License :: OSI Approved :: BSD License

      See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
      ********************************************************************************

!!
  self._finalize_license_expression()
running clean
removing 'build/temp.linux-x86_64-cpython-310' (and everything under it)
removing 'build/lib.linux-x86_64-cpython-310' (and everything under it)
'build/bdist.linux-x86_64' does not exist -- can't clean it
'build/scripts-3.10' does not exist -- can't clean it
removing 'build'
Failed to build grouped_gemm

BruceYu-Bit avatar Sep 17 '25 03:09 BruceYu-Bit

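@BruceYu-Bit Please describe how you installed it (if you followed the official documentation, was it `pip install git+https://github.com/InternLM/GroupedGEMM.git@main`?) and share the conda environment you used. You can export it with `conda env export -n xtuner > xtuner.yaml` and post xtuner.yaml here, so that I can try to reproduce the problem on my side.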

CyCle1024 avatar Sep 17 '25 14:09 CyCle1024

@BruceYu-Bit If `pip install git+https://github.com/InternLM/GroupedGEMM.git@main` does not work for you, try downloading the source and installing it with pip:

git clone https://github.com/InternLM/GroupedGEMM
cd GroupedGEMM
pip install -v --no-build-isolation -e .

CyCle1024 avatar Sep 18 '25 06:09 CyCle1024

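Here is how I installed it: I downloaded https://github.com/InternLM/GroupedGEMM.git, downloaded cutlass at the specified commit, and then ran `pip install -v --no-build-isolation -e .`, which produced the error above. I tried Python 3.10 and 3.11 and neither worked. My environment is as follows:

`name: xtunerv1
channels: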

  • defaults
  • https://repo.anaconda.com/pkgs/main
  • https://repo.anaconda.com/pkgs/r
dependencies:
  • _libgcc_mutex=0.1=main
  • _openmp_mutex=5.1=1_gnu
  • bzip2=1.0.8=h5eee18b_6
  • ca-certificates=2025.9.9=h06a4308_0
  • ld_impl_linux-64=2.40=h12ee557_0
  • libffi=3.4.4=h6a678d5_1
  • libgcc-ng=11.2.0=h1234567_1
  • libgomp=11.2.0=h1234567_1
  • libstdcxx-ng=11.2.0=h1234567_1
  • libuuid=1.41.5=h5eee18b_0
  • libxcb=1.17.0=h9b100fa_0
  • libzlib=1.3.1=hb25bd0a_0
  • ncurses=6.5=h7934f7d_0
  • openssl=1.1.1w=h7f8727e_0
  • pip=25.2=pyhc872135_0
  • pthread-stubs=0.3=h0ce48e5_1
  • python=3.11.0=h7a1cb2a_3
  • readline=8.3=hc2a1206_0
  • setuptools=78.1.1=py311h06a4308_0
  • sqlite=3.50.2=hb25bd0a_1
  • tk=8.6.15=h54e0aa7_0
  • wheel=0.45.1=py311h06a4308_0
  • xorg-libx11=1.8.12=h9b100fa_1
  • xorg-libxau=1.0.12=h9b100fa_0
  • xorg-libxdmcp=1.1.5=h9b100fa_0
  • xorg-xorgproto=2024.1=h5eee18b_1
  • xz=5.6.4=h5eee18b_1
  • zlib=1.3.1=hb25bd0a_0
  • pip:
    • absl-py==2.3.1
    • accelerate==1.10.1
    • addict==2.4.0
    • aiohappyeyeballs==2.6.1
    • aiohttp==3.12.15
    • aiohttp-cors==0.8.1
    • aiosignal==1.4.0
    • annotated-types==0.7.0
    • anyio==4.10.0
    • attrs==25.3.0
    • bitsandbytes==0.45.0
    • cachetools==5.5.2
    • certifi==2025.8.3
    • charset-normalizer==3.4.3
    • click==8.2.1
    • colorful==0.5.7
    • contourpy==1.3.3
    • cycler==0.12.1
    • cyclopts==3.24.0
    • datasets==3.6.0
    • dill==0.3.8
    • distlib==0.4.0
    • docstring-parser==0.17.0
    • docutils==0.22
    • einops==0.8.1
    • et-xmlfile==2.0.0
    • fastapi==0.116.2
    • filelock==3.19.1
    • fonttools==4.59.2
    • frozenlist==1.7.0
    • fsspec==2025.3.0
    • google-api-core==2.25.1
    • google-auth==2.40.3
    • googleapis-common-protos==1.70.0
    • grpcio==1.75.0
    • h11==0.16.0
    • hf-xet==1.1.10
    • httpcore==1.0.9
    • httpx==0.28.1
    • huggingface-hub==0.35.0
    • idna==3.10
    • imageio==2.37.0
    • importlib-metadata==8.7.0
    • jinja2==3.1.6
    • jsonschema==4.25.1
    • jsonschema-specifications==2025.9.1
    • kiwisolver==1.4.9
    • lazy-loader==0.4
    • loguru==0.7.3
    • markdown==3.9
    • markdown-it-py==4.0.0
    • markupsafe==3.0.2
    • matplotlib==3.10.6
    • mdurl==0.1.2
    • mmengine==0.11.0rc0
    • mpmath==1.3.0
    • msgpack==1.1.1
    • multidict==6.6.4
    • multiprocess==0.70.16
    • networkx==3.5
    • numpy==2.2.6
    • nvidia-cublas-cu12==12.6.4.1
    • nvidia-cuda-cupti-cu12==12.6.80
    • nvidia-cuda-nvrtc-cu12==12.6.77
    • nvidia-cuda-runtime-cu12==12.6.77
    • nvidia-cudnn-cu12==9.5.1.17
    • nvidia-cufft-cu12==11.3.0.4
    • nvidia-cufile-cu12==1.11.1.6
    • nvidia-curand-cu12==10.3.7.77
    • nvidia-cusolver-cu12==11.7.1.2
    • nvidia-cusparse-cu12==12.5.4.2
    • nvidia-cusparselt-cu12==0.6.3
    • nvidia-nccl-cu12==2.26.2
    • nvidia-nvjitlink-cu12==12.6.85
    • nvidia-nvtx-cu12==12.6.77
    • opencensus==0.11.4
    • opencensus-context==0.1.3
    • opencv-python-headless==4.12.0.88
    • openpyxl==3.1.5
    • opentelemetry-api==1.37.0
    • opentelemetry-exporter-prometheus==0.58b0
    • opentelemetry-proto==1.37.0
    • opentelemetry-sdk==1.37.0
    • opentelemetry-semantic-conventions==0.58b0
    • packaging==25.0
    • pandas==2.3.2
    • peft==0.17.1
    • pillow==11.3.0
    • platformdirs==4.4.0
    • prometheus-client==0.22.1
    • propcache==0.3.2
    • proto-plus==1.26.1
    • protobuf==6.32.1
    • psutil==7.0.0
    • py-spy==0.4.1
    • pyarrow==21.0.0
    • pyasn1==0.6.1
    • pyasn1-modules==0.4.2
    • pydantic==2.11.9
    • pydantic-core==2.33.2
    • pygments==2.19.2
    • pyparsing==3.2.4
    • python-dateutil==2.9.0.post0
    • pytz==2025.2
    • pyyaml==6.0.2
    • ray==2.49.1
    • referencing==0.36.2
    • regex==2025.9.1
    • requests==2.32.5
    • rich==14.1.0
    • rich-rst==1.3.1
    • rpds-py==0.27.1
    • rsa==4.9.1
    • safetensors==0.6.2
    • scikit-image==0.25.2
    • scipy==1.16.2
    • sentencepiece==0.2.1
    • six==1.17.0
    • smart-open==7.3.1
    • sniffio==1.3.1
    • starlette==0.48.0
    • sympy==1.14.0
    • tensorboard==2.20.0
    • tensorboard-data-server==0.7.2
    • termcolor==3.1.0
    • tifffile==2025.9.9
    • tiktoken==0.11.0
    • timm==1.0.19
    • tokenizers==0.22.0
    • torch==2.7.0
    • torchvision==0.22.0
    • tqdm==4.67.1
    • transformers==4.56.0
    • transformers-stream-generator==0.0.5
    • triton==3.3.0
    • typing-extensions==4.15.0
    • typing-inspection==0.4.1
    • tzdata==2025.2
    • urllib3==2.5.0
    • uvicorn==0.35.0
    • virtualenv==20.34.0
    • werkzeug==3.1.3
    • wrapt==1.17.3
    • xtuner==0.2.0
    • xxhash==3.5.0
    • yapf==0.43.0
    • yarl==1.20.1
    • zipp==3.23.0
prefix: /root/anaconda3/envs/xtunerv1`

BruceYu-Bit avatar Sep 18 '25 08:09 BruceYu-Bit

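Please provide the complete command-line log, ideally one that shows every installation step. The partial error message is not clear enough.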

HAOCHENYE avatar Sep 18 '25 15:09 HAOCHENYE

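If this were your first time downloading the GroupedGEMM source and building it, the `python setup.py clean` logic would not run, so I suspect the reproduction steps you gave are incomplete. The output you posted comes mainly from `python setup.py clean`. Please provide the full command-line output, for example as an attachment.

I created a conda env from the yaml you provided, and running `pip install -vv git+https://github.com/InternLM/GroupedGEMM.git@main` in it does not fail. If your environment cannot reach the external network, please describe your GroupedGEMM installation steps in more detail.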
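
One way to capture such a complete log is to send both stdout and stderr of the install command to a file while still printing them to the terminal. The following is a minimal sketch, not an official xtuner script: `run_logged` and the file name `install.log` are hypothetical names chosen for illustration, and removing the `build` directory first is an assumption about what a clean starting state means here.

```shell
# Run a command, show its output, and save stdout+stderr to a log file.
# Note: the pipeline's exit status is tee's, not the wrapped command's.
run_logged() {
    log_file="$1"
    shift
    "$@" 2>&1 | tee "$log_file"
}

# Intended usage from inside the GroupedGEMM checkout (not executed here):
#   rm -rf build
#   run_logged install.log pip install -v --no-build-isolation -e .
```

The resulting `install.log` can then be attached to the issue in full rather than pasted as a fragment.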

CyCle1024 avatar Sep 19 '25 06:09 CyCle1024

@BruceYu-Bit Sorry, my earlier command was missing the `--recursive` flag. It should be:

git clone --recursive https://github.com/InternLM/GroupedGEMM
cd GroupedGEMM
pip install -v --no-build-isolation -e .

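This method also installs successfully for me in the conda env built from your yaml. It would be best to post the log from installing this way, especially the log from an install run after deleting the build directory.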
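
If a checkout was made without `--recursive`, or the submodule download was interrupted, the submodules can usually be fetched afterwards with `git submodule update --init --recursive` instead of cloning again. The sketch below lists which submodules are still missing. `uninitialized_submodules` is a hypothetical helper name, and it relies on `git submodule status` marking entries that are not checked out with a leading `-`.

```shell
# Print the paths of submodules that have not been checked out yet.
# Reads `git submodule status` output on stdin; such entries start with "-".
uninitialized_submodules() {
    sed -n 's/^-[0-9a-f]* \([^ ]*\).*/\1/p'
}

# Intended usage from inside the GroupedGEMM checkout (not executed here):
#   git submodule status | uninitialized_submodules
#   git submodule update --init --recursive
```

An empty result from the first command suggests that every submodule is already checked out.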

CyCle1024 avatar Sep 19 '25 06:09 CyCle1024

@CyCle1024 My reproduction steps are as follows:

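  1. `git clone --recursive https://github.com/InternLM/GroupedGEMM` downloads GroupedGEMM, but the cutlass download fails.
  2. Manually download cutlass and check out the corresponding commit.
  3. Run the build step: `cd GroupedGEMM`, then `pip install -v --no-build-isolation -e .`. The full error output is in the attachment.

error.txt (captured after deleting the build directory)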
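
When cutlass is downloaded by hand, it helps to confirm that the checked-out commit is the one the GroupedGEMM repository pins. The sketch below assumes that cutlass is registered as a git submodule, which the thread implies but does not show. Submodule pins appear in `git ls-tree` output as entries with mode `160000`. `pinned_submodule_commits` is a hypothetical helper name.

```shell
# Print "<commit> <path>" for every submodule pinned by the superproject.
# Reads `git ls-tree -r HEAD` output on stdin; submodule entries have
# mode 160000 and type "commit". Paths containing spaces are not handled.
pinned_submodule_commits() {
    awk '$1 == "160000" && $2 == "commit" { print $3, $4 }'
}

# Intended usage from inside the GroupedGEMM checkout (not executed here):
#   git ls-tree -r HEAD | pinned_submodule_commits
#   git -C <submodule path> rev-parse HEAD   # should print the same commit
```

If the two commits differ, the manually downloaded cutlass is not at the revision the build expects.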

BruceYu-Bit avatar Sep 19 '25 07:09 BruceYu-Bit

Could you check your nvcc version? For example, run:

nvcc --version

I suspect this is a CUDA toolkit version problem.

windreamer avatar Sep 19 '25 08:09 windreamer

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

BruceYu-Bit avatar Sep 22 '25 03:09 BruceYu-Bit

On our side, the build succeeds with CUDA toolkit 12.8. We recommend using that version. @BruceYu-Bit
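
A quick pre-build check of the toolkit version can be sketched as follows. The helper names are hypothetical, and 12.8 is used as the threshold only because it is the version reported to build successfully here. The thread does not say whether versions between 12.1 and 12.8 also work.

```shell
# Extract the "major.minor" release from `nvcc --version` output on stdin.
cuda_release_from_nvcc() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# version_at_least HAVE WANT: succeed if HAVE >= WANT (major.minor only).
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        if (h[1] + 0 != w[1] + 0) exit !(h[1] + 0 > w[1] + 0)
        exit !(h[2] + 0 >= w[2] + 0)
    }'
}

# Only runs the check when nvcc is actually on PATH.
if command -v nvcc >/dev/null 2>&1; then
    have=$(nvcc --version | cuda_release_from_nvcc)
    if version_at_least "$have" 12.8; then
        echo "CUDA toolkit $have: at or above 12.8"
    else
        echo "CUDA toolkit $have: older than 12.8, the version reported to build"
    fi
fi
```

Applied to the nvcc output posted above, the first helper yields `12.1`, which the second helper reports as older than 12.8.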

CyCle1024 avatar Sep 22 '25 04:09 CyCle1024

Could you provide a Docker image?

Opdoop avatar Sep 23 '25 03:09 Opdoop

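The xtuner project root contains a Dockerfile and a build script, image_build.sh, which must be run from the project root. The image it builds is tagged locally as xtuner:${commit_sha}.

We also plan to push an xtuner image for the CUDA platform soon. I will let you know in this issue once that is done.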

CyCle1024 avatar Sep 23 '25 07:09 CyCle1024

Thanks! CUDA 12.8 solves the problem.

BruceYu-Bit avatar Sep 24 '25 08:09 BruceYu-Bit

@CyCle1024 Requesting the Docker image, please.

Opdoop avatar Sep 29 '25 01:09 Opdoop

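Sorry, our recent internal discussion concluded that we will provide an image when the stable new V1 release ships. There is no official image available yet; the release is planned before the end of this month.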

CyCle1024 avatar Oct 13 '25 14:10 CyCle1024

@Opdoop @BruceYu-Bit I have just pushed the image for the current RC version. You can pull it with:

docker pull openmmlab/xtuner:pt28_20251104_4990d05_rc

CyCle1024 avatar Nov 04 '25 13:11 CyCle1024