[Urgent!!!] Installing GroupedGEMM, recommended as an extra for training MoE models, fails
I followed the official documentation to install GroupedGEMM, but it fails to build. The error output is:
Running setup.py clean for grouped_gemm
Running command python setup.py clean
/root/anaconda3/envs/xtuner/lib/python3.10/site-packages/setuptools/dist.py:759: SetuptoolsDeprecationWarning: License classifiers are deprecated.
!!
********************************************************************************
Please consider removing the following classifiers in favor of a SPDX license expression:
License :: OSI Approved :: BSD License
See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
********************************************************************************
!!
self._finalize_license_expression()
running clean
removing 'build/temp.linux-x86_64-cpython-310' (and everything under it)
removing 'build/lib.linux-x86_64-cpython-310' (and everything under it)
'build/bdist.linux-x86_64' does not exist -- can't clean it
'build/scripts-3.10' does not exist -- can't clean it
removing 'build'
Failed to build grouped_gemm
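@BruceYu-Bit Please describe how you installed it (if you followed the official documentation, was it: pip install git+https://github.com/InternLM/GroupedGEMM.git@main) and which conda environment you used. You can run conda env export -n xtuner > xtuner.yaml and post the resulting xtuner.yaml, so that I can try to reproduce the problem on my side.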
@BruceYu-Bit If pip install git+https://github.com/InternLM/GroupedGEMM.git@main does not work, try downloading the source and installing it with pip:
git clone https://github.com/InternLM/GroupedGEMM
cd GroupedGEMM
pip install -v --no-build-isolation -e .
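I installed it by downloading https://github.com/InternLM/GroupedGEMM.git and the specified cutlass commit, then running pip install -v --no-build-isolation -e . This produced the error above. My environment is listed below. I tried Python 3.10 and 3.11, and neither worked:
`name: xtunerv1
channels: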
- defaults
- https://repo.anaconda.com/pkgs/main
- https://repo.anaconda.com/pkgs/r
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- bzip2=1.0.8=h5eee18b_6
- ca-certificates=2025.9.9=h06a4308_0
- ld_impl_linux-64=2.40=h12ee557_0
- libffi=3.4.4=h6a678d5_1
- libgcc-ng=11.2.0=h1234567_1
- libgomp=11.2.0=h1234567_1
- libstdcxx-ng=11.2.0=h1234567_1
- libuuid=1.41.5=h5eee18b_0
- libxcb=1.17.0=h9b100fa_0
- libzlib=1.3.1=hb25bd0a_0
- ncurses=6.5=h7934f7d_0
- openssl=1.1.1w=h7f8727e_0
- pip=25.2=pyhc872135_0
- pthread-stubs=0.3=h0ce48e5_1
- python=3.11.0=h7a1cb2a_3
- readline=8.3=hc2a1206_0
- setuptools=78.1.1=py311h06a4308_0
- sqlite=3.50.2=hb25bd0a_1
- tk=8.6.15=h54e0aa7_0
- wheel=0.45.1=py311h06a4308_0
- xorg-libx11=1.8.12=h9b100fa_1
- xorg-libxau=1.0.12=h9b100fa_0
- xorg-libxdmcp=1.1.5=h9b100fa_0
- xorg-xorgproto=2024.1=h5eee18b_1
- xz=5.6.4=h5eee18b_1
- zlib=1.3.1=hb25bd0a_0
- pip:
- absl-py==2.3.1
- accelerate==1.10.1
- addict==2.4.0
- aiohappyeyeballs==2.6.1
- aiohttp==3.12.15
- aiohttp-cors==0.8.1
- aiosignal==1.4.0
- annotated-types==0.7.0
- anyio==4.10.0
- attrs==25.3.0
- bitsandbytes==0.45.0
- cachetools==5.5.2
- certifi==2025.8.3
- charset-normalizer==3.4.3
- click==8.2.1
- colorful==0.5.7
- contourpy==1.3.3
- cycler==0.12.1
- cyclopts==3.24.0
- datasets==3.6.0
- dill==0.3.8
- distlib==0.4.0
- docstring-parser==0.17.0
- docutils==0.22
- einops==0.8.1
- et-xmlfile==2.0.0
- fastapi==0.116.2
- filelock==3.19.1
- fonttools==4.59.2
- frozenlist==1.7.0
- fsspec==2025.3.0
- google-api-core==2.25.1
- google-auth==2.40.3
- googleapis-common-protos==1.70.0
- grpcio==1.75.0
- h11==0.16.0
- hf-xet==1.1.10
- httpcore==1.0.9
- httpx==0.28.1
- huggingface-hub==0.35.0
- idna==3.10
- imageio==2.37.0
- importlib-metadata==8.7.0
- jinja2==3.1.6
- jsonschema==4.25.1
- jsonschema-specifications==2025.9.1
- kiwisolver==1.4.9
- lazy-loader==0.4
- loguru==0.7.3
- markdown==3.9
- markdown-it-py==4.0.0
- markupsafe==3.0.2
- matplotlib==3.10.6
- mdurl==0.1.2
- mmengine==0.11.0rc0
- mpmath==1.3.0
- msgpack==1.1.1
- multidict==6.6.4
- multiprocess==0.70.16
- networkx==3.5
- numpy==2.2.6
- nvidia-cublas-cu12==12.6.4.1
- nvidia-cuda-cupti-cu12==12.6.80
- nvidia-cuda-nvrtc-cu12==12.6.77
- nvidia-cuda-runtime-cu12==12.6.77
- nvidia-cudnn-cu12==9.5.1.17
- nvidia-cufft-cu12==11.3.0.4
- nvidia-cufile-cu12==1.11.1.6
- nvidia-curand-cu12==10.3.7.77
- nvidia-cusolver-cu12==11.7.1.2
- nvidia-cusparse-cu12==12.5.4.2
- nvidia-cusparselt-cu12==0.6.3
- nvidia-nccl-cu12==2.26.2
- nvidia-nvjitlink-cu12==12.6.85
- nvidia-nvtx-cu12==12.6.77
- opencensus==0.11.4
- opencensus-context==0.1.3
- opencv-python-headless==4.12.0.88
- openpyxl==3.1.5
- opentelemetry-api==1.37.0
- opentelemetry-exporter-prometheus==0.58b0
- opentelemetry-proto==1.37.0
- opentelemetry-sdk==1.37.0
- opentelemetry-semantic-conventions==0.58b0
- packaging==25.0
- pandas==2.3.2
- peft==0.17.1
- pillow==11.3.0
- platformdirs==4.4.0
- prometheus-client==0.22.1
- propcache==0.3.2
- proto-plus==1.26.1
- protobuf==6.32.1
- psutil==7.0.0
- py-spy==0.4.1
- pyarrow==21.0.0
- pyasn1==0.6.1
- pyasn1-modules==0.4.2
- pydantic==2.11.9
- pydantic-core==2.33.2
- pygments==2.19.2
- pyparsing==3.2.4
- python-dateutil==2.9.0.post0
- pytz==2025.2
- pyyaml==6.0.2
- ray==2.49.1
- referencing==0.36.2
- regex==2025.9.1
- requests==2.32.5
- rich==14.1.0
- rich-rst==1.3.1
- rpds-py==0.27.1
- rsa==4.9.1
- safetensors==0.6.2
- scikit-image==0.25.2
- scipy==1.16.2
- sentencepiece==0.2.1
- six==1.17.0
- smart-open==7.3.1
- sniffio==1.3.1
- starlette==0.48.0
- sympy==1.14.0
- tensorboard==2.20.0
- tensorboard-data-server==0.7.2
- termcolor==3.1.0
- tifffile==2025.9.9
- tiktoken==0.11.0
- timm==1.0.19
- tokenizers==0.22.0
- torch==2.7.0
- torchvision==0.22.0
- tqdm==4.67.1
- transformers==4.56.0
- transformers-stream-generator==0.0.5
- triton==3.3.0
- typing-extensions==4.15.0
- typing-inspection==0.4.1
- tzdata==2025.2
- urllib3==2.5.0
- uvicorn==0.35.0
- virtualenv==20.34.0
- werkzeug==3.1.3
- wrapt==1.17.3
- xtuner==0.2.0
- xxhash==3.5.0
- yapf==0.43.0
- yarl==1.20.1
- zipp==3.23.0
prefix: /root/anaconda3/envs/xtunerv1 `
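Please provide the complete command-line log, ideally one that shows every installation step. The partial error message is not very informative.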
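If this were your first time downloading and compiling the GroupedGEMM source, python setup.py clean would not be executed, so I suspect the reproduction steps are incomplete. The error you posted comes mainly from python setup.py clean. Please provide the complete command-line output, for example as an attachment.
I created a conda env from the YAML you provided, and running pip install -vv git+https://github.com/InternLM/GroupedGEMM.git@main in it does not fail. If your environment cannot access the external network, please describe your GroupedGEMM installation steps in more detail.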
@BruceYu-Bit Sorry, a --recursive flag was missing here. It should be:
git clone --recursive https://github.com/InternLM/GroupedGEMM
cd GroupedGEMM
pip install -v --no-build-isolation -e .
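This method also installs successfully in the conda env built from your YAML. Ideally, post your log from these exact steps, especially the log from installing again after deleting the build directory.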
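The clean reinstall requested here can be sketched as a small shell helper. This is only a sketch: the function name clean_rebuild and the default log file name are hypothetical, and it assumes the source was cloned into a directory such as GroupedGEMM.

```shell
# Sketch: reinstall from a clean tree and keep the full log for the issue.
# clean_rebuild and the default log name are hypothetical, not part of GroupedGEMM.
clean_rebuild() {
  local repo=$1
  local log=${2:-grouped_gemm_install.log}
  # Remove stale build artifacts left behind by a previous attempt.
  rm -rf "$repo/build"
  # Run the install inside the checkout; capture stdout and stderr together.
  # The function's exit status is the exit status of the install command.
  ( cd "$repo" && pip install -v --no-build-isolation -e . ) >"$log" 2>&1
}

# Usage (run manually):
#   clean_rebuild GroupedGEMM && echo "build OK" || echo "build failed, attach the log"
```

Writing the log to a file rather than the terminal makes it easy to attach the whole output to the issue instead of a partial excerpt.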
@CyCle1024 My reproduction steps are as follows:
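- git clone --recursive https://github.com/InternLM/GroupedGEMM downloads GroupedGEMM, but downloading cutlass fails
- Manually download cutlass and switch to the corresponding commit
- Run the build step: cd GroupedGEMM, then pip install -v --no-build-isolation -e . The complete error output is in the attachment
(with the build directory deleted):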
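Since the --recursive clone fetched GroupedGEMM but not cutlass, it may help to check how git itself sees the submodule before building. The sketch below assumes cutlass is tracked as a git submodule (which the --recursive instruction suggests); submodule_problems is a hypothetical helper name. In git submodule status output, a leading - marks a submodule that is not initialized, and a leading + marks one checked out at a commit other than the one the repository records.

```shell
# Sketch: read `git submodule status` output on stdin and list submodules
# that git considers missing ("-") or at an unexpected commit ("+").
# submodule_problems is a hypothetical helper name.
submodule_problems() {
  awk '
    /^-/  { print "not initialized: " $2 }
    /^\+/ { print "commit mismatch: " $2 }
  '
}

# Usage (run manually from the directory that contains the checkout):
#   git -C GroupedGEMM submodule status --recursive | submodule_problems
#   git -C GroupedGEMM submodule update --init --recursive   # retry the download
```

This reports git's view only. A tree that was copied in by hand may still be listed as not initialized even when the files are present, so an empty result is informative but a non-empty one is not proof of a broken build.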
Could you check your nvcc version? For example, run:
nvcc --version
I suspect this is a CUDA toolkit version issue.
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
On our side, the build succeeds with CUDA toolkit 12.8. We recommend using that version. @BruceYu-Bit
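This version comparison can be scripted. The sketch below extracts the release number from nvcc --version output and compares it with 12.8, the toolkit version the maintainers report building with. Whether versions between 12.1 and 12.8 also work is not stated in this thread. nvcc_release and version_ge are hypothetical helper names.

```shell
# Sketch: extract the "release X.Y" number from `nvcc --version` output on stdin.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Return 0 if version $1 >= version $2, comparing major then minor.
version_ge() {
  local have_major=${1%%.*}
  local have_minor=${1#*.}
  local want_major=${2%%.*}
  local want_minor=${2#*.}
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
  else
    [ "$have_minor" -ge "$want_minor" ]
  fi
}

# Only runs the check when nvcc is actually on PATH.
if command -v nvcc >/dev/null 2>&1; then
  found=$(nvcc --version | nvcc_release)
  if [ -n "$found" ] && version_ge "$found" 12.8; then
    echo "CUDA toolkit $found: at least 12.8"
  else
    echo "CUDA toolkit ${found:-unknown}: older than 12.8 or not detected"
  fi
fi
```

Note that this checks the toolkit used for compilation (nvcc), which can differ from the CUDA runtime libraries installed by pip alongside torch.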
Could you provide a Docker image?
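The xtuner project root contains a Dockerfile and a build script, image_build.sh (which must be run from the project root). The resulting image is tagged locally as xtuner:${commit_sha}.
We will also push an xtuner image for the CUDA platform soon. I will let you know in this issue once it is available.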
@CyCle1024 Requesting the Docker image.
@Opdoop @BruceYu-Bit I have just pushed the RC image. You can pull it with the following command:
docker pull openmmlab/xtuner:pt28_20251104_4990d05_rc
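After pulling, a quick smoke test can confirm that the container sees the GPU. This is a sketch under assumptions not confirmed in this thread: that the host has the NVIDIA Container Toolkit installed (needed for --gpus all), and that python and torch are available on the image's default PATH. xtuner_image_check is a hypothetical helper name.

```shell
# Sketch: print the torch version, the CUDA version torch was built against,
# and whether a GPU is visible inside the container.
# Assumes `python` and `torch` exist in the image (not confirmed in this thread).
xtuner_image_check() {
  local image=${1:-openmmlab/xtuner:pt28_20251104_4990d05_rc}
  docker run --rm --gpus all "$image" \
    python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'
}

# Usage (run manually):
#   xtuner_image_check
#   xtuner_image_check xtuner:<your-local-tag>   # for an image built with image_build.sh
```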