
[BUG]: How do I install ColossalAI on an NPU? The project mentions support for this, but I couldn't find a tutorial

Open · obj12 opened this issue 10 months ago • 4 comments

Is there an existing issue for this bug?

  • [x] I have searched the existing issues

The bug has not been fixed in the latest main branch

  • [x] I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

I don't know how to install ColossalAI on an NPU. I hope there can be a corresponding tutorial on how to use the extensions module to install ColossalAI for NPU.

Environment

No response

obj12 avatar Feb 20 '25 02:02 obj12

We provide an Ascend Torch base image:

docker pull hpcaitech/pytorch-npu:2.4.0

On top of it, install ColossalAI directly. For the latest stable release:

pip install colossalai

Or for the main branch:

pip install git+https://github.com/hpcaitech/ColossalAI.git

ver217 avatar Feb 20 '25 03:02 ver217
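(Editor's note: a quick way to confirm the install works inside that image is the minimal Python sketch below. It assumes the hpcaitech/pytorch-npu image ships torch_npu, Ascend's PyTorch adapter; that assumption is not verified in this thread.)

# Minimal sanity check for an Ascend/NPU PyTorch install.
# Assumes torch_npu is preinstalled (it registers the "npu" device type).
import torch
import torch_npu  # noqa: F401

print(torch.__version__)
print(torch.npu.is_available())        # True if an Ascend device is visible
x = torch.ones(2, 2, device="npu:0")   # allocate a tensor on the first NPU
print((x + x).cpu())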

@ver217 Hi ~ I tried to install coati in the NPU docker environment, but got an error because "NCCL" is not available.

How should we guide the install logic to recognize Ascend "HCCL"?

/dpc/wangzy/deepseek/ColossalAI/applications/ColossalChat# pip install .
Processing /dpc/wangzy/deepseek/ColossalAI/applications/ColossalChat
  Preparing metadata (setup.py) ... done
Requirement already satisfied: transformers==4.39.3 in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (4.39.3)
Requirement already satisfied: tqdm in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (4.67.0)
Collecting datasets==2.14.7 (from coati==1.0.0)
  Downloading datasets-2.14.7-py3-none-any.whl.metadata (19 kB)
Collecting loralib (from coati==1.0.0)
  Downloading loralib-0.1.2-py3-none-any.whl.metadata (15 kB)
Requirement already satisfied: colossalai>=0.4.7 in /dpc/wangzy/deepseek/ColossalAI (from coati==1.0.0) (0.4.7)
Requirement already satisfied: torch>=2.1.0 in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (2.4.1)
Collecting langchain (from coati==1.0.0)
  Downloading langchain-0.3.19-py3-none-any.whl.metadata (7.9 kB)
Requirement already satisfied: tokenizers in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (0.15.2)
Requirement already satisfied: fastapi in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (0.115.8)
Collecting sse_starlette (from coati==1.0.0)
  Downloading sse_starlette-2.2.1-py3-none-any.whl.metadata (7.8 kB)
Collecting wandb (from coati==1.0.0)
  Downloading wandb-0.19.6-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (10 kB)
Requirement already satisfied: sentencepiece in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (0.2.0)
Collecting gpustat (from coati==1.0.0)
  Downloading gpustat-1.1.1.tar.gz (98 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: packaging in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (24.1)
Collecting autoflake==2.2.1 (from coati==1.0.0)
  Downloading autoflake-2.2.1-py3-none-any.whl.metadata (7.3 kB)
Collecting black==23.9.1 (from coati==1.0.0)
  Downloading black-23.9.1-py3-none-any.whl.metadata (65 kB)
Requirement already satisfied: tensorboard in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (2.18.0)
Requirement already satisfied: six==1.16.0 in /root/miniconda3/envs/glm-32b/lib/python3.10/site-packages (from coati==1.0.0) (1.16.0)
Collecting ninja==1.11.1 (from coati==1.0.0)
  Downloading ninja-1.11.1-py2.py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (5.3 kB)
Collecting sentencepiece (from coati==1.0.0)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (7.7 kB)
Collecting flash-attn (from coati==1.0.0)
  Downloading flash_attn-2.7.4.post1.tar.gz (6.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 266.9 kB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      /tmp/pip-install-smmj8s2o/flash-attn_6d6ea029aca840a68bc86afc3a228298/setup.py:106: UserWarning: flash_attn was requested, but nvcc was not found.  Are you sure your environment has nvcc available?  If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
        warnings.warn(
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-smmj8s2o/flash-attn_6d6ea029aca840a68bc86afc3a228298/setup.py", line 198, in <module>
          CUDAExtension(
        File "/root/miniconda3/envs/glm-32b/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1076, in CUDAExtension
          library_dirs += library_paths(cuda=True)
        File "/root/miniconda3/envs/glm-32b/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1207, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
        File "/root/miniconda3/envs/glm-32b/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2416, in _join_cuda_home
          raise OSError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
      
      
      torch.__version__  = 2.4.1
      
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

We also want to run LoRA fine-tuning of the DeepSeek 671B model on 4 nodes, with 8 Ascend 910B3 NPUs per node.

colossalai run --host 10.2.0.91,10.2.0.92 --nproc_per_node 8 \
  lora_finetune.py --pretrained /dpc/zhanghaobo/deepseek-r1/DeepSeek-R1-BF16-LOCAL \
  --dataset /dpc/wangzy/deepseek/ColossalAI/lora_sft_data.jsonl --plugin moe \
  --lr 2e-5 --max_length 256 --g --ep 8 --pp 3  \
  --batch_size 24 --lora_rank 8 --lora_alpha 16 \
  --num_epochs 2 --warmup_steps 8 \
  --tensorboard_dir logs --save_dir /dpc/wangzy/deepseek/DeepSeek-R1-bf16-lora

wangyuan249 avatar Feb 20 '25 08:02 wangyuan249

flash_attn is not available on NPU devices. DON'T install flash_attn; instead, create a dummy package directory in your Python site-packages path so that import flash_attn resolves to an empty module. E.g.

mkdir .conda/envs/myenv/lib/python3.10/site-packages/flash_attn
touch .conda/envs/myenv/lib/python3.10/site-packages/flash_attn/__init__.py

ver217 avatar Feb 24 '25 07:02 ver217
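(Editor's note: the NCCL/HCCL question above is not answered directly in the thread. For reference, torch_npu registers an "hccl" process-group backend, so distributed initialization on Ascend typically looks like the minimal Python sketch below. It assumes torch_npu is installed and that RANK/WORLD_SIZE/LOCAL_RANK are set by a launcher such as torchrun or colossalai run; this is an illustration, not ColossalAI's internal install logic.)

import os
import torch
import torch_npu  # registers the NPU device type and the "hccl" backend
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.npu.set_device(local_rank)          # bind this process to one NPU
dist.init_process_group(backend="hccl")   # HCCL replaces NCCL on Ascend

t = torch.ones(1, device=f"npu:{local_rank}")
dist.all_reduce(t)                        # sums the tensor across all ranks
print(dist.get_rank(), t.item())
dist.destroy_process_group()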