
[BUG]: failed to install coati in NPU docker environment

Open · wangyuan249 opened this issue · 0 comments

Is there an existing issue for this bug?

  • [x] I have searched the existing issues

The bug has not been fixed in the latest main branch

  • [x] I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

base image: hpcaitech/pytorch-npu:2.4.0

install path: ColossalAI/applications/ColossalChat

pip install .
Looking in indexes: https://pypi.org/simple, https://pypi.tuna.tsinghua.edu.cn/simple
Processing /dpc/wangzy/deepseek/ColossalAI/applications/ColossalChat
  Preparing metadata (setup.py) ... done
Collecting autoflake==2.2.1 (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9e/a5/8471753bc95672fb16d9cd1cb82ba460c66721378dd8cc8629d86c148a09/autoflake-2.2.1-py3-none-any.whl (32 kB)
Collecting black==23.9.1 (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/28/c7/150de595f9e5ee1efffeb398acfac3e37d218171100049c77e494326dc4b/black-23.9.1-py3-none-any.whl (182 kB)
Collecting colossalai>=0.4.7 (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/15/b3/ef0726bd75bd9348e004a0ae1f4944a747a54f269bf8012dabc9ef129195/colossalai-0.4.8.tar.gz (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 2.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting datasets (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/30/b7/2622230d4b3540f9c7907664daf9ae6319519f36731a1a39f5ad541efff2/datasets-3.3.1-py3-none-any.whl (484 kB)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/00/23/80a2147a547cb2fd59eb92a13787c849b3efaefcea02a5c963dfc93f7c56/datasets-2.14.7-py3-none-any.whl (520 kB)
Requirement already satisfied: fastapi in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (0.115.8)
Collecting flash-attn (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/11/34/9bf60e736ed7bbe15055ac2dab48ec67d9dbd088d2b4ae318fd77190ab4e/flash_attn-2.7.4.post1.tar.gz (6.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 2.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      /tmp/pip-install-xm5497bv/flash-attn_684681b0091c4779be1d8505325045be/setup.py:106: UserWarning: flash_attn was requested, but nvcc was not found.  Are you sure your environment has nvcc available?  If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
        warnings.warn(
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-xm5497bv/flash-attn_684681b0091c4779be1d8505325045be/setup.py", line 198, in <module>
          CUDAExtension(
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1076, in CUDAExtension
          library_dirs += library_paths(cuda=True)
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1207, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2416, in _join_cuda_home
          raise OSError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
      
      
      torch.__version__  = 2.4.0
      
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
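
The root cause: pip picks up the flash-attn dependency from ColossalChat and tries to build its CUDA extension, which needs nvcc and CUDA_HOME; neither exists in the NPU image. A possible workaround (just a sketch, assuming flash-attn is not actually needed on Ascend NPU and that the dependency is listed in ColossalChat's requirements.txt next to setup.py) is to drop it before installing:

# Inside the hpcaitech/pytorch-npu:2.4.0 container
cd /dpc/wangzy/deepseek/ColossalAI/applications/ColossalChat
# Remove the flash-attn entry from the requirements file that setup.py reads
# (the file path is an assumption; adjust if the dependency is pinned elsewhere)
sed -i '/^flash[-_]attn/d' requirements.txt
# Reinstall; pip should no longer try to compile the CUDA extension
pip install .

flash-attn's own setup also appears to honour a FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE environment variable, but even if the install succeeded that way the package would have no usable kernels without CUDA, so removing the dependency seems safer on NPU.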

Context: we want to run LoRA fine-tuning of the DeepSeek 671B model on 4 nodes, with 8 Ascend 910B3 NPUs per node.

colossalai run --host 10.2.0.91,10.2.0.92 --nproc_per_node 8  \
  lora_finetune.py --pretrained /dpc/zhanghaobo/deepseek-r1/DeepSeek-R1-BF16-LOCAL  \
  --dataset /dpc/wangzy/deepseek/ColossalAI/lora_sft_data.jsonl --plugin moe  \
  --lr 2e-5 --max_length 256 --g --ep 8 --pp 3   \
  --batch_size 24 --lora_rank 8 --lora_alpha 16  \
  --num_epochs 2 --warmup_steps 8  \
  --tensorboard_dir logs --save_dir /dpc/wangzy/deepseek/DeepSeek-R1-bf16-lora
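
Before launching the multi-node run, a quick per-node sanity check (a sketch; it assumes the hpcaitech/pytorch-npu image ships torch_npu and that coati installed cleanly after the workaround above) can confirm the NPU backend and the package both import:

# Run on each node; should print True if the Ascend NPUs are visible
python -c "import torch, torch_npu, coati; print(torch.npu.is_available())"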

Environment

No response

wangyuan249 · Feb 20 '25, 09:02