ColossalAI
[BUG]: failed to install coati in npu docker environment
Is there an existing issue for this bug?
- [x] I have searched the existing issues
The bug has not been fixed in the latest main branch
- [x] I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
Base image: `hpcaitech/pytorch-npu:2.4.0`
Install path: `ColossalAI/applications/ColossalChat`
```
$ pip install .
Looking in indexes: https://pypi.org/simple, https://pypi.tuna.tsinghua.edu.cn/simple
Processing /dpc/wangzy/deepseek/ColossalAI/applications/ColossalChat
  Preparing metadata (setup.py) ... done
Collecting autoflake==2.2.1 (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9e/a5/8471753bc95672fb16d9cd1cb82ba460c66721378dd8cc8629d86c148a09/autoflake-2.2.1-py3-none-any.whl (32 kB)
Collecting black==23.9.1 (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/28/c7/150de595f9e5ee1efffeb398acfac3e37d218171100049c77e494326dc4b/black-23.9.1-py3-none-any.whl (182 kB)
Collecting colossalai>=0.4.7 (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/15/b3/ef0726bd75bd9348e004a0ae1f4944a747a54f269bf8012dabc9ef129195/colossalai-0.4.8.tar.gz (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 2.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting datasets (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/30/b7/2622230d4b3540f9c7907664daf9ae6319519f36731a1a39f5ad541efff2/datasets-3.3.1-py3-none-any.whl (484 kB)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/00/23/80a2147a547cb2fd59eb92a13787c849b3efaefcea02a5c963dfc93f7c56/datasets-2.14.7-py3-none-any.whl (520 kB)
Requirement already satisfied: fastapi in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (0.115.8)
Collecting flash-attn (from coati==1.0.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/11/34/9bf60e736ed7bbe15055ac2dab48ec67d9dbd088d2b4ae318fd77190ab4e/flash_attn-2.7.4.post1.tar.gz (6.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 2.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      /tmp/pip-install-xm5497bv/flash-attn_684681b0091c4779be1d8505325045be/setup.py:106: UserWarning: flash_attn was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
        warnings.warn(
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-xm5497bv/flash-attn_684681b0091c4779be1d8505325045be/setup.py", line 198, in <module>
          CUDAExtension(
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1076, in CUDAExtension
          library_dirs += library_paths(cuda=True)
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1207, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2416, in _join_cuda_home
          raise OSError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
      torch.__version__ = 2.4.0
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
```
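The failure comes from flash-attn's setup.py, which requires nvcc and a `CUDA_HOME` pointing at a CUDA toolkit; the NPU base image provides neither. Below is a minimal workaround sketch I am considering, assuming flash-attn is optional on Ascend and that the pin lives in ColossalChat's requirements.txt (both the file name and whether coati runs without flash-attn on NPU are assumptions, not verified):

```bash
# Workaround sketch (not verified on Ascend): flash-attn needs nvcc and CUDA_HOME,
# which hpcaitech/pytorch-npu:2.4.0 does not ship, so drop it before installing coati.
cd ColossalAI/applications/ColossalChat

# Assumption: the flash-attn dependency is listed in requirements.txt; adjust if it
# is declared elsewhere (e.g. in setup.py).
sed -i.bak '/flash-attn/d' requirements.txt

pip install .
```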
Context: we want to run LoRA fine-tuning of the DeepSeek 671B model on 4 nodes with 8 Ascend 910B3 NPUs per node, using the following command:
```bash
colossalai run --host 10.2.0.91,10.2.0.92 --nproc_per_node 8 \
    lora_finetune.py --pretrained /dpc/zhanghaobo/deepseek-r1/DeepSeek-R1-BF16-LOCAL \
    --dataset /dpc/wangzy/deepseek/ColossalAI/lora_sft_data.jsonl --plugin moe \
    --lr 2e-5 --max_length 256 --g --ep 8 --pp 3 \
    --batch_size 24 --lora_rank 8 --lora_alpha 16 \
    --num_epochs 2 --warmup_steps 8 \
    --tensorboard_dir logs --save_dir /dpc/wangzy/deepseek/DeepSeek-R1-bf16-lora
```
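Before the distributed launch, a quick check that each host's container actually sees its NPUs would help separate launch problems from the install failure above. A minimal sketch, assuming torch_npu is preinstalled in the image and registers the `torch.npu` namespace:

```bash
# Pre-flight sketch: confirm every host passed to --host can see its 8 Ascend 910B3
# devices before launching the distributed LoRA run.
# Assumption: torch_npu is available in the container and exposes torch.npu.
for host in 10.2.0.91 10.2.0.92; do
  ssh "$host" 'python -c "import torch, torch_npu; print(torch.npu.is_available(), torch.npu.device_count())"'
done
```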
Environment
No response