Hongxin Liu
DON'T merge to main. Create a new feature branch on the org repo and merge to it.
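A minimal sketch of that workflow, assuming the org repo is the `origin` remote and using a hypothetical branch name `feature/my-change`:

```bash
# Create a feature branch and push it to the org repo (branch name is a placeholder).
git checkout -b feature/my-change
git push origin feature/my-change
# Then open the PR against feature/my-change instead of main.
```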
Could you reinstall the latest version of colossalai?
Does it compare with apex's implementation? We've integrated some apex CUDA kernels, and some of them are also implemented in Liger-Kernel.
DeepSpeed ZeRO-3 fully shards the weights, while TP does not shard everything (e.g., non-Linear/Embedding layers). This can happen when activations are small; please provide more details.
As the LoRA weights are initialized randomly.
> @Edenzzzz I am using [script](https://github.com/hpcaitech/ColossalAI/blob/v0.3.6/examples/language/llama2/finetune.py). I am using dataset [yizhongw/self_instruct](https://huggingface.co/datasets/yizhongw/self_instruct)
>
> Eval logs for model trained with Hybrid Parallel plugin and pp_size=4 and tp_size=4
>
> ```
> ...
Is `NPU-VISIBLE-DEVICES` an environment variable you set locally? Shouldn't the correct format be `NPU_VISIBLE_DEVICES`?
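A minimal sketch assuming the underscore form is the one your launcher reads; the device IDs are placeholders:

```bash
# Use underscores, not dashes, in the variable name.
export NPU_VISIBLE_DEVICES=0,1,2,3
```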
We provide an Ascend Torch base image: `docker pull hpcaitech/pytorch-npu:2.4.0`. On top of it, just install colossalai directly: install the latest stable release with `pip install colossalai`, or install the main branch with `pip install git+https://github.com/hpcaitech/ColossalAI.git`.
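Putting those commands together, a minimal sketch of the flow; the `docker run` flags (device and driver mounts) are assumptions for a typical Ascend host and may need adjusting:

```bash
# Pull the Ascend Torch base image.
docker pull hpcaitech/pytorch-npu:2.4.0

# Start a container (the NPU device and driver mounts below are assumptions).
docker run -it --network host \
    --device /dev/davinci0 --device /dev/davinci_manager \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    hpcaitech/pytorch-npu:2.4.0 bash

# Inside the container: latest stable release...
pip install colossalai
# ...or the main branch.
pip install git+https://github.com/hpcaitech/ColossalAI.git
```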
flash_attn is not available on NPU devices. DON'T install flash_attn; instead, create a dummy package directory in your Python site-packages path so the import check passes. E.g.
```bash
mkdir .conda/envs/myenv/lib/python3.10/site-packages/flash_attn
touch .conda/envs/myenv/lib/python3.10/site-packages/flash_attn/__init__.py
```
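If your environment path differs from the example above, a sketch that locates the site-packages directory automatically (assuming the target environment's Python is the one on `PATH`):

```bash
# Find the active environment's site-packages directory and create the stub package there.
SITE_PACKAGES=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
mkdir -p "$SITE_PACKAGES/flash_attn"
touch "$SITE_PACKAGES/flash_attn/__init__.py"
```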
Hi, could you reinstall the latest colossalai?