ColossalAI icon indicating copy to clipboard operation
ColossalAI copied to clipboard

[BUG]: errors of combine shard_init and tensor parallel in rlhf examples

Open ziyuwan opened this issue 2 years ago β€’ 2 comments

πŸ› Describe the bug

I'm working on the RLHF implementation in application/ChatGPT,

Can I use LoRA in my actor/critic and turn shard_init on while also tensor parallelize my model after the initialization? I' ve tried several combinations all of the above functionalities and I've found problems under the following settings:

  • LoRA + shard_init + TP: I modify the construction function of LoRALinear in order to provide it the correct weight shape, but I found something went wrong resulting in NotImplementedError raised in colossalai/nn/_ops/embedding.py:130
  • shard_init + TP: also the same NotImplementedError image

And also a noob question: when I use shard_init, I cannot let tp_degree different from nproc_per_node which raises error image

While (LoRA + TP) runs well so I think maybe the problem is purely because my implementation of tensor parallelization has some conflicts with the provided scripts?

So here I first provide the TP code and my running scripts first. And about LoRA problem since it needs to provide another file, so maybe we can see it after (shard_init + TP) runs right.

Reproduction

my_train_dummy.py.zip also modify GPTActor and GPTCritic by adding an input of lora_rank like the following code image

Environment

torch 1.13.1 torchaudio 0.13.1 torchvision 0.14.1 colossalai 0.2.5

ziyuwan avatar Mar 13 '23 06:03 ziyuwan

Bot detected the issue body's language is not English, translate it automatically. πŸ‘―πŸ‘­πŸ»πŸ§‘β€πŸ€β€πŸ§‘πŸ‘«πŸ§‘πŸΏβ€πŸ€β€πŸ§‘πŸ»πŸ‘©πŸΎβ€πŸ€β€πŸ‘¨πŸΏπŸ‘¬πŸΏ


Title: [BUG]:

Issues-translate-bot avatar Mar 13 '23 06:03 Issues-translate-bot

we are supporting Tensor paraller for a transformer network now, it is not complete, you can find the PR:https://github.com/hpcaitech/ColossalAI/pull/3070, which is still in process

Fazziekey avatar Mar 14 '23 01:03 Fazziekey

hi @ziyuwan "use LoRA in my actor/critic and turn shard_init on" Have you solve the problem now ?

taishiciR avatar Mar 28 '23 13:03 taishiciR

Hi @ziyuwan @taishiciR As mentioned in Chat example, --strategy 'colossalai_gemini' or 'colossalai_zero2' is enough for most cases. TP is not supported for Chat currently. It is relatively low on our priority list for open source planning. If you need customized in-depth cooperation or support, please send the details to [email protected] We also have updated a lot. Please check the latest code. This issue was closed due to inactivity. Thanks.

binmakeswell avatar May 05 '23 04:05 binmakeswell