Wenxuan Tan

Results: 46 comments of Wenxuan Tan

Thanks for raising this issue. It was probably caused by a recent transformers upgrade, so I've pushed a fix. For multi-node training, please refer to the commands in examples/language/llama/README.md

Please refer to similar examples on the PyTorch forums. You can either run Docker in host network mode or map a port from the container to the host. https://discuss.pytorch.org/t/how-to-multi-node-parallel-in-dockers-container/188736 https://discuss.pytorch.org/t/run-multi-node-training-inside-docker/167537
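
As a rough illustration (not taken from the linked threads), the key point is that every container must be able to reach the same rendezvous address and port, either via the host network or via a published port. A minimal sketch, assuming `RANK`/`WORLD_SIZE` are set by your launcher and the address/port below are placeholders:

```python
import os
import torch.distributed as dist

# Each container (or torchrun process) must see the same MASTER_ADDR/MASTER_PORT,
# and that address must be reachable from every node. With `--network host` the
# host IP works directly; with port mapping, the chosen port has to be published
# (e.g. `docker run -p 29500:29500 ...`).
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")  # placeholder: IP of the rank-0 host
os.environ.setdefault("MASTER_PORT", "29500")     # placeholder: the mapped/open port

dist.init_process_group(
    backend="nccl",
    init_method="env://",  # reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
print(f"rank {dist.get_rank()} / {dist.get_world_size()} initialized")
dist.destroy_process_group()
```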

This is not a bug on our end: flash attention doesn't support the V100, which is why it throws a "no kernel" error. You should uninstall flash_attn
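
For reference, one way to avoid this class of error is to gate the import on the GPU's compute capability. A minimal sketch; the `(8, 0)` threshold is an assumption (recent flash-attn releases target Ampere or newer, while the V100 is sm_70), so check the flash-attn docs for your version:

```python
import torch

# Only try flash-attn on GPUs new enough to have matching kernels.
major, minor = torch.cuda.get_device_capability(0)
supported = (major, minor) >= (8, 0)  # assumed minimum; V100 is (7, 0)

try:
    if not supported:
        raise ImportError("GPU compute capability too old for flash-attn")
    from flash_attn import flash_attn_func  # import only on supported GPUs
except ImportError:
    flash_attn_func = None  # fall back to standard (e.g. SDPA) attention

print("flash attention available:", flash_attn_func is not None)
```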

> Great! When could it be merged? Thanks a lot.

Most likely by May 1st.

Hi, thanks for the issue. I reproduced the bug using this script: [finetune.zip](https://github.com/hpcaitech/ColossalAI/files/14821735/finetune.zip). This might be due to some unexpected model movement when ZeRO is not used. In most cases ZeRO is used and the...

This happens only when sequence parallelism is on and ZeRO is off. We are rebuilding the sequence parallel API with ring attention etc., so I've set it to False in...
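
As a workaround until the new API lands, you can avoid the problematic combination in your plugin config. A minimal sketch; the exact keyword arguments are an assumption and may differ across ColossalAI versions, so check the `HybridParallelPlugin` signature you have installed:

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

# Either keep ZeRO on when sequence parallelism is enabled, or turn
# sequence parallelism off entirely, to sidestep the affected code path.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=1,
    zero_stage=1,                        # keep ZeRO on ...
    enable_sequence_parallelism=False,   # ... or disable sequence parallelism
)
booster = Booster(plugin=plugin)
```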

This bug seems specific to a minority of TP plans. I'll take another look.

> A quick potential patch is not to use HF's `resize_token_embeddings` and instead use `nn.functional.pad` to resize the tensor while avoiding recreation of `nn.Embedding` (not sure if there are other attributes that...
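
A minimal sketch of what such a patch might look like; `pad_token_embeddings` is a hypothetical helper, new rows are simply zero-initialized here, and tied `lm_head` weights or other cached attributes would need the same treatment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pad_token_embeddings(embedding: nn.Embedding, new_num_tokens: int) -> None:
    """Grow the embedding weight in place with F.pad instead of letting
    resize_token_embeddings rebuild the nn.Embedding module."""
    old_num_tokens, _hidden = embedding.weight.shape
    extra = new_num_tokens - old_num_tokens
    if extra <= 0:
        return
    with torch.no_grad():
        # pad (0, 0) on the hidden dim and (0, extra) rows on the vocab dim
        padded = F.pad(embedding.weight, (0, 0, 0, extra))
        embedding.weight = nn.Parameter(padded)
    embedding.num_embeddings = new_num_tokens  # keep module metadata in sync

emb = nn.Embedding(32000, 4096)
pad_token_embeddings(emb, 32008)
print(emb.weight.shape)  # torch.Size([32008, 4096])
```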

> > I met the same problem. I use Open-Sora to train and got stuck on this step. I notice there exists a file called 'extensions' in the folder 'colossalai/kernel', maybe that's a way...