Wenxuan Tan

Results: 46 comments of Wenxuan Tan

Thanks for raising this issue. It was probably caused by a recent transformers upgrade, so I've pushed a fix. For multi-node training, please refer to the commands in examples/language/llama/README.md

Please refer to similar examples on the PyTorch forums. You can either run Docker in host network mode or map a port from the container to the host. https://discuss.pytorch.org/t/how-to-multi-node-parallel-in-dockers-container/188736 https://discuss.pytorch.org/t/run-multi-node-training-inside-docker/167537
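
As a rough illustration (not taken from the linked threads), the key point is that every container must be able to reach the same rendezvous address and port, either via the host network or via a published port. A minimal sketch, assuming `RANK`/`WORLD_SIZE` are set by your launcher and the address/port below are placeholders:

```python
import os
import torch.distributed as dist

# Each container (or torchrun process) must see the same MASTER_ADDR/MASTER_PORT,
# and that address must be reachable from every node. With `--network host` the
# host IP works directly; with port mapping, the chosen port has to be published
# (e.g. `docker run -p 29500:29500 ...`).
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")  # placeholder: IP of the rank-0 host
os.environ.setdefault("MASTER_PORT", "29500")     # placeholder: the mapped/open port

dist.init_process_group(
    backend="nccl",
    init_method="env://",  # reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
print(f"rank {dist.get_rank()} / {dist.get_world_size()} initialized")
dist.destroy_process_group()
```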

This is not a bug on our end: flash attention doesn't support the V100, which is why it throws a "no kernel" error. You should uninstall flash_attn
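
For reference, one way to avoid this class of error is to gate the import on the GPU's compute capability. A minimal sketch; the `(8, 0)` threshold is an assumption (recent flash-attn releases target Ampere or newer, while the V100 is sm_70), so check the flash-attn docs for your version:

```python
import torch

# Only try flash-attn on GPUs new enough to have matching kernels.
major, minor = torch.cuda.get_device_capability(0)
supported = (major, minor) >= (8, 0)  # assumed minimum; V100 is (7, 0)

try:
    if not supported:
        raise ImportError("GPU compute capability too old for flash-attn")
    from flash_attn import flash_attn_func  # import only on supported GPUs
except ImportError:
    flash_attn_func = None  # fall back to standard (e.g. SDPA) attention

print("flash attention available:", flash_attn_func is not None)
```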

> Great! When could it be merged? Thanks a lot.

Most likely by May 1st.

Hi, thanks for the issue. I reproduced the bug using this script: [finetune.zip](https://github.com/hpcaitech/ColossalAI/files/14821735/finetune.zip). This might be due to some unexpected model movement when ZeRO is not used. In most cases ZeRO is used and the...

This happens only when sequence parallelism is on and ZeRO is off. We are rebuilding the sequence parallel API with ring attention etc., so I've set it to False in...
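
As a workaround until the new API lands, you can avoid the problematic combination in your plugin config. A minimal sketch; the exact keyword arguments are an assumption and may differ across ColossalAI versions, so check the `HybridParallelPlugin` signature you have installed:

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

# Either keep ZeRO on when sequence parallelism is enabled, or turn
# sequence parallelism off entirely, to sidestep the affected code path.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=1,
    zero_stage=1,                        # keep ZeRO on ...
    enable_sequence_parallelism=False,   # ... or disable sequence parallelism
)
booster = Booster(plugin=plugin)
```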

This bug seems specific to a minority of TP plans. I'll take another look.

> A quick potential patch is not to use HF's `resize_token_embeddings` and instead use `nn.functional.pad` to resize the tensor while avoiding recreation of `nn.Embedding` (not sure if there are other attributes that...
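
A minimal sketch of what such a patch might look like; `pad_token_embeddings` is a hypothetical helper, new rows are simply zero-initialized here, and tied `lm_head` weights or other cached attributes would need the same treatment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pad_token_embeddings(embedding: nn.Embedding, new_num_tokens: int) -> None:
    """Grow the embedding weight in place with F.pad instead of letting
    resize_token_embeddings rebuild the nn.Embedding module."""
    old_num_tokens, _hidden = embedding.weight.shape
    extra = new_num_tokens - old_num_tokens
    if extra <= 0:
        return
    with torch.no_grad():
        # pad (0, 0) on the hidden dim and (0, extra) rows on the vocab dim
        padded = F.pad(embedding.weight, (0, 0, 0, extra))
        embedding.weight = nn.Parameter(padded)
    embedding.num_embeddings = new_num_tokens  # keep module metadata in sync

emb = nn.Embedding(32000, 4096)
pad_token_embeddings(emb, 32008)
print(emb.weight.shape)  # torch.Size([32008, 4096])
```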

> > I met the same problem. I use Open-Sora to train and got stuck on this step. I notice there exists a file called 'extensions' in the folder 'colossalai/kernel', maybe that's a way...