TexasRangers86 issues

Results: 3 issues of TexasRangers86

[BUG]: NCCL timeout when using the save_checkpoint function

### 🐛 Describe the bug When I use 1D tensor parallelism (tp) and set the size to n > 1, I get an NCCL timeout. `opt_config = dict(parallel=dict(tensor=dict(mode='1d', size=8)), fp16=dict(mode=AMP_TYPE.TORCH))` `colossalai.launch_from_torch(config=opt_config)` But...

bug

[BUG]: The embedding weights are not assigned when I use GeminiDDP

### 🐛 Describe the bug It seems that the embedding weights are not assigned when I wrap the model with GeminiDDP. The model works when I initialize it with the from_pretrained function, but...

bug

Problem training deepseek3 with version 0.9.2

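### Reminder - [x] I have read the above rules and searched the existing issues. ### System Info Problem description: when training deepseek3 with version 0.9.2 of the library, training hangs at this point. After switching to the qwen 7b model, training hangs at the same point. With the previous container environment on version 0.9.1, using the same launch command and configuration file, training runs normally. I currently suspect a version problem in the environment's multi-node communication libraries, such as deepspeed. For training deepseek3 with the updated 0.9.2 version, is there a problem with the versions of my dependency libraries? [WARNING] async_io requires the dev libaio .so object and...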

bug

pending