TexasRangers86
### 🐛 Describe the bug

When I use 1D tensor parallelism and set the size to n > 1, I get an NCCL timeout.

`opt_config = dict(parallel=dict(tensor=dict(mode='1d', size=8)), fp16=dict(mode=AMP_TYPE.TORCH))`

`colossalai.launch_from_torch(config=opt_config)`

But...
### 🐛 Describe the bug

It seems that the embedding weights are not assigned when I wrap the model with GeminiDDP. The model works when I initialize it with the `from_pretrained` function, but...
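### Reminder

- [x] I have read the above rules and searched the existing issues.

### System Info

Problem description: Training deepseek3 with version 0.9.2 of the library hangs at this point, and switching to the qwen 7b model hangs at the same point. With the previous container environment on version 0.9.1, the same launch command and config file train normally. I currently suspect a version problem in the libraries that handle multi-node communication in this environment, such as deepspeed. Is there a problem with my dependency versions for training deepseek3 on the updated 0.9.2?

[WARNING] async_io requires the dev libaio .so object and...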