SLAM-LLM
NCCL error when saving with DDP
System Info
8×A100 in a Docker environment
Information
- [x] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
Training always aborts after saving the checkpoint for the 249999th step. I presume the model-saving process on rank 0 somehow disrupts NCCL communication. According to the logs, the saving process is nowhere near the NCCL timeout threshold (which should be 30 min by default). Any advice on how to resolve this issue would be helpful!
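For reference, here is a minimal sketch of the save pattern in question. It is not the SLAM-LLM implementation. It assumes the failure comes from ranks falling out of step around the rank-0 save, which is only one possible cause of this symptom. The helper names `init_process_group_with_timeout` and `save_checkpoint_rank0` are hypothetical. The sketch sets an explicit collective timeout instead of relying on the default, and puts a barrier on both sides of the rank-0 write.

```python
import datetime

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def init_process_group_with_timeout(backend: str = "nccl", minutes: int = 60) -> None:
    """Hypothetical helper: create the default process group with an explicit timeout.

    Reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment,
    as set by launchers such as torchrun.
    """
    dist.init_process_group(
        backend=backend,
        timeout=datetime.timedelta(minutes=minutes),
    )


def save_checkpoint_rank0(model: torch.nn.Module, path: str) -> None:
    """Hypothetical helper: write the checkpoint on rank 0 while all ranks stay in step."""
    # Every rank reaches the save point before rank 0 starts writing.
    dist.barrier()
    if dist.get_rank() == 0:
        # Unwrap DDP so the saved keys are not prefixed with "module.".
        to_save = model.module if isinstance(model, DDP) else model
        # No collective calls inside this branch: the other ranks never enter it.
        torch.save(to_save.state_dict(), path)
    # No rank issues its next collective until the file has been written.
    dist.barrier()
```

Note that the second barrier is itself a collective, so a save that takes longer than the configured timeout would still fail there. The sketch only helps if the ranks were drifting apart, or if a collective was being called inside the rank-0 branch.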
Error logs
Expected behavior
An NCCL timeout error occurs after a certain number of training steps.
Same problem. Do you have any solution?
Same problem here too. Do you have any solution?
Same problem. Do you have any solution?