SLAM-LLM

NCCL error when saving with DDP

Open Vindicator645 opened this issue 1 year ago • 3 comments

System Info

8×A100 in a Docker environment

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

Training always aborts after saving the checkpoint at the 249999th step. I presume the model-saving process on rank 0 somehow disrupts NCCL communication. According to the logs, the save takes nowhere near the NCCL timeout threshold (which should be 30 minutes by default). Any advice on how to resolve this issue would be helpful!
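For readers hitting the same symptom, the pattern usually suggested for rank-0-only saving under DDP is sketched below. This is a minimal sketch and not a confirmed fix for this issue: the helper names `init_distributed` and `save_checkpoint` are hypothetical, not part of SLAM-LLM, and it assumes a plain `torch.distributed` setup launched with `torchrun`. It shows two things: passing an explicit `timeout` to `init_process_group` instead of relying on the default, and placing a `barrier()` after the save so the other ranks wait for rank 0 rather than running ahead into the next collective.

```python
import datetime

import torch
import torch.distributed as dist


def init_distributed(timeout_minutes: int = 60) -> None:
    """Initialize the NCCL process group with an explicit collective timeout.

    Assumes the script is launched with torchrun, which sets the rank and
    world-size environment variables. The 60-minute value is illustrative.
    """
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=timeout_minutes),
    )


def save_checkpoint(model: torch.nn.Module, path: str) -> None:
    """Save on rank 0 only, then make every rank wait until the save is done."""
    distributed = dist.is_available() and dist.is_initialized()
    rank = dist.get_rank() if distributed else 0

    if rank == 0:
        # Unwrap DDP so the saved keys are not prefixed with "module.".
        to_save = model.module if hasattr(model, "module") else model
        torch.save(to_save.state_dict(), path)

    if distributed:
        # Every rank, including rank 0, must reach this call.
        dist.barrier()
```

Note that the barrier is itself a collective and is subject to the same timeout, so it only helps if the save finishes within that window. It is also worth checking that nothing inside the rank-0 branch issues a collective call of its own, since a collective entered by only one rank will hang until the timeout fires.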

Error logs

[Image: screenshot of the error logs]

Expected behavior

NCCL timeout error after a certain number of training steps

Vindicator645 avatar Jul 01 '24 06:07 Vindicator645

Same problem. Do you have any solution?

cnlinxi avatar Jul 09 '24 03:07 cnlinxi

Same problem here too. Do you have any solution?

zhangron013 avatar Sep 02 '24 12:09 zhangron013

Same problem. Do you have any solution?

CellBro avatar Sep 28 '25 10:09 CellBro