dlrover
[Error] When using deepspeed to start a megatron training task, only rank 0 of the flash checkpoint saves the model
Thanks for the amazing work accelerating distributed training. When I use `deepspeed train.py` to start a Megatron-LM training task, I get this log:
It seems only rank 0 saves the model weights to shared memory, so the save to disk is blocked. But when I use dlrover-run to start the training task, the flash checkpoint saves the model weights normally.
I'm using Megatron-LM 0.6.0 and dlrover 0.3.6rc0.
You can check whether the other ranks have a non-empty state dict when calling save_checkpoint.
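As a quick diagnostic (a sketch; `check_state_dict` is a hypothetical helper, not a dlrover or Megatron-LM API), you could log from every rank, just before `save_checkpoint` is called, whether its state dict is actually populated:

```python
def check_state_dict(rank, state_dict):
    """Log whether this rank's state dict is empty.

    Call this on every rank right before save_checkpoint. In a real
    training script, `rank` would come from torch.distributed.get_rank().
    Returns True if the state dict has content, False otherwise.
    """
    empty = state_dict is None or len(state_dict) == 0
    print(f"[rank {rank}] state_dict empty: {empty}")
    return not empty
```

If only rank 0 prints a non-empty state dict under `deepspeed train.py`, that would explain why the disk write blocks waiting on the other ranks.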
This issue has been automatically marked as stale because it has not had recent activity.
This issue is being automatically closed due to inactivity.
Hello, did you only use the flash checkpoint feature? I'd like to know whether dlrover-run can be used together with deepspeed. @liangxuZhang