
[Error] When using deepspeed to start a Megatron training task, only rank 0 of the flash checkpoint saves the model


Thanks for the amazing work on accelerating distributed training. When I use `deepspeed train.py` to start a Megatron-LM training task, I get the log below:

[screenshot: training log showing only rank 0 writing to shared memory]

It seems only rank 0 saved the model weights to shared memory, so the save to disk is blocked. But when I use `dlrover-run` to start the training task, the flash checkpoint saves the model weights normally:

[screenshot: training log from the dlrover-run launch, where all ranks save]

Using Megatron-LM 0.6.0 and dlrover 0.3.6rc0
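
For context, a minimal sketch of how flash checkpoint is typically wired into a Megatron training script. The import path follows the dlrover documentation, but treat it and the `maybe_save` helper as assumptions; the reporter's actual `train.py` is not shown in the issue.

```python
# Assumed integration, per the dlrover docs: the flash-checkpoint versions of
# save_checkpoint/load_checkpoint are drop-in replacements for the ones in
# megatron.checkpointing. Every rank first copies its shard of the state dict
# to shared memory; the disk write then happens asynchronously, which is why
# it blocks if some ranks never produce a shard.
from dlrover.trainer.torch.flash_checkpoint.megatron import (
    save_checkpoint,
    load_checkpoint,
)

def maybe_save(iteration, model, optimizer, opt_param_scheduler, save_interval):
    # Hypothetical helper: all ranks must reach this call collectively.
    if iteration % save_interval == 0:
        save_checkpoint(iteration, model, optimizer, opt_param_scheduler)
```

With this wiring, `dlrover-run train.py` also starts the agent that coordinates the asynchronous disk save, which may be the piece that a plain `deepspeed train.py` launch does not set up.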

— liangxuZhang, Jul 17 '24

You can check whether the other ranks have a non-empty state dict when calling save_checkpoint.
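
For example, a minimal diagnostic sketch along these lines (the helper name is hypothetical, and `model` and `optimizer` stand in for your Megatron objects) prints the state dict size on every rank right before the save:

```python
# Hypothetical diagnostic: run on every rank immediately before
# save_checkpoint to see which ranks actually hold weights to persist.
import torch.distributed as dist

def log_state_dict_sizes(model, optimizer):
    rank = dist.get_rank() if dist.is_initialized() else 0
    n_model = len(model.state_dict())
    n_optim = len(optimizer.state_dict().get("state", {}))
    print(f"[rank {rank}] model entries: {n_model}, "
          f"optimizer state entries: {n_optim}", flush=True)
```

An empty state dict on ranks other than 0 would explain why only rank 0 writes its weights to shared memory.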

— workingloong, Jul 19 '24

This issue has been automatically marked as stale because it has not had recent activity.

— github-actions[bot], Oct 18 '24

This issue is being automatically closed due to inactivity.

— github-actions[bot], Oct 25 '24

Hello, did you only use the flash checkpoint feature? I'd like to know whether dlrover-run can be used together with deepspeed. @liangxuZhang

— TomSuen, Oct 31 '24