
Incomplete save of ckpt files

Open husky23333 opened this issue 1 month ago • 0 comments

I am using DLRover with Megatron-DeepSpeed on a machine with 4 GPUs. The hybrid parallel settings are TP: [0,1], [2,3] and DP: [0,2], [1,3]. I have also configured DeepSpeed with ZeRO stage 1. The checkpoint files that actually get saved are shown below:

[image: dlrover-deepspeed]
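For anyone reproducing this, here is a minimal sketch of how the groups above follow from the world size and the tensor-parallel size. It assumes consecutive ranks share a TP group and that there is no pipeline parallelism; `parallel_groups` is a hypothetical helper written for illustration, not a DLRover or Megatron-DeepSpeed API.

```python
def parallel_groups(world_size: int, tp_size: int):
    """Return (tp_groups, dp_groups) for a TP x DP layout.

    Assumption: consecutive ranks form a tensor-parallel group and
    pipeline parallelism is disabled (pp_size == 1).
    """
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    # Each TP group is a block of tp_size consecutive ranks.
    tp_groups = [list(range(start, start + tp_size))
                 for start in range(0, world_size, tp_size)]
    # Each DP group takes the rank at the same offset in every TP group.
    dp_groups = [list(range(offset, world_size, tp_size))
                 for offset in range(tp_size)]
    return tp_groups, dp_groups


tp_groups, dp_groups = parallel_groups(world_size=4, tp_size=2)
print("TP:", tp_groups)  # TP: [[0, 1], [2, 3]]
print("DP:", dp_groups)  # DP: [[0, 2], [1, 3]]
```

With 4 GPUs and a TP size of 2 this reproduces the layout in this report, so the data-parallel rank 1 shards would belong to global ranks 2 and 3.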

Normally, the checkpoint files include the following:

[image]

The `layer_*-model_states.pt` and `zero_pp_rank_1_*optim_states.pt` files are missing.
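A quick way to confirm which file families are absent from a checkpoint directory is sketched below. The two glob patterns are taken from this report; the helper name `missing_patterns` and the directory passed on the command line are assumptions for illustration only.

```python
import sys
from pathlib import Path

# Glob patterns for the file families reported missing in this issue.
EXPECTED_PATTERNS = [
    "layer_*-model_states.pt",
    "zero_pp_rank_1_*optim_states.pt",
]


def missing_patterns(ckpt_dir, patterns=EXPECTED_PATTERNS):
    """Return the patterns that match no file directly inside ckpt_dir."""
    root = Path(ckpt_dir)
    return [p for p in patterns if not any(root.glob(p))]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python check_ckpt.py <checkpoint step directory>
    for pattern in missing_patterns(sys.argv[1]):
        print("missing:", pattern)
```

Running this against the saved step directory on each node should show whether the files were never written or only landed on a different node.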

husky23333 • May 21 '24 01:05