dlrover
Incomplete save of ckpt files
I am using dlrover with Megatron-DeepSpeed, and my machine has 4 GPUs. The hybrid parallel settings are as follows: TP groups [0,1], [2,3]; DP groups [0,2], [1,3]. I have also configured DeepSpeed with ZeRO stage 1. The ckpt files that were actually saved are as follows:
Normally, the ckpt files include these:
The layer_*-model_states.pt and zero_pp_rank_1_*optim_states.pt files are missing.
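For anyone trying to reproduce this, a quick way to confirm which files are absent is to glob the checkpoint directory for the two patterns named above. This is only a minimal sketch: the function name `find_missing_patterns` and the directory argument are hypothetical, and the pattern list contains only the two file names from this report, not the full set of files a checkpoint should have.

```python
from pathlib import Path

# Patterns taken from the file names reported missing above.
# This is not a complete list of what a checkpoint should contain.
EXPECTED_PATTERNS = [
    "layer_*-model_states.pt",
    "zero_pp_rank_1_*optim_states.pt",
]


def find_missing_patterns(ckpt_dir, patterns=EXPECTED_PATTERNS):
    """Return the patterns that match no file directly inside ckpt_dir.

    Hypothetical helper for illustration; it only inspects file names
    and does not verify that the files are complete or loadable.
    """
    root = Path(ckpt_dir)
    if not root.is_dir():
        # Nothing was saved at all, so every pattern counts as missing.
        return list(patterns)
    return [p for p in patterns if not any(root.glob(p))]
```

Calling `find_missing_patterns("<path to one checkpoint step directory>")` returns the patterns that have no matching file, so an empty list means both kinds of file were written.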