
Error occurs in load_checkpoint when using Megatron distributed flash-checkpoint for recovery

Open deepcoldfish opened this issue 1 year ago • 3 comments

Env: 16 GPUs + LLaMA-2 pretrain + Megatron-LM
Strategy: TP 8 + PP 1 + DP 2
Case: when a training process is killed to retrigger fault tolerance with Megatron distributed flash-checkpoint, load_checkpoint in the DP 1 group fails with the following log:

WARNING: on rank 11 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 10 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 14 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
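For orientation, the rank-to-group mapping below is an assumption based on Megatron-LM's default rank ordering (tensor-parallel ranks contiguous), not something stated in the issue; it just shows that the warning ranks 10, 11 and 14 all sit in DP group 1.

```python
# Assumed rank layout for 16 GPUs with TP 8, PP 1 (Megatron-LM's default ordering,
# where tensor-parallel ranks are contiguous). This layout is an assumption.
WORLD_SIZE = 16
TP, PP = 8, 1
DP = WORLD_SIZE // (TP * PP)  # -> 2 data-parallel replicas

for rank in range(WORLD_SIZE):
    tp_rank = rank % TP
    dp_rank = (rank // TP) % DP
    print(f"rank {rank:2d}: tp_rank={tp_rank}, dp_rank={dp_rank}")

# Ranks 0-7 form DP group 0 and ranks 8-15 form DP group 1, so the warnings
# above (ranks 10, 11, 14) all come from DP group 1.
```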

The reason is that the DP 1 group loads the checkpoint from storage because it has no model state in memory, and performs an allreduce inside read_metadata, while the DP 0 group loads only from memory and never enters that collective.
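For context, here is a simplified sketch of what Megatron-LM's read_metadata does (exact code differs between versions): every rank reads the iteration from the tracker file and then all-reduces it with MAX across the whole world. If only part of the world enters this collective, it is mismatched and the resulting "max iteration" can come back as an arbitrary value such as 4160813071.

```python
# Simplified sketch of a Megatron-LM-style read_metadata (details vary by version).
# Every rank is expected to take part in the MAX all-reduce; if the DP 0 group
# skips it because it restores from memory, the collective is mismatched.
import torch
import torch.distributed as dist

def read_metadata(tracker_filename: str) -> int:
    with open(tracker_filename) as f:
        iteration = int(f.read().strip())

    iters = torch.tensor([iteration], dtype=torch.long, device="cuda")
    dist.all_reduce(iters, op=dist.ReduceOp.MAX)  # needs *all* ranks to participate
    max_iter = iters[0].item()

    if iteration != max_iter:
        print(f"WARNING: on rank {dist.get_rank()} found iteration {iteration} "
              f"in the metadata while max iteration across the ranks is {max_iter}, "
              f"replacing it with max iteration.")
    return max_iter
```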

deepcoldfish avatar Aug 13 '24 01:08 deepcoldfish

Can you provide more information? The more detailed, the better, e.g. details of the kill (at which checkpoint step did it fail? which checkpoint step was loaded after failover?).

BalaBalaYi avatar Oct 12 '24 02:10 BalaBalaYi

> Can you provide more information? The more detailed, the better, e.g. details of the kill (at which checkpoint step did it fail? which checkpoint step was loaded after failover?).

During training, after a checkpoint has been saved to memory or storage, a training process (on node 1) is killed to retrigger a restart of the training cluster.

After the restart, all nodes try to recover from memory.

When dp_rank != 0, model_state_dict is empty, so execution enters this branch and calls read_metadata here. Nodes with dp_rank == 0 have model_state_dict in memory and do not take this branch.

read_metadata triggers a global sync across all ranks, which causes the step check to fail as shown in the log above.
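To make that failure mode concrete, here is a minimal, hypothetical repro of the hazard (not DLRover or Megatron code): a collective that only the dp_rank != 0 half of the world enters either hangs, or on NCCL can pair up with an unrelated collective issued by the other ranks and return garbage values.

```python
# Minimal sketch of the hazard: a collective inside a branch that only some
# ranks take. Run with e.g.:  torchrun --nproc_per_node=4 collective_branch_hazard.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    dp_rank = rank % 2  # hypothetical data-parallel rank for this toy example

    iteration = torch.tensor([15], dtype=torch.long)
    if dp_rank != 0:
        # Only half of the ranks enter this branch, but all_reduce expects every
        # rank in the default group to participate. In this toy setup the call
        # blocks forever; with NCCL and other collectives in flight it can
        # instead mismatch and produce a bogus result.
        dist.all_reduce(iteration, op=dist.ReduceOp.MAX)
    print(f"rank {rank}: iteration = {iteration.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```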

deepcoldfish avatar Nov 12 '24 11:11 deepcoldfish

Your code version please (commit id)?

BalaBalaYi avatar Nov 18 '24 11:11 BalaBalaYi