deepcoldfish

Results: 1 issue by deepcoldfish

**Env:** 16 GPUs + llama2 pretrain + megatron-lm **strategy:** TP 8 + PP 1 + DP 2 **case:** when killing a training process to retrigger fault tolerance with megatron-distributed flash-checkpoint, the dp 1 group...
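For readers following the rank layout, here is a rough sketch of how 16 ranks could split into tensor-parallel and data-parallel groups under TP 8 + PP 1 + DP 2. It assumes the tensor-parallel index varies fastest across global ranks (TP ranks contiguous), which is my understanding of Megatron-LM's default ordering. The helper `parallel_groups` is hypothetical and is not part of megatron-lm or the flash-checkpoint code.

```python
def parallel_groups(world_size, tp, pp):
    """Hypothetical helper: enumerate TP and DP groups for a given layout.

    Assumes the tensor-parallel index varies fastest across global ranks,
    then data-parallel, then pipeline-parallel.
    """
    dp = world_size // (tp * pp)
    assert tp * pp * dp == world_size, "world size must equal TP * PP * DP"

    # Tensor-parallel groups: consecutive blocks of `tp` ranks.
    tp_groups = [list(range(start, start + tp))
                 for start in range(0, world_size, tp)]

    # Data-parallel groups: within one pipeline stage, ranks that share
    # the same tensor-parallel index, spaced `tp` ranks apart.
    ranks_per_stage = world_size // pp
    dp_groups = []
    for stage in range(pp):
        for t in range(tp):
            base = stage * ranks_per_stage + t
            dp_groups.append([base + d * tp for d in range(dp)])
    return tp_groups, dp_groups


# Layout from the report: 16 GPUs, TP 8, PP 1 (so DP 2).
tp_groups, dp_groups = parallel_groups(world_size=16, tp=8, pp=1)
print("TP groups:", tp_groups)
print("DP groups:", dp_groups)
```

Under that assumption, each data-parallel group pairs one rank from the first block of eight GPUs with its replica in the second block, so killing a single process affects exactly one DP group and one TP group.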

investigating