deepcoldfish
**Env:** 16 GPUs + llama2 pretrain + megatron-lm **strategy:** TP 8 + PP 1 + DP 2 **case:** when killing a training process to retrigger fault tolerance with megatron-distributed flash-checkpoint, the dp 1 group...
investigating