Megatron-LM
MoE training loss inconsistent after resume from an old checkpoint

Experimental conditions:
- the latest main branch
- using mcore (Megatron Core)
- `--expert-model-parallel-size` > 1
The black line is a run that trains continuously and saves a checkpoint every 100 steps. The blue line is a run that loads the checkpoint saved at step 100 and resumes training from there; its loss diverges from the black line after the resume. What might cause this, and how can it be fixed?
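For anyone trying to pin down where the two runs diverge, a small log-diffing script can make the comparison precise. This is a hypothetical sketch: the log file names and the `lm loss:` line format are assumptions about a typical Megatron-LM training log, so adjust the regex to match your own logs.

```python
import re

# Hypothetical log-comparison sketch: parse "iteration N ... lm loss: X"
# lines from two training logs and report iterations where the runs differ.
# The file names and log format below are assumptions; adapt as needed.
LOSS_RE = re.compile(r"iteration\s+(\d+).*?lm loss:\s*([0-9.eE+-]+)")

def read_losses(path):
    """Return {iteration: loss} parsed from a Megatron-style training log."""
    losses = {}
    with open(path) as f:
        for line in f:
            m = LOSS_RE.search(line)
            if m:
                losses[int(m.group(1))] = float(m.group(2))
    return losses

base = read_losses("continuous_run.log")   # the black line
resumed = read_losses("resumed_run.log")   # the blue line, resumed at step 100
for it in sorted(set(base) & set(resumed)):
    if abs(base[it] - resumed[it]) > 1e-6:
        print(f"iteration {it}: {base[it]:.6f} (continuous) "
              f"vs {resumed[it]:.6f} (resumed)")
```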
Hi @guozhen1997 , we are also debugging this issue. I will ping you as soon as we find the root cause.
Hi @guozhen1997 , this issue is caused by an incorrect implementation of the dual-optimizer state loading function. The fix MR is under review and will be published soon.
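To illustrate the failure mode being described, here is a minimal sketch of the dual-optimizer pattern in plain PyTorch. This is not Megatron's actual code: with expert parallelism, dense and expert parameters are owned by separate optimizers, and a checkpoint load must route each saved state back to the right owner.

```python
import torch

# Minimal sketch (plain PyTorch, not Megatron's implementation):
# dense parameters and expert parameters are handled by two separate
# optimizers whose states are saved and restored as a matched pair.
dense_param = torch.nn.Parameter(torch.randn(4))
expert_param = torch.nn.Parameter(torch.randn(4))
dense_opt = torch.optim.Adam([dense_param], lr=1e-3)
expert_opt = torch.optim.Adam([expert_param], lr=1e-3)

# One step so both optimizers accumulate Adam moments worth restoring.
(dense_param.sum() + expert_param.sum()).backward()
dense_opt.step()
expert_opt.step()

# Save: the checkpoint keeps the two states distinguishable.
ckpt = {"dense": dense_opt.state_dict(), "expert": expert_opt.state_dict()}

# Correct resume: each sub-optimizer gets its own state back. If the
# loading function mis-routes or drops one of these (the bug class
# described above), training resumes with wrong Adam moments and the
# loss curve drifts away from the continuous run, as in the plot.
dense_opt.load_state_dict(ckpt["dense"])
expert_opt.load_state_dict(ckpt["expert"])
```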
Hi @fanshiqing , if we use the legacy checkpointing method instead of distributed checkpointing, will we encounter this issue?
Hi @guozhen1997 and @binxuan , this issue has already been fixed by this commit.
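For anyone who wants to sanity-check a resume locally after picking up the fix, a save/reload round trip on optimizer state should be bitwise identical. Below is a minimal plain-PyTorch sketch of that check, not Megatron's checkpointing API:

```python
import copy
import torch

# Sanity-check sketch (plain PyTorch, not Megatron's checkpointing code):
# after a save/reload round trip, the optimizer state should be bitwise
# identical, so resumed training matches the continuous run step for step.
p = torch.nn.Parameter(torch.randn(8))
opt = torch.optim.Adam([p])
p.sum().backward()
opt.step()

saved = copy.deepcopy(opt.state_dict())  # stand-in for writing a checkpoint
opt.load_state_dict(saved)               # stand-in for resuming from it

def equal(a, b):
    """Recursively compare nested dicts/lists of tensors and scalars."""
    if isinstance(a, dict):
        return a.keys() == b.keys() and all(equal(a[k], b[k]) for k in a)
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(equal(x, y) for x, y in zip(a, b))
    if isinstance(a, torch.Tensor):
        return torch.equal(a, b)
    return a == b

assert equal(saved, opt.state_dict()), "optimizer state changed across resume"
print("optimizer state round-trips bitwise: OK")
```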