
MOE training Loss inconsistent after resume from old checkpoint

Open guozhen1997 opened this issue 10 months ago • 4 comments

Experimental conditions:

  • the latest main branch
  • use mcore
  • expert-model-parallel-size > 1
[Figure: training loss curves]

The black line is a continuous run that saves a checkpoint every 100 steps. The blue line loads the checkpoint saved at step 100 and resumes from there. What might cause this divergence, and how can it be fixed?
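One way to make this kind of report precise is to compare the two loss curves step by step and report where they first diverge. The sketch below is a minimal, hypothetical helper (not part of Megatron-LM) for checking whether a resumed run bitwise-reproduces the continuous run after the resume step:

```python
# Hypothetical helper, assuming both runs log per-step losses starting at the
# resume step. The function name and tolerance are assumptions for illustration.

def first_divergence(continuous, resumed, tol=1e-6):
    """Return the index of the first step where the resumed run's loss
    deviates from the continuous run's loss by more than `tol`,
    or None if the curves agree everywhere."""
    for i, (a, b) in enumerate(zip(continuous, resumed)):
        if abs(a - b) > tol:
            return i
    return None

# Example: the resumed run matches for two steps, then drifts.
cont = [0.90, 0.85, 0.80]   # continuous run, steps 100..102
res  = [0.90, 0.85, 0.95]   # resumed run, same steps
print(first_divergence(cont, res))  # -> 2
```

With deterministic data ordering and identical seeds, a correct checkpoint restore should give `None` (no divergence) within floating-point tolerance; an immediate divergence at the resume step points at state that was not restored correctly.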

guozhen1997 commented Apr 01 '24

Hi @guozhen1997, we are also debugging this issue. I will ping you as soon as we find the root cause.

fanshiqing commented Apr 01 '24

Hi @guozhen1997, this issue is caused by an incorrect implementation of the dual-optimizer state-loading function; the fix MR is under review and will be published soon.
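For readers unfamiliar with the failure mode: with expert model parallelism, dense and expert parameters can be managed by separate optimizers, so a checkpoint carries two optimizer state dicts that must each be restored into the right optimizer. The toy sketch below illustrates how a mix-up in that mapping silently corrupts resumed state; all names here are assumptions for illustration, not Megatron-LM's actual API:

```python
# Toy illustration of a dual-optimizer state-loading bug (hypothetical names).
# Restoring each optimizer's state into the wrong optimizer leaves momentum
# and variance buffers mismatched with their parameters, so training resumes
# from inconsistent state and the loss curve diverges.

def save_dual(dense_state, expert_state):
    """Bundle both optimizers' states into one checkpoint dict."""
    return {"dense": dense_state, "expert": expert_state}

def load_dual_buggy(ckpt):
    """Bug: keys swapped, expert state lands in the dense optimizer."""
    return ckpt["expert"], ckpt["dense"]

def load_dual_fixed(ckpt):
    """Correct: each state dict goes back to its own optimizer."""
    return ckpt["dense"], ckpt["expert"]

ckpt = save_dual({"momentum": [0.1]}, {"momentum": [9.9]})
assert load_dual_fixed(ckpt) == ({"momentum": [0.1]}, {"momentum": [9.9]})
assert load_dual_buggy(ckpt) != ({"momentum": [0.1]}, {"momentum": [9.9]})
```

The model weights themselves load correctly in such a scenario, which is why the resumed loss starts near the right value and only drifts as the corrupted optimizer state takes effect.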

fanshiqing commented Apr 01 '24

Hi @fanshiqing, if we use the legacy checkpointing method instead of distributed checkpointing, will we encounter this issue?

binxuan commented Apr 03 '24

Hi @guozhen1997 and @binxuan, this issue has been fixed by this commit.

fanshiqing commented Apr 04 '24

Marking as stale. No activity in 60 days.

github-actions[bot] commented Jun 03 '24