
MOE training Loss inconsistent after resume from old checkpoint

Open guozhen1997 opened this issue 10 months ago • 4 comments

Experimental conditions:

  • the latest main branch
  • use mcore
  • expert-model-parallel-size > 1
[Figure: training loss curves]

The black line is a continuous run that saves a checkpoint every 100 steps. The blue line loads the checkpoint saved at step 100 and resumes from there. What might cause this divergence, and how can it be fixed?
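One way to make this kind of report precise is to compare the two loss curves step by step and report where they first diverge. The sketch below is a minimal, hypothetical helper (not part of Megatron-LM) for checking whether a resumed run bitwise-reproduces the continuous run after the resume step:

```python
# Hypothetical helper, assuming both runs log per-step losses starting at the
# resume step. The function name and tolerance are assumptions for illustration.

def first_divergence(continuous, resumed, tol=1e-6):
    """Return the index of the first step where the resumed run's loss
    deviates from the continuous run's loss by more than `tol`,
    or None if the curves agree everywhere."""
    for i, (a, b) in enumerate(zip(continuous, resumed)):
        if abs(a - b) > tol:
            return i
    return None

# Example: the resumed run matches for two steps, then drifts.
cont = [0.90, 0.85, 0.80]   # continuous run, steps 100..102
res  = [0.90, 0.85, 0.95]   # resumed run, same steps
print(first_divergence(cont, res))  # -> 2
```

With deterministic data ordering and identical seeds, a correct checkpoint restore should give `None` (no divergence) within floating-point tolerance; an immediate divergence at the resume step points at state that was not restored correctly.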

guozhen1997 commented Apr 01 '24

Hi @guozhen1997, we are also debugging this issue. I will ping you as soon as we find the root cause.

fanshiqing commented Apr 01 '24

Hi @guozhen1997, this issue is caused by an incorrect implementation of the dual-optimizer state-loading function; the fix MR is under review and will be published soon.
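For readers unfamiliar with the failure mode: with expert model parallelism, dense and expert parameters can be managed by separate optimizers, so a checkpoint carries two optimizer state dicts that must each be restored into the right optimizer. The toy sketch below illustrates how a mix-up in that mapping silently corrupts resumed state; all names here are assumptions for illustration, not Megatron-LM's actual API:

```python
# Toy illustration of a dual-optimizer state-loading bug (hypothetical names).
# Restoring each optimizer's state into the wrong optimizer leaves momentum
# and variance buffers mismatched with their parameters, so training resumes
# from inconsistent state and the loss curve diverges.

def save_dual(dense_state, expert_state):
    """Bundle both optimizers' states into one checkpoint dict."""
    return {"dense": dense_state, "expert": expert_state}

def load_dual_buggy(ckpt):
    """Bug: keys swapped, expert state lands in the dense optimizer."""
    return ckpt["expert"], ckpt["dense"]

def load_dual_fixed(ckpt):
    """Correct: each state dict goes back to its own optimizer."""
    return ckpt["dense"], ckpt["expert"]

ckpt = save_dual({"momentum": [0.1]}, {"momentum": [9.9]})
assert load_dual_fixed(ckpt) == ({"momentum": [0.1]}, {"momentum": [9.9]})
assert load_dual_buggy(ckpt) != ({"momentum": [0.1]}, {"momentum": [9.9]})
```

The model weights themselves load correctly in such a scenario, which is why the resumed loss starts near the right value and only drifts as the corrupted optimizer state takes effect.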

fanshiqing commented Apr 01 '24

Hi @fanshiqing, if we use the legacy checkpointing method instead of distributed checkpointing, will we encounter this issue?

binxuan commented Apr 03 '24

Hi @guozhen1997 and @binxuan, this issue has been fixed by this commit.

fanshiqing commented Apr 04 '24

Marking as stale. No activity in 60 days.

github-actions[bot] commented Jun 03 '24