Jonah Noh

I am also seeing inconsistent loss values when resuming from a checkpoint in mcore-v0.12. This happens with Megatron's distributed checkpointing format, `--ckpt-format=torch_dist`. If I set `--ckpt-format=torch`, I am able to resume...