[QUESTION] found NaN in local grad norm in backward pass before data-parallel communication collective

Open ftgreat opened this issue 1 year ago • 2 comments

While continuing training of MoE models (loading an existing checkpoint), an assertion fails at some steps with the message: "found NaN in local grad norm in backward pass before data-parallel communication collective". The check is here: https://github.com/NVIDIA/Megatron-LM/blob/caf2007e080d65dd7488be7bd409b366e225ab5f/megatron/core/distributed/param_and_grad_buffer.py#L115

Main Settings

  • tp=1,pp=8,ep=2
  • use_mcore=True
  • impl=transformers_engine
  • distributed_optimizer=True.

Questions

    1. At step A, the assertion failed. However, after resuming training from the latest checkpoint, the assertion did not fail at step A (the sample sequence is fixed). In addition, during the resumed run, apart from the loss at the very first step, the losses of all subsequent steps show tiny numeric differences. Could you explain why?
    2. How can I track down the NaN error above? Could you give me some advice on how to debug it? Thanks.
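
As a starting point for the second question, the following is a minimal sketch of how one might locate the parameters whose gradients are non-finite right after the backward pass. It is not taken from Megatron-LM: `find_nonfinite_grads` is a hypothetical helper, and reading the gradient from a `main_grad` attribute (with a fallback to the standard `.grad`) is an assumption about how the wrapped model stores gradients.

```python
import torch


def find_nonfinite_grads(model: torch.nn.Module) -> list[str]:
    """Return the names of parameters whose gradient contains NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        # Assumption: a DDP wrapper may accumulate gradients into `main_grad`;
        # otherwise fall back to the standard PyTorch `.grad` attribute.
        grad = getattr(param, "main_grad", None)
        if grad is None:
            grad = param.grad
        if grad is not None and not torch.isfinite(grad).all():
            bad.append(name)
    return bad
```

Calling this on each rank after `backward()` and logging the returned names together with the step and rank can narrow the problem down to a layer or expert. PyTorch's `torch.autograd.set_detect_anomaly(True)` can additionally report the backward function that produced the NaN, at a noticeable cost in speed.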

ftgreat avatar Apr 16 '24 05:04 ftgreat

I got the same error when using Megatron to train a DeepSeek model for SFT. Does anybody know what the problem is?

D1026 avatar Apr 19 '24 02:04 D1026

@D1026 did you train a DeepSeek dense model or a DeepSeek-MoE model? This error is often caused by the data. However, in my case the data seems OK. I am not sure whether this case is related to MoE pretraining.

ftgreat avatar Apr 20 '24 11:04 ftgreat

Same issue!

980202006 avatar May 27 '24 02:05 980202006

Some zero data caused it!
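
The thread does not say what "zero data" means. One possible reading, which is purely an assumption here and not confirmed by anyone in the thread, is a sample whose loss mask is entirely zero, so that a loss averaged over unmasked tokens divides by zero. Under that assumption, a dataset scan could be sketched as follows; `find_fully_masked_samples` is a hypothetical helper, not part of Megatron-LM.

```python
from typing import Iterable, Sequence


def find_fully_masked_samples(loss_masks: Iterable[Sequence[float]]) -> list[int]:
    """Return indices of samples whose loss mask has no nonzero entry.

    Assumption: each sample carries a per-token loss mask (1 = contributes
    to the loss, 0 = ignored). Empty masks are reported as well.
    """
    return [i for i, mask in enumerate(loss_masks) if not any(mask)]
```

Samples flagged this way can be dropped or inspected before training to see whether the NaN steps line up with them.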

980202006 avatar Jun 03 '24 03:06 980202006

@980202006 could you explain the root cause and what the zero data looks like? Thanks.

ftgreat avatar Jun 05 '24 10:06 ftgreat

Hello, have you resolved this issue?

lintao-common avatar Jul 08 '24 14:07 lintao-common

I hit the same issue. Have you resolved it?

Yifei-Zuo avatar Jul 22 '24 08:07 Yifei-Zuo

Any ideas about this issue? I have the same problem.

MangoFF avatar Jul 27 '24 08:07 MangoFF

same issue

1195343015 avatar Oct 06 '24 07:10 1195343015