Megatron-LM
[QUESTION] found NaN in local grad norm in backward pass before data-parallel communication collective
During continued training of MoE models (loading an existing ckpt), at some steps an assert error occurred: "found NaN in local grad norm in backward pass before data-parallel communication collective". https://github.com/NVIDIA/Megatron-LM/blob/caf2007e080d65dd7488be7bd409b366e225ab5f/megatron/core/distributed/param_and_grad_buffer.py#L115
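For reference, the check that trips can be sketched roughly like this: a squared L2 norm is computed over the local gradients, and any NaN among them propagates into the norm, which the assertion catches before the data-parallel all-reduce. This is a hypothetical, simplified sketch, not the actual Megatron-LM code; the function name and flat-list interface are illustrative.

```python
import math

def check_local_grad_norm(grads):
    # Squared L2 norm over a flat list of gradient values. A single NaN
    # anywhere in the gradients makes the whole sum NaN, so checking the
    # norm is a cheap way to detect NaN grads before the data-parallel
    # communication collective. (Simplified sketch, not Megatron's code.)
    norm_sq = sum(g * g for g in grads)
    assert not math.isnan(norm_sq), (
        "found NaN in local grad norm in backward pass "
        "before data-parallel communication collective"
    )
    return math.sqrt(norm_sq)
```

Because the NaN is only detected in the aggregated norm, the assertion tells you *that* some gradient went bad, not *which* parameter produced it.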
Main Settings
- tp=1, pp=8, ep=2
- use_mcore=True
- impl=transformers_engine
- distributed_optimizer=True
Questions
- At step A, the assert error occurred. However, when resuming training from the latest ckpt, the assert error does not happen at step A (the sample sequence is fixed). Besides, during the resumed run, the losses of all steps except the very first show tiny numeric differences from the original run. Could you explain the reasons?
- How can I track down the above NaN error? Could you give me some advice on debugging details? Thanks.
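One common way to localize such a NaN (a suggestion, not an official Megatron workflow) is to inspect per-parameter gradients in backward order and report the first one containing a non-finite value, e.g. via per-parameter grad hooks (`param.register_hook` in PyTorch) that feed a scanner like the sketch below. The function name and the `(name, values)` interface are made up for illustration.

```python
import math

def first_nonfinite(named_grads):
    # Scan (name, gradient_values) pairs, e.g. collected from backward
    # hooks, and return the name of the first parameter whose gradient
    # contains NaN or Inf. Returns None if all gradients are finite.
    for name, values in named_grads:
        if any(not math.isfinite(v) for v in values):
            return name
    return None
```

In practice you would run with a fixed data order, log the offending parameter name and the current batch index when this fires, and then inspect that batch (and that layer's inputs/activations) in isolation.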
I got the same error when using Megatron to train a DeepSeek model for SFT. Does anybody know what the problem is?
@D1026 did you train the DeepSeek dense model or the DeepSeek-MoE model? Often this error is caused by the data. However, in my case the data seems OK. I am not sure whether this case is related to MoE pretraining.
Same issue!
Some zero data caused it!
@980202006 could you explain this root cause and what zero data looks like? Thanks.
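One plausible mechanism for "zero data" causing NaN (an assumption based on the comment above, not a confirmed diagnosis of this issue) is normalization of an all-zero sample: an RMS-style norm of a zero vector is zero, and dividing by it without an epsilon produces NaN/Inf in float tensor math. A minimal pure-Python sketch:

```python
import math

def rms_norm(x, eps=0.0):
    # Root-mean-square normalization. For an all-zero input, rms == 0;
    # with eps == 0 the division blows up. Plain Python floats raise
    # ZeroDivisionError here, whereas framework tensor math silently
    # yields NaN/Inf that then propagates into the gradients.
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]
```

If this is the mechanism, filtering out all-zero (or all-padding) samples, or checking that every normalization in the model has a nonzero epsilon, would be the places to look.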
Hello, have you resolved this issue?
Met the same issue, have you resolved it?
Any idea about this issue? I'm hitting the same problem.
same issue