Megatron-LM
[QUESTION] found NaN in local grad norm in backward pass before data-parallel communication collective
During continued training of MoE models (loading an existing ckpt), at some steps an assert error occurred: "found NaN in local grad norm in backward pass before data-parallel communication collective". https://github.com/NVIDIA/Megatron-LM/blob/caf2007e080d65dd7488be7bd409b366e225ab5f/megatron/core/distributed/param_and_grad_buffer.py#L115
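For reference, the check that trips can be sketched roughly like this: a squared L2 norm is computed over the local gradients, and any NaN among them propagates into the norm, which the assertion catches before the data-parallel all-reduce. This is a hypothetical, simplified sketch, not the actual Megatron-LM code; the function name and flat-list interface are illustrative.

```python
import math

def check_local_grad_norm(grads):
    # Squared L2 norm over a flat list of gradient values. A single NaN
    # anywhere in the gradients makes the whole sum NaN, so checking the
    # norm is a cheap way to detect NaN grads before the data-parallel
    # communication collective. (Simplified sketch, not Megatron's code.)
    norm_sq = sum(g * g for g in grads)
    assert not math.isnan(norm_sq), (
        "found NaN in local grad norm in backward pass "
        "before data-parallel communication collective"
    )
    return math.sqrt(norm_sq)
```

Because the NaN is only detected in the aggregated norm, the assertion tells you *that* some gradient went bad, not *which* parameter produced it.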
Main Settings
- tp=1, pp=8, ep=2
- use_mcore=True
- impl=transformers_engine
- distributed_optimizer=True
Questions
- At step A, the assert error occurred. However, when resuming training from the latest ckpt, the assert error does not happen at step A (the sample sequence is fixed). Besides, during the resumed run, the losses of all steps except the very first show tiny numeric differences from the original run. Could you explain the reasons?
- How can I track down the above NaN error? Could you give me some advice on debugging details? Thanks.
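One common way to localize such a NaN (a suggestion, not an official Megatron workflow) is to inspect per-parameter gradients in backward order and report the first one containing a non-finite value, e.g. via per-parameter grad hooks (`param.register_hook` in PyTorch) that feed a scanner like the sketch below. The function name and the `(name, values)` interface are made up for illustration.

```python
import math

def first_nonfinite(named_grads):
    # Scan (name, gradient_values) pairs, e.g. collected from backward
    # hooks, and return the name of the first parameter whose gradient
    # contains NaN or Inf. Returns None if all gradients are finite.
    for name, values in named_grads:
        if any(not math.isfinite(v) for v in values):
            return name
    return None
```

In practice you would run with a fixed data order, log the offending parameter name and the current batch index when this fires, and then inspect that batch (and that layer's inputs/activations) in isolation.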
I got the same error when using Megatron to train a DeepSeek model for SFT. Does anybody know what the problem is?
@D1026 did you train the DeepSeek dense model or the DeepSeek-MoE model? Often this error is caused by the data. However, in my case the data seems OK. I am not sure whether this case is related to MoE pretraining.
Same issue!
Some zero data caused it!
@980202006 could you explain this root cause and what zero data looks like? Thanks.
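One plausible mechanism for "zero data" causing NaN (an assumption based on the comment above, not a confirmed diagnosis of this issue) is normalization of an all-zero sample: an RMS-style norm of a zero vector is zero, and dividing by it without an epsilon produces NaN/Inf in float tensor math. A minimal pure-Python sketch:

```python
import math

def rms_norm(x, eps=0.0):
    # Root-mean-square normalization. For an all-zero input, rms == 0;
    # with eps == 0 the division blows up. Plain Python floats raise
    # ZeroDivisionError here, whereas framework tensor math silently
    # yields NaN/Inf that then propagates into the gradients.
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]
```

If this is the mechanism, filtering out all-zero (or all-padding) samples, or checking that every normalization in the model has a nonzero epsilon, would be the places to look.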
Hello, have you resolved this issue?
Met the same issue, have you resolved it?
Any idea about this issue? I'm hitting the same problem.
same issue