
Loss is nan, stopping training

Open Qinger27 opened this issue 7 months ago • 2 comments

Training on my own dataset fails with the error: Loss is nan, stopping training.

Qinger27 avatar Jun 03 '25 02:06 Qinger27

Hello, I was also getting a "Loss is nan" error that would randomly stop my training. I believe I have found a specific fix, especially for those of you using Automatic Mixed Precision (--use-amp).

The problem appears to be the default epsilon value (eps=1e-8) in the AdamW optimizer. float16 cannot represent 1e-8 (its smallest positive subnormal is about 6e-8), so under AMP the epsilon effectively rounds to zero and no longer guards the denominator in the optimizer's update step; a zero or near-zero denominator there produces NaN values.
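You can see the underflow directly in PyTorch with a minimal check (independent of DEIM):

import torch

# 1e-8 is below float16's smallest positive subnormal (2**-24 ≈ 5.96e-8),
# so it rounds to zero; 1e-7 survives the cast.
print(torch.tensor(1e-8).half())  # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-7).half())  # tensor(1.1921e-07, dtype=torch.float16)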

I was able to fix this and get my training to run stably by setting a slightly larger eps (1e-7) in my .yml config file:

optimizer:
  type: AdamW
  # ... other params
  eps: 0.0000001  # = 1e-7; plain decimal form so every YAML loader parses it as a float
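
If you build the optimizer in code rather than through the config, the equivalent change is just passing the larger eps to AdamW. A minimal sketch (the model and learning rate below are placeholders, not DEIM's values):

import torch
from torch import nn

model = nn.Linear(16, 1)  # placeholder model
# eps raised from AdamW's 1e-8 default to 1e-7
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, eps=1e-7)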

This prevents the numerical instability. There's a more detailed discussion of this problem in issue #72, and the underlying PyTorch behavior is documented here: pytorch/pytorch#26218.
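
For anyone debugging this, here is a minimal sketch of an AMP training step with a finite-loss guard of the kind that prints the message above; the model and data are dummies, not DEIM's actual training loop:

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, eps=1e-7)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)

for step in range(10):
    optimizer.zero_grad()
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = nn.functional.mse_loss(model(x), y)
    # Guard of the kind that produces "Loss is nan, stopping training"
    if not torch.isfinite(loss):
        print("Loss is nan, stopping training")
        break
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()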

Hope this helps!

EwertzJN avatar Jul 31 '25 09:07 EwertzJN

Thank you very much for your interest in and attention to our work. Nice find, @EwertzJN!

ShihuaHuang95 avatar Nov 01 '25 00:11 ShihuaHuang95