Loss is nan, stopping training
When training on my own dataset, I get the error "Loss is nan, stopping training".
Hello, I was also getting a "Loss is nan" error that would randomly stop my training. I believe I have found a specific fix, especially for those of you using Automatic Mixed Precision (--use-amp).
The problem appears to be the default epsilon value (eps=1e-8) in the AdamW optimizer. When using float16 precision with AMP, this value can be too small and cause a division-by-zero error during the optimizer's update step, which results in NaN values.
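You can sanity-check this reasoning with a small standalone snippet (not from this repo, just an illustration): 1e-8 sits below float16's smallest subnormal value, so it underflows to exactly zero in half precision, while 1e-7 is still representable.

```python
import torch

# AdamW's default eps=1e-8 is smaller than float16's smallest subnormal
# (~5.96e-8), so it underflows to exactly zero in half precision.
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)

# 1e-7 is still representable (rounded to ~1.19e-7), so the eps term in the
# Adam denominator sqrt(v_hat) + eps can no longer vanish to exact zero.
print(torch.tensor(1e-7, dtype=torch.float16))  # tensor(1.1921e-07, dtype=torch.float16)
```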
I was able to fix this and get my training to run stably by setting a slightly larger eps (1e-7) in my .yml config file:
```yaml
optimizer:
  type: AdamW
  # ... other params
  eps: 1.0e-7   # AdamW default is 1e-8
```
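If you build the optimizer directly in Python instead of through a config file, the equivalent change would look roughly like this (the model and learning rate below are placeholders, not values from this repo):

```python
import torch

model = torch.nn.Linear(128, 10)  # placeholder model

# Same idea as the .yml change above: raise eps from the AdamW default
# of 1e-8 to 1e-7 so it stays nonzero under float16/AMP.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, eps=1e-7)
```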
This prevents the numerical instability. There is a more detailed discussion of this specific problem in issue #72, and the underlying PyTorch issue is documented in pytorch/pytorch#26218.
Hope this helps!
Thank you very much for your interest in our work. @EwertzJN Nice!