
Loss becomes NaN after a while

OleguerCanal opened this issue 3 years ago • 3 comments

Environment info

  • Platform: Ubuntu 20.04
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.9.0+cu111 (PyTorch Lightning 1.5.8)
  • Using GPU in script?: 4x A100

Information

Model I am using (ListenAttendSpell, Transformer, Conformer ...): conformer_lstm

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

To reproduce

Steps to reproduce the behavior: I can't seem to reproduce it with the example dataset.

Expected behavior

The model is training on a very large dataset. A priori, everything seems to be behaving correctly: loss, WER, and CER are going down as expected. However, all of a sudden, the loss randomly goes to NaN, from which it is impossible to recover. Do you guys have any ideas or suggestions?

I added a snippet that zeros the gradients when the loss is NaN, so as not to update the model. However, the NaNs appear more and more frequently as training progresses, effectively rendering most of the training steps useless.
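For reference, a minimal sketch of this kind of guard in PyTorch Lightning (the `compute_loss` helper is hypothetical, standing in for the actual openspeech forward/criterion call):

```python
import torch
import pytorch_lightning as pl


class SafeModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # `compute_loss` is a hypothetical stand-in for the model's
        # actual forward pass and criterion.
        loss = self.compute_loss(batch)

        # Returning None from training_step makes PyTorch Lightning
        # skip the backward pass and optimizer step for this batch,
        # which avoids poisoning the weights without having to zero
        # the gradients manually after the fact.
        if not torch.isfinite(loss):
            return None

        return loss
```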

OleguerCanal avatar Jan 22 '22 10:01 OleguerCanal

Can you attach the log? cc @upskyy

sooftware avatar Feb 01 '22 13:02 sooftware

Hey, after some experimenting I managed to avoid it.

I'm not 100% sure, but I believe the issue was training with 16-bit precision. Do you think this might be the case?
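A sketch of the kind of change involved, assuming the standard PyTorch Lightning 1.5 Trainer arguments (not the exact config used here):

```python
import pytorch_lightning as pl

# Two common mitigations for fp16 NaN losses: train in full precision,
# or keep fp16 and clip gradients so the loss scaler has a chance to
# recover from occasional overflows. The values below are assumptions.
trainer = pl.Trainer(
    gpus=4,                 # matches the 4x A100 setup above
    precision=32,           # full precision instead of precision=16
    gradient_clip_val=5.0,  # optional: bound gradient norms
)
```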

OleguerCanal avatar Feb 04 '22 15:02 OleguerCanal

In my case, it happened with 32-bit precision.

JaeungHyun avatar May 05 '22 11:05 JaeungHyun