openspeech
Loss becomes NaN after a while
Environment info
- Platform: Ubuntu 20.04
- Python version: 3.8.10
- PyTorch version (GPU?): 1.9.0+cu111 (pytorch lightning 1.5.8)
- Using GPU in script?: 4x A100
Information
Model I am using (ListenAttendSpell, Transformer, Conformer ...): conformer_lstm
The problem arises when using:
- [ ] the official example scripts: (give details below)
- [x] my own modified scripts: (give details below)
To reproduce
Steps to reproduce the behavior: I can't seem to reproduce it on the example dataset.
Expected behavior
The model is training on a very large dataset. At first everything seems to behave correctly: loss, WER, and CER all go down as expected. However, all of a sudden the loss randomly goes to NaN, and it is impossible to recover from it. Do you have any ideas or suggestions?
I added a snippet that zeros the gradients when the loss is NaN, so that the model is not updated on those batches. However, the NaNs appear more and more frequently as training progresses, effectively rendering most of the training steps useless.
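For reference, here is a rough sketch of the kind of guard I mean (the `NaNGuardMixin` name and the `on_after_backward` hook approach are just how I'd express it, not actual openspeech code):

```python
import torch
import pytorch_lightning as pl


class NaNGuardMixin(pl.LightningModule):
    """Sketch: skip the optimizer step when gradients blow up."""

    def on_after_backward(self) -> None:
        # After backward(), inspect every gradient; if any value is
        # NaN/Inf, zero all gradients so the optimizer step is a no-op
        # and the model weights are left untouched for this batch.
        grads_finite = all(
            torch.isfinite(p.grad).all()
            for p in self.parameters()
            if p.grad is not None
        )
        if not grads_finite:
            self.zero_grad()
```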
Can you attach the log? cc. @upskyy
Hey, after some experimenting I managed to avoid it.
I'm not 100% sure, but I believe the issue was training with 16-bit precision. Do you think this might be the case?
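If it helps, this is roughly the Trainer setup I ended up with (a plain `pytorch_lightning.Trainer` sketch rather than the actual openspeech launch command; `gradient_clip_val` is an extra safety net I tried, not something I'm certain is required):

```python
import pytorch_lightning as pl

# Sketch only: argument names follow pytorch-lightning 1.5.x.
trainer = pl.Trainer(
    gpus=4,
    strategy="ddp",
    precision=32,            # back to fp32; fp16 seemed to trigger the NaNs
    gradient_clip_val=5.0,   # clip exploding gradients as an extra precaution
)
```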
In my case, it happened with 32-bit precision.