ast
Epoch: [4][160156/161048] training diverged...
Hi Yuan Gong, great job! I used the same hyperparameters as your GitHub code, but during training, at "Epoch: [4][160156/161048]", "Train Loss is nan" appears.
The results of the first 3 epochs are 0.415, 0.439, 0.447, compared with the results given in your log: 0.415, 0.439, 0.448, 0.449, 0.449.
My torch version is 2.0.0. Why does this happen?
Hi there,
The NaN error can be caused by an overflow or underflow; it is hard for me to identify the exact reason. It might be related to PyTorch and the hardware.
You could try two workarounds:
- Run the experiment again and see whether the error persists.
- We used older torch and torchaudio versions in 2021. Please see https://github.com/YuanGongND/ast/blob/master/requirements.txt; you could try creating a virtual environment with our versions.
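To narrow down where an overflow/underflow first shows up on a rerun, a small check like the one below might help. This is only a sketch and not part of the AST code: the helper name `first_non_finite` and the example loss values are hypothetical, and it assumes you collect each step's loss as a plain Python float (for example via `loss.item()`).

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/inf loss value, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Hypothetical per-step loss values, e.g. gathered with loss.item() in the training loop.
example_losses = [0.52, 0.48, float("nan"), 0.47]

bad_step = first_non_finite(example_losses)
if bad_step is not None:
    # Stopping here makes it easier to inspect the batch and model state that caused it.
    print(f"Loss became non-finite at step {bad_step}; stop and inspect this batch.")
```

Knowing the exact step lets you compare the failing batch across torch versions instead of waiting for the epoch to finish.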
-Yuan
Thanks for the suggestion, I will run it with a lower version of torch.