
Epoch: [4][160156/161048] training diverged...

Open xiaoli1996 opened this issue 2 years ago • 3 comments

Hi Yuan Gong, great job! I used the same hyperparameters as your GitHub code, but at "Epoch: [4][160156/161048]" training reported "Train Loss is nan".

The results of the first 3 epochs are 0.415, 0.439, 0.447. Compare these with the results given in your log: 0.415, 0.439, 0.448, 0.449, 0.449.

My torch version is 2.0.0. Why does this happen?

xiaoli1996 avatar Jun 16 '23 07:06 xiaoli1996

image

xiaoli1996 avatar Jun 16 '23 07:06 xiaoli1996

hi there,

The NaN loss can be caused by a numerical overflow or underflow, but it is hard for me to identify the exact reason. It might be related to the PyTorch version or the hardware.
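As a rough illustration of how an overflow can turn into NaN, and how a training loop might stop at the first bad step, here is a minimal sketch in plain Python. The helper name `check_loss` is hypothetical and not part of the AST repository; in a PyTorch loop one would typically pass it `loss.item()`.

```python
import math


def check_loss(loss_value: float, step: int) -> None:
    """Raise as soon as the loss is no longer a finite number.

    Hypothetical helper for illustration; not part of the AST code base.
    """
    if not math.isfinite(loss_value):
        raise FloatingPointError(
            f"non-finite loss {loss_value!r} at step {step}"
        )


# A float64 value near the top of its range overflows to inf when scaled up.
big = 1e308
overflowed = big * 10

# Subtracting inf from inf is undefined, so the result is nan.
nan_value = overflowed - overflowed

# A normal loss passes silently; a nan loss is caught immediately.
check_loss(0.447, step=0)
try:
    check_loss(nan_value, step=1)
except FloatingPointError as err:
    print(f"stopped training: {err}")
```

Stopping at the first non-finite loss makes it easier to see which batch and step triggered the problem, rather than discovering NaN only in the epoch summary.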

You could try two workarounds:

  • Run the experiment again and see if the error still occurs.
  • We used older torch and torchaudio versions in 2021. Please see https://github.com/YuanGongND/ast/blob/master/requirements.txt; you could try creating a virtual environment with our versions.
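To check whether a local environment matches the pinned versions, a small script along these lines might help. It is only a sketch: the function names are hypothetical, it only understands `name==version` lines, and the sample package names in the usage section are placeholders rather than the actual contents of requirements.txt.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Collect 'name==version' pins from requirements-style text."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each mismatched package to (pinned version, installed version)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Placeholder pins for illustration only; read the real file in practice,
# e.g. parse_pins(open("requirements.txt").read()).
sample = "examplepkg-a==1.0.0\nexamplepkg-b==2.1  # hypothetical\n"
for name, (wanted, found) in find_mismatches(parse_pins(sample)).items():
    print(f"{name}: pinned {wanted}, installed {found}")
```

Note that an installed build may carry a local suffix (for example a CUDA tag after a `+`), which this exact string comparison would report as a mismatch.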

-Yuan

YuanGongND avatar Jun 16 '23 08:06 YuanGongND

Thanks for the suggestion, I will run it with a lower version of torch.

xiaoli1996 avatar Jun 16 '23 08:06 xiaoli1996