
NaN error when training the TransformerXL model on the yc2 dataset

Open · robert1015 opened this issue 3 years ago · 1 comment

I encountered the following error when training the TransformerXL model on the yc2 dataset.

Traceback (most recent call last):
  File "src/train.py", line 635, in <module>
    main()
  File "src/train.py", line 631, in main
    train(model, train_loader, val_loader, device, opt)
  File "src/train.py", line 329, in train
    model, training_data, optimizer, ema, device, opt, writer, epoch_i)
  File "src/train.py", line 130, in train_epoch
    loss.backward()
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

I noticed that you also added debug code in model.py to check whether NaN values appear in the probability tensor. Could you please share the exact reason you found for this error? Thank you very much!
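
For reference, the kind of check I mean looks roughly like the sketch below (a minimal, illustrative example; assert_no_nan and log_probs are my own names, not the actual code in model.py):

    import torch

    def assert_no_nan(tensor: torch.Tensor, name: str = "tensor") -> None:
        # Fail fast so the offending forward step is easy to locate.
        if torch.isnan(tensor).any():
            raise RuntimeError(f"NaN detected in {name}")

    # e.g. right after the caption log-probabilities are computed:
    logits = torch.randn(2, 5)
    log_probs = torch.log_softmax(logits, dim=-1)
    assert_no_nan(log_probs, "log_probs")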

robert1015 · Oct 11 '21 14:10

Hi, first of all, TransformerXL is not officially supported by this repo and is not well tested. That said, possible reasons for the NaN are:

  • Float16 training doesn't work well with the large gradients from the captioning loss. Check whether it runs with regular Float32.
  • Gradient clipping may have been disabled in the config (it is enabled by default).
  • Try restarting training several times with different seeds.
  • Other than that, you would have to debug backwards from the loss to the input (check whether the input to the loss has large absolute values) and find out exactly where the NaN first appears; see the sketch after this list.
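
To make the debugging and clipping suggestions concrete, here is a minimal, self-contained sketch of how this usually looks in PyTorch (the toy model, loss, and the 1e4 threshold are placeholders, not code from this repo):

    import torch
    import torch.nn as nn

    # Make autograd report the forward op that produced a NaN gradient,
    # instead of only the generic "LogSoftmaxBackward returned nan" message.
    torch.autograd.set_detect_anomaly(True)

    # Toy stand-ins for the real model and data, just to show the checks.
    model = nn.Linear(16, 8)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(4, 16)
        target = torch.randint(0, 8, (4,))

        logits = model(x)
        # Check the loss input before backward: NaN/inf or very large values
        # here usually explain a NaN in LogSoftmaxBackward later.
        if not torch.isfinite(logits).all() or logits.abs().max() > 1e4:
            print(f"step {step}: suspicious logits, max abs = {logits.abs().max().item():.1f}")

        loss = criterion(logits, target)
        optimizer.zero_grad()
        loss.backward()
        # Keep gradient clipping enabled, as suggested above.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()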

simon-ging · Oct 11 '21 16:10