
NaN error when training the TransformerXL model on the yc2 dataset

Open · robert1015 opened this issue 3 years ago · 1 comment

I encountered the following error when training the TransformerXL model on the yc2 dataset.

Traceback (most recent call last):
  File "src/train.py", line 635, in <module>
    main()
  File "src/train.py", line 631, in main
    train(model, train_loader, val_loader, device, opt)
  File "src/train.py", line 329, in train
    model, training_data, optimizer, ema, device, opt, writer, epoch_i)
  File "src/train.py", line 130, in train_epoch
    loss.backward()
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

I noticed that you also added debug code in model.py to check whether NaN values appear in the probability tensor. Could you please share the exact reason you found for this error? Thank you very much!
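
For reference, the kind of check I mean looks roughly like the sketch below (a minimal, illustrative example; assert_no_nan and log_probs are my own names, not the actual code in model.py):

    import torch

    def assert_no_nan(tensor: torch.Tensor, name: str = "tensor") -> None:
        # Fail fast so the offending forward step is easy to locate.
        if torch.isnan(tensor).any():
            raise RuntimeError(f"NaN detected in {name}")

    # e.g. right after the caption log-probabilities are computed:
    logits = torch.randn(2, 5)
    log_probs = torch.log_softmax(logits, dim=-1)
    assert_no_nan(log_probs, "log_probs")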

robert1015 · Oct 11 '21 14:10

Hi, first of all, TransformerXL is not officially supported by this repo and is not well tested. That said, possible reasons for the NaN are:

  • Float16 training doesn't work well with the large gradients from the captioning loss. Check whether it runs with regular Float32.
  • Gradient clipping may have been disabled in the config (it is enabled by default).
  • Try restarting training several times with different seeds.
  • Other than that, you would have to debug backwards from the loss to the input (check whether the input to the loss has large absolute values) and find out exactly where the NaN first appears; see the sketch after this list.
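
To make the debugging and clipping suggestions concrete, here is a minimal, self-contained sketch of how this usually looks in PyTorch (the toy model, loss, and the 1e4 threshold are placeholders, not code from this repo):

    import torch
    import torch.nn as nn

    # Make autograd report the forward op that produced a NaN gradient,
    # instead of only the generic "LogSoftmaxBackward returned nan" message.
    torch.autograd.set_detect_anomaly(True)

    # Toy stand-ins for the real model and data, just to show the checks.
    model = nn.Linear(16, 8)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(4, 16)
        target = torch.randint(0, 8, (4,))

        logits = model(x)
        # Check the loss input before backward: NaN/inf or very large values
        # here usually explain a NaN in LogSoftmaxBackward later.
        if not torch.isfinite(logits).all() or logits.abs().max() > 1e4:
            print(f"step {step}: suspicious logits, max abs = {logits.abs().max().item():.1f}")

        loss = criterion(logits, target)
        optimizer.zero_grad()
        loss.backward()
        # Keep gradient clipping enabled, as suggested above.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()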

simon-ging · Oct 11 '21 16:10