coot-videotext
NaN error when training the TransformerXL model on the yc2 dataset
I encountered this error when training the TransformerXL model on the yc2 dataset.
```
Traceback (most recent call last):
  File "src/train.py", line 635, in <module>
    main()
  File "src/train.py", line 631, in main
    train(model, train_loader, val_loader, device, opt)
  File "src/train.py", line 329, in train
    model, training_data, optimizer, ema, device, opt, writer, epoch_i)
  File "src/train.py", line 130, in train_epoch
    loss.backward()
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.
```
I noticed that you also added debug code in model.py to check whether NaN values appear in the probability tensor. Could you please share the exact reason you found for this error? Thank you very much!
Hi, first of all, TransformerXL is not officially supported by this repo and is not well tested. That being said, possible reasons for NaN values are:
- Float16 training doesn't work well with the large gradients from the captioning loss. Check whether it runs with regular Float32.
- Gradient clipping may have been disabled in the config (it is enabled by default).
- Try restarting training several times with different seeds.
- Other than that, you would have to debug backwards from the loss to the input (check whether the input to the loss has large absolute values) and find out exactly where the NaN first appears; see the sketch after this list.
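For reference, here is a minimal, generic PyTorch sketch of such a debugging step. This is not the repo's actual training loop; `model`, `loss_fn`, `batch`, `optimizer`, and `max_grad_norm` are placeholder names you would adapt to your setup:

```python
# Minimal sketch (not this repo's train.py): enable anomaly detection, check the
# loss input for non-finite values, and clip gradients. All names below
# (model, loss_fn, batch, optimizer) are placeholders.
import torch

# Report the backward op that first produces NaN/Inf (slow, debug-only).
torch.autograd.set_detect_anomaly(True)

def debug_train_step(model, loss_fn, batch, optimizer, max_grad_norm=1.0):
    logits = model(batch["input"])          # run the forward pass in Float32
    if not torch.isfinite(logits).all():    # inspect the input to the loss first
        print("non-finite logits, max abs value:", logits.abs().max().item())
    loss = loss_fn(logits, batch["target"])
    optimizer.zero_grad()
    loss.backward()                         # anomaly mode points at the failing op
    # Gradient clipping; make sure this is not disabled in the config.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

With anomaly detection enabled, the error should include the forward operation that produced the NaN instead of only pointing at `LogSoftmaxBackward`, which makes it much easier to locate the source.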