transformer-xl
Loss jumping to 8.28 and not going down
Hello,
I'm trying to train an XL model on a sub-word corpus of the Finnish language, and I'm running into an issue where the loss drops to around 4 and then jumps back up to 8. Attached is a log from the restart point of my previous attempt (which I let run to completion); the loss still jumps to 8.28. I don't use the adaptive softmax, since my vocabulary is only 34K sub-words. My corpus has 20 million sentences, and I'm training on 2 Tesla V100s (32 GB each). Please let me know if there is a fix or a parameter issue (I based my parameters on lm1b_large, reducing the number of layers due to memory issues). log_xl.txt
I'm encountering the same problem. May I ask how you solved it?
The problem might be caused by overly aggressive gradient clipping, so I tried different hyperparameters until the loss was stable: a lower number of attention heads, smaller attention dimensions, and a longer warm-up. A larger effective batch size via gradient accumulation also brought my loss down considerably; a rough sketch of the accumulation part is below.
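For reference, here is a minimal sketch of what I mean by gradient accumulation combined with a clip value you can tune. This is generic PyTorch, not the code in this repo (the PyTorch train.py has a --batch_chunk flag for the same idea, if I remember correctly); the toy model, the random batches, and all of the numbers are placeholders for your real setup.

```python
import torch
import torch.nn as nn

# Toy stand-in for the language model; in a real run this would be the
# Transformer-XL model built by the training script.
vocab_size = 34000  # roughly the size of my sub-word vocabulary
model = nn.Sequential(
    nn.Embedding(vocab_size, 256),
    nn.Linear(256, vocab_size),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)

accum_steps = 4   # 4 micro-batches -> 4x larger effective batch, same memory
clip_norm = 0.25  # max gradient norm; raising it relaxes the clipping

optimizer.zero_grad()
for step in range(100):
    # Random tokens as a placeholder for the real data iterator.
    tokens = torch.randint(vocab_size, (8, 32))  # [batch, seq_len]
    logits = model(tokens)                       # [batch, seq_len, vocab]
    # Using the inputs as targets just so the example runs end to end.
    loss = criterion(logits.view(-1, vocab_size), tokens.view(-1))

    # Scale so the accumulated gradient averages over the micro-batches.
    (loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
        optimizer.zero_grad()
```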
Did you run into the situation where the perplexity decreases at first and then increases on the training set?
Yes, it increases to the point that training completely diverges and does not recover.
I reviewed all the issues. The author recommended trying larger warm-up steps, reducing the learning rate, or setting div_val to 1 (issue #73). I don't think I can increase the batch size (I can only set batch_size to 32) because of the GPU memory limitation. I'm now running this experiment on the 1-billion-word corpus. How did you set your hyperparameters? Could you please give me some advice?
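To make sure I'm interpreting that advice correctly, this is what I take "larger warm-up steps and a reduced learning rate" to mean, as a minimal generic PyTorch sketch (placeholder step counts and learning rate, not the repo's own scheduler code):

```python
import math
import torch

# Placeholder parameters so the snippet is self-contained; in a real run
# these would be model.parameters().
params = [torch.nn.Parameter(torch.zeros(10))]

peak_lr = 1e-4        # reduced peak learning rate (placeholder value)
warmup_steps = 16000  # longer warm-up (placeholder value)
max_steps = 400000

optimizer = torch.optim.Adam(params, lr=peak_lr)

def lr_lambda(step):
    # Linear warm-up to peak_lr, then cosine decay towards zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, call scheduler.step() once after each optimizer.step().
```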
Yes, from my experiments I can see that the Transformer-XL model is sensitive to the hyperparameters. How did you avoid this phenomenon?