
Loss jumping to 8.28 and not going down

Open JainAbhilash opened this issue 6 years ago • 6 comments

Hello,

I'm trying to train a Transformer-XL model on a sub-word corpus of Finnish. I'm facing an issue where the loss goes down to 4 and then jumps to 8. Attached is a log from the restart point of my previous attempt (which I let run to completion); the loss still jumps to 8.28. I don't use the adaptive softmax, as my vocabulary is only 34K sub-words. My corpus has 20 million sentences, and I'm training on two Tesla V100s (32 GB each). Please let me know if there is a fix or a parameter issue. I based my parameters on lm1b_large, reducing the number of layers due to memory constraints. log_xl.txt
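For reference, a run along these lines might look as follows. This is only a sketch: the flag names follow the PyTorch train.py in the kimiyoung/transformer-xl repository, every value is illustrative, and both should be checked against your copy of the script.

```shell
# Illustrative invocation only -- verify flag names and defaults against
# your copy of train.py before running.
python train.py \
    --lr 0.00025 \
    --warmup_step 16000 \
    --clip 0.25 \
    --div_val 1 \
    --batch_size 64 \
    --batch_chunk 4
```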

JainAbhilash avatar Jul 16 '19 07:07 JainAbhilash

I'm running into the same problem. May I ask how you solved it?

ChenzhengUESTC avatar Oct 05 '19 07:10 ChenzhengUESTC

The problem might be occurring because of aggressive gradient clipping, so I tried different hyperparameters until the loss was stable: a lower number of attention heads, smaller attention dimensions, and a longer warm-up. A larger batch size together with gradient accumulation brought my loss down considerably.
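The two mechanisms mentioned here can be sketched in a few lines of NumPy. The helper names `clip_by_global_norm` and `accumulate` are hypothetical; in a real run you would use the framework's built-ins (e.g. `torch.nn.utils.clip_grad_norm_` or `tf.clip_by_global_norm`).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    An aggressively small max_norm shrinks every update, which can stall
    or destabilize training -- the suspicion raised in this thread.
    """
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-6))
    return [g * scale for g in grads], global_norm

def accumulate(micro_grads):
    """Average gradients from several micro-batches before one update,
    simulating a larger effective batch size on limited GPU memory."""
    n = len(micro_grads)
    return [sum(step[i] for step in micro_grads) / n
            for i in range(len(micro_grads[0]))]
```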

JainAbhilash avatar Oct 07 '19 10:10 JainAbhilash


Did you run into the situation where the perplexity decreases at first but then increases on the training set?

cmathx avatar Oct 28 '19 02:10 cmathx


Yes, it increases such that it completely diverges and does not recover

JainAbhilash avatar Oct 28 '19 08:10 JainAbhilash


I reviewed all the issues. The author recommended trying a larger number of warm-up steps, reducing the learning rate, or setting div_val to 1 (issue #73). I don't think I can increase the batch size (I can only set batch_size to 32) because of GPU memory limitations. Now I'm running this experiment on the 1-billion-word corpus. How did you set the hyperparameters? Could you please give me some advice?
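For what it's worth, the learning-rate shape being recommended (linear warm-up followed by cosine decay, as the repo's training scripts use) can be sketched in plain Python. The numbers below (base LR 0.00025, 4,000 warm-up steps, 400,000 total steps) are illustrative defaults, not a prescription for this corpus.

```python
import math

def lr_schedule(step, base_lr=0.00025, warmup_step=4000,
                max_step=400000, min_ratio=0.004):
    """Linear warm-up to base_lr, then cosine decay down to
    min_ratio * base_lr at max_step. A longer warmup_step and/or a
    smaller base_lr are the knobs suggested in this thread."""
    if step < warmup_step:
        return base_lr * step / warmup_step
    progress = (step - warmup_step) / (max_step - warmup_step)
    eta_min = base_lr * min_ratio
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * progress))
```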

cmathx avatar Oct 28 '19 09:10 cmathx


Yes, from my experiments I can see that the Transformer-XL model is sensitive to hyperparameters. How did you avoid this problem?

cmathx avatar Oct 28 '19 09:10 cmathx