transformer-xl
Loss jumping to 8.28 and not going down
Hello,
I'm trying to train an XL model on a sub-word corpus of the Finnish language, and I'm running into an issue where the loss drops to around 4 and then jumps back up to 8. Attached is a log from the restart point of my previous attempt (which I let run to completion); the loss still jumps to 8.28. I don't use the adaptive softmax, since my vocabulary is only 34K sub-words. My corpus has 20 million sentences, and I'm training on 2 Tesla V100s (32 GB each). Please let me know if there is a fix or a parameter issue (I based my parameters on lm1b_large, reducing the number of layers due to memory issues). log_xl.txt
I'm encountering the same problem. May I ask how you solved it?
The problem might be caused by overly aggressive gradient clipping, so I tried different hyperparameters until the loss was stable: a lower number of attention heads, smaller attention dimensions, and a longer warm-up. A larger effective batch size via gradient accumulation also brought my loss down considerably; a rough sketch of the accumulation part is below.
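For reference, here is a minimal sketch of what I mean by gradient accumulation combined with a clip value you can tune. This is generic PyTorch, not the code in this repo (the PyTorch train.py has a --batch_chunk flag for the same idea, if I remember correctly); the toy model, the random batches, and all of the numbers are placeholders for your real setup.

```python
import torch
import torch.nn as nn

# Toy stand-in for the language model; in a real run this would be the
# Transformer-XL model built by the training script.
vocab_size = 34000  # roughly the size of my sub-word vocabulary
model = nn.Sequential(
    nn.Embedding(vocab_size, 256),
    nn.Linear(256, vocab_size),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)

accum_steps = 4   # 4 micro-batches -> 4x larger effective batch, same memory
clip_norm = 0.25  # max gradient norm; raising it relaxes the clipping

optimizer.zero_grad()
for step in range(100):
    # Random tokens as a placeholder for the real data iterator.
    tokens = torch.randint(vocab_size, (8, 32))  # [batch, seq_len]
    logits = model(tokens)                       # [batch, seq_len, vocab]
    # Using the inputs as targets just so the example runs end to end.
    loss = criterion(logits.view(-1, vocab_size), tokens.view(-1))

    # Scale so the accumulated gradient averages over the micro-batches.
    (loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
        optimizer.zero_grad()
```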
Did you run into the situation where the perplexity decreases at first and then increases on the training set?
Yes, it increases to the point that training completely diverges and does not recover.
I reviewed all the issues. The author recommended trying larger warm-up steps, reducing the learning rate, or setting div_val to 1 (issue #73). I don't think I can increase the batch size (I can only set batch_size to 32) because of the GPU memory limitation. I'm now running this experiment on the 1-billion-word corpus. How did you set your hyperparameters? Could you please give me some advice?
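To make sure I'm interpreting that advice correctly, this is what I take "larger warm-up steps and a reduced learning rate" to mean, as a minimal generic PyTorch sketch (placeholder step counts and learning rate, not the repo's own scheduler code):

```python
import math
import torch

# Placeholder parameters so the snippet is self-contained; in a real run
# these would be model.parameters().
params = [torch.nn.Parameter(torch.zeros(10))]

peak_lr = 1e-4        # reduced peak learning rate (placeholder value)
warmup_steps = 16000  # longer warm-up (placeholder value)
max_steps = 400000

optimizer = torch.optim.Adam(params, lr=peak_lr)

def lr_lambda(step):
    # Linear warm-up to peak_lr, then cosine decay towards zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, call scheduler.step() once after each optimizer.step().
```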
Yes, from my experiments I can see that the Transformer-XL model is sensitive to the hyperparameters. How did you avoid this phenomenon?