transformer-xl
Training with WordPiece/BPE vocab
I am trying to train with a fixed vocab (10k BPE symbols). I also tried with an auto-generated BPE vocab, but the model doesn't converge. Are there any other considerations to take care of? Initially there was an issue with the cutoffs; I set cutoffs=[], but I am still facing the convergence issue.
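For reference, a minimal sketch of the single-cluster setup implied by cutoffs=[] for a small vocab. The names cutoffs, div_val, and tie_projs follow the repo's PyTorch train.py; the values shown are assumptions for a 10k vocab, not verified repo defaults:

```python
# Sketch: single-cluster (non-adaptive) softmax setup for a 10k BPE vocab.
vocab_size = 10000   # fixed 10k BPE vocab

cutoffs = []         # no cluster boundaries -> plain (non-adaptive) softmax
div_val = 1          # every cluster keeps the full embedding dimension
tie_projs = [False]  # single remaining cluster -> one projection flag (assumed)
```

With an empty cutoffs list there is no head/tail split at all, so the adaptive softmax degenerates to an ordinary full softmax over the 10k symbols.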
This seems to be an issue of hyper-parameter tuning. Try using more warm-up steps, reducing the learning rate, or setting div_val to 1.
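As a concrete illustration of that suggestion, here is a sketch of a linear warm-up followed by cosine decay, the schedule style this codebase uses. The specific numbers (warmup_step, peak_lr, max_step) are placeholders to tune, not recommended values:

```python
import math

def lr_at(step, peak_lr=1.25e-4, warmup_step=8000, max_step=200000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr.
    All numbers are illustrative placeholders, not tuned values."""
    if step < warmup_step:
        # warm-up: scale the learning rate linearly from 0 to peak_lr
        return peak_lr * step / warmup_step
    # cosine decay over the remaining steps
    progress = (step - warmup_step) / (max_step - warmup_step)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

At each training step you would then set optimizer.param_groups[i]['lr'] = lr_at(step) before calling optimizer.step(). Increasing warmup_step and lowering peak_lr are the two knobs suggested above.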
I tried div_val=1 and a smaller learning rate, and the model is converging. But it seems to be overfitting: the train pplx is ~23 after 45k steps, while the eval pplx is around 210.