transformer-xl
Penn Treebank and WikiText-2 architectures
Hello!
Could you please provide hyperparameters for training models with close-to-SOTA perplexity on PTB and WT2 (if you experimented with the latter, since it has a corresponding option in the data utils)? Am I right that the two changes I need to make to the released code are adding variational dropout and the ASGD optimizer? If you have code that implements the necessary changes, that would be great.
Thanks
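For reference, here is a minimal sketch of variational dropout (one dropout mask shared across all time steps, as in Gal & Ghahramani / AWD-LSTM), assuming the PyTorch version of the codebase. The module name `VariationalDropout` and the tensor layout `(seq_len, batch, hidden)` are assumptions, not part of the released code:

```python
import torch
import torch.nn as nn

class VariationalDropout(nn.Module):
    """Dropout that samples one mask per (batch, hidden) slice and
    reuses it at every time step, instead of resampling per step."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x is assumed to be (seq_len, batch, hidden)
        if not self.training or self.p == 0.0:
            return x
        # One mask broadcast over the time dimension
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)
```

For the optimizer, PyTorch ships `torch.optim.ASGD(model.parameters(), lr=..., t0=0, lambd=0.0)`; the AWD-LSTM recipe switches from SGD to ASGD when validation perplexity stops improving (NT-ASGD), but that switching logic is not shown here.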
Did you find hyperparameters for PTB? I only reached 68 test perplexity without variational dropout and weight averaging, though with only 14M parameters.