transformer-xl
Is there a pre-trained model for Chinese?
Hello! Could you please provide hyperparameters for training models with close-to-SOTA perplexity on PTB and WT2 (if you experimented with the latter, as it has the corresponding choice...
Hi, can you provide sample code for fine-tuning Transformer-XL on a classification task (just like BERT)? Thanks!
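The repository does not appear to ship a classification fine-tuning example, but a head can be bolted onto the encoder. Below is a minimal sketch, assuming an encoder that returns `(hidden_states, new_mems)` with `hidden_states` shaped `[qlen, bsz, d_model]` (the layout used in `pytorch/mem_transformer.py`); the class name, pooling choice, and forward signature are illustrative, not part of the repository.

```python
import torch
import torch.nn as nn


class TransfoXLClassifier(nn.Module):
    """Hypothetical classification head on top of a Transformer-XL encoder.

    `encoder` is assumed to return (hidden_states, new_mems) with
    hidden_states shaped [qlen, bsz, d_model]. All names here are
    illustrative, not taken from the repository.
    """

    def __init__(self, encoder, d_model, num_classes, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.drop = nn.Dropout(dropout)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids, mems=None):
        hidden, new_mems = self.encoder(input_ids, mems)  # assumed signature
        pooled = hidden[-1]                 # last time step: [bsz, d_model]
        logits = self.classifier(self.drop(pooled))
        return logits, new_mems
```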
Hi, I am using the eval script and changing the `tgt_len` parameter. As intended, it changes the "number of tokens to predict" and even the way the data is pre-processed. However,...
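For context on why `tgt_len` touches both the prediction count and the data layout, here is a schematic sketch (not the repo's `LMOrderedIterator`) of an ordered evaluation iterator that walks a token stream in windows of `tgt_len`: a smaller `tgt_len` yields more, shorter prediction steps. The helper name is hypothetical.

```python
import torch

def ordered_chunks(tokens, tgt_len):
    """Yield (input, target) pairs of length <= tgt_len over a 1-D token tensor."""
    for i in range(0, tokens.size(0) - 1, tgt_len):
        seq_len = min(tgt_len, tokens.size(0) - 1 - i)
        yield tokens[i:i + seq_len], tokens[i + 1:i + 1 + seq_len]

tokens = torch.arange(20)
print(sum(1 for _ in ordered_chunks(tokens, tgt_len=8)))  # 3 prediction steps
print(sum(1 for _ in ordered_chunks(tokens, tgt_len=4)))  # 5 prediction steps
```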
Hi, thanks for your excellent work. Transformer-XL is currently the most elegant model for long sequences. Do you plan to fine-tune the pretrained models for document classification, just like BERT?
Hi authors, thanks for sharing this great code. I am wondering why the input index of the positional embedding, `pos_seq`, needs to be in descending order as [klen-1, ..., 1, 0],...
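A small, self-contained illustration of what the descending `pos_seq` encodes (assumed layout, not the repository's exact code): the relative positional table is indexed by distance, and listing distances from klen-1 down to 0 means the left-most key column, i.e. the oldest memory token, pairs with the largest distance from the newest query.

```python
import torch

# Illustration only: build the descending sequence of relative distances
# for a segment of length qlen attending over mlen memory tokens.
qlen, mlen = 4, 3
klen = qlen + mlen
pos_seq = torch.arange(klen - 1, -1, -1.0)  # [klen-1, ..., 1, 0]
print(pos_seq)  # tensor([6., 5., 4., 3., 2., 1., 0.])
```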
TRAIN_BSZ=64 is used in text8_large_tpu.sh. During training data preparation it is passed as `--per_host_train_bsz=${TRAIN_BSZ}`; during training it is passed as `--train_batch_size=${TRAIN_BSZ}`, and when calling data_utils.get_input_fn() it is used as `per_host_bsz=FLAGS.train_batch_size`...
Hi, thank you for open-sourcing this. After I finished running run_wt103_base.sh with 4 GPUs, my final result looked like this: `| Eval 50 at step 200000 | time:...`
In https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py#L105, I am wondering why only the `key` and `value` inputs are layer-normalized, while the `query` input is not. In other variants (such as RelMultiHeadAttn), the qkv computation is implemented by...
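For readers without the file open, here is a paraphrased sketch of the pattern the question points at: in `pre_lnorm` mode, the concatenated memory-plus-input that feeds the key/value projection goes through `layer_norm`, while the query projection reads the raw input. The class below is illustrative, not the verbatim `MultiHeadAttn`.

```python
import torch
import torch.nn as nn

# Illustrative paraphrase of the pre_lnorm pattern in question; not the
# repository's verbatim code.
class TinyAttnSketch(nn.Module):
    def __init__(self, d_model, n_head, d_head):
        super().__init__()
        self.q_net = nn.Linear(d_model, n_head * d_head, bias=False)
        self.kv_net = nn.Linear(d_model, 2 * n_head * d_head, bias=False)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, h, mems=None, pre_lnorm=True):
        # h: [qlen, bsz, d_model]; mems: [mlen, bsz, d_model] or None
        c = torch.cat([mems, h], dim=0) if mems is not None else h
        if pre_lnorm:
            c = self.layer_norm(c)      # key/value input is normalized ...
        head_q = self.q_net(h)          # ... but the query input is not
        head_k, head_v = self.kv_net(c).chunk(2, dim=-1)
        return head_q, head_k, head_v
```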
```
mlen = mems[0].size(0) if mems is not None else 0
klen = mlen + qlen
if self.same_length:
    all_ones = word_emb.new_ones(qlen, klen)
    mask_len = klen - self.mem_len
    if mask_len > 0:
        ...
```
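The snippet above is cut off; for orientation, here is a hedged sketch of how such a `same_length` mask can be completed: an upper-triangular part blocks future tokens and a lower-triangular part trims older memory so every query attends to the same number of keys. Variable names follow the snippet, but the continuation is assumed, not quoted from the repository.

```python
import torch

# Sketch of a same-length attention mask (True = masked position).
qlen, mlen, mem_len = 4, 6, 6
klen = mlen + qlen
all_ones = torch.ones(qlen, klen)
mask_len = klen - mem_len
mask_shift_len = qlen - mask_len if mask_len > 0 else qlen
dec_attn_mask = (torch.triu(all_ones, diagonal=1 + mlen)      # no future keys
                 + torch.tril(all_ones, diagonal=-mask_shift_len)  # trim old memory
                 ).bool()
print(dec_attn_mask.int())  # each row allows exactly mem_len keys
```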