transformer-xl
Is there a pre-trained model for Chinese?
Hello! Could you please provide hyperparameters for training models with close-to-SOTA perplexity on PTB and WT2 (if you experimented with the latter, as it has the corresponding choice...
Hi, can you provide sample code for fine-tuning Transformer-XL on a classification task (just like BERT)? Thanks!
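The repository does not appear to ship a classification fine-tuning example, but a head can be bolted onto the encoder. Below is a minimal sketch, assuming an encoder that returns `(hidden_states, new_mems)` with `hidden_states` shaped `[qlen, bsz, d_model]` (the layout used in `pytorch/mem_transformer.py`); the class name, pooling choice, and forward signature are illustrative, not part of the repository.

```python
import torch
import torch.nn as nn


class TransfoXLClassifier(nn.Module):
    """Hypothetical classification head on top of a Transformer-XL encoder.

    `encoder` is assumed to return (hidden_states, new_mems) with
    hidden_states shaped [qlen, bsz, d_model]. All names here are
    illustrative, not taken from the repository.
    """

    def __init__(self, encoder, d_model, num_classes, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.drop = nn.Dropout(dropout)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids, mems=None):
        hidden, new_mems = self.encoder(input_ids, mems)  # assumed signature
        pooled = hidden[-1]                 # last time step: [bsz, d_model]
        logits = self.classifier(self.drop(pooled))
        return logits, new_mems
```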
Hi, I am using the eval script and changing the `tgt_len` parameter. As intended, it changes the "number of tokens to predict" and even the way the data is pre-processed. However,...
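For context on why `tgt_len` touches both the prediction count and the data layout, here is a schematic sketch (not the repo's `LMOrderedIterator`) of an ordered evaluation iterator that walks a token stream in windows of `tgt_len`: a smaller `tgt_len` yields more, shorter prediction steps. The helper name is hypothetical.

```python
import torch

def ordered_chunks(tokens, tgt_len):
    """Yield (input, target) pairs of length <= tgt_len over a 1-D token tensor."""
    for i in range(0, tokens.size(0) - 1, tgt_len):
        seq_len = min(tgt_len, tokens.size(0) - 1 - i)
        yield tokens[i:i + seq_len], tokens[i + 1:i + 1 + seq_len]

tokens = torch.arange(20)
print(sum(1 for _ in ordered_chunks(tokens, tgt_len=8)))  # 3 prediction steps
print(sum(1 for _ in ordered_chunks(tokens, tgt_len=4)))  # 5 prediction steps
```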
Hi, thanks for your excellent work. Transformer-XL is currently the most elegant model for long sequences. Do you plan to fine-tune the pretrained models for document classification, just like BERT?
Hi authors, thanks for sharing this great code. I am wondering why the input index of the positional embedding, `pos_seq`, needs to be in descending order as [klen-1, ..., 1, 0],...
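A small, self-contained illustration of what the descending `pos_seq` encodes (assumed layout, not the repository's exact code): the relative positional table is indexed by distance, and listing distances from klen-1 down to 0 means the left-most key column, i.e. the oldest memory token, pairs with the largest distance from the newest query.

```python
import torch

# Illustration only: build the descending sequence of relative distances
# for a segment of length qlen attending over mlen memory tokens.
qlen, mlen = 4, 3
klen = qlen + mlen
pos_seq = torch.arange(klen - 1, -1, -1.0)  # [klen-1, ..., 1, 0]
print(pos_seq)  # tensor([6., 5., 4., 3., 2., 1., 0.])
```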
TRAIN_BSZ=64 is used in text8_large_tpu.sh. During training data preparation it is passed as `--per_host_train_bsz=${TRAIN_BSZ}`; during training it is passed as `--train_batch_size=${TRAIN_BSZ}`, and when calling data_utils.get_input_fn() it is used as `per_host_bsz=FLAGS.train_batch_size`...
Hi, thank you for open-sourcing this. After I finished running run_wt103_base.sh with 4 GPUs, my final result looked like this: `| Eval 50 at step 200000 | time:...`
In https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py#L105, I am wondering why only the `key` and `value` inputs are layer-normalized, while the `query` input is not. In other variants (such as RelMultiHeadAttn), the qkv computation is implemented by...
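For readers without the file open, here is a paraphrased sketch of the pattern the question points at: in `pre_lnorm` mode, the concatenated memory-plus-input that feeds the key/value projection goes through `layer_norm`, while the query projection reads the raw input. The class below is illustrative, not the verbatim `MultiHeadAttn`.

```python
import torch
import torch.nn as nn

# Illustrative paraphrase of the pre_lnorm pattern in question; not the
# repository's verbatim code.
class TinyAttnSketch(nn.Module):
    def __init__(self, d_model, n_head, d_head):
        super().__init__()
        self.q_net = nn.Linear(d_model, n_head * d_head, bias=False)
        self.kv_net = nn.Linear(d_model, 2 * n_head * d_head, bias=False)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, h, mems=None, pre_lnorm=True):
        # h: [qlen, bsz, d_model]; mems: [mlen, bsz, d_model] or None
        c = torch.cat([mems, h], dim=0) if mems is not None else h
        if pre_lnorm:
            c = self.layer_norm(c)      # key/value input is normalized ...
        head_q = self.q_net(h)          # ... but the query input is not
        head_k, head_v = self.kv_net(c).chunk(2, dim=-1)
        return head_q, head_k, head_v
```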
```
mlen = mems[0].size(0) if mems is not None else 0
klen = mlen + qlen
if self.same_length:
    all_ones = word_emb.new_ones(qlen, klen)
    mask_len = klen - self.mem_len
    if mask_len > 0:
        ...
```
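The snippet above is cut off; for orientation, here is a hedged sketch of how such a `same_length` mask can be completed: an upper-triangular part blocks future tokens and a lower-triangular part trims older memory so every query attends to the same number of keys. Variable names follow the snippet, but the continuation is assumed, not quoted from the repository.

```python
import torch

# Sketch of a same-length attention mask (True = masked position).
qlen, mlen, mem_len = 4, 6, 6
klen = mlen + qlen
all_ones = torch.ones(qlen, klen)
mask_len = klen - mem_len
mask_shift_len = qlen - mask_len if mask_len > 0 else qlen
dec_attn_mask = (torch.triu(all_ones, diagonal=1 + mlen)      # no future keys
                 + torch.tril(all_ones, diagonal=-mask_shift_len)  # trim old memory
                 ).bool()
print(dec_attn_mask.int())  # each row allows exactly mem_len keys
```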