Zhilin Yang

39 comments by Zhilin Yang

This is simple. Just increase the `max_sequence_length` to 640. I would suggest using XLNet as it's bidirectional.
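In case it helps, here is a minimal sketch of why that limit matters; `pad_or_truncate`, the token IDs, and the pad value are made up for illustration and are not the repo's code:

```python
# Hypothetical helper: anything longer than max_sequence_length is cut off,
# so raising the limit (e.g. from 512 to 640) keeps longer inputs intact.
def pad_or_truncate(token_ids, max_sequence_length, pad_id=0):
    if len(token_ids) > max_sequence_length:
        return token_ids[:max_sequence_length]   # tail of the input is lost
    return token_ids + [pad_id] * (max_sequence_length - len(token_ids))

example = list(range(600))                  # a 600-token example (dummy IDs)
print(len(pad_or_truncate(example, 512)))   # 512 -> truncated
print(len(pad_or_truncate(example, 640)))   # 640 -> the whole example fits
```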

It's equivalent but could be faster.

XLNet-Large has the same number of parameters as BERT-Large, while XLNet-Base has the same number of parameters as BERT-Base. I haven't looked at your code, though.

Nice results! Just added a [pointer](https://github.com/zihangdai/xlnet/commit/b4e33739b7df17af6f37a89af9a769a987711587) to your repo.

@genggui001 That would take 85x more machines, which makes it almost impossible to train. Also, given 85x more machines, simply scaling up XLNet will probably be better due to better data...

How about `parser.add_argument('--nocuda', action='store_false', dest='cuda', default=True)`?
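For anyone reading along, a quick self-contained check of how that flag behaves (the snippet below is just an illustration):

```python
import argparse

parser = argparse.ArgumentParser()
# store_false: passing the flag sets dest to False; omitting it leaves the default True.
parser.add_argument('--nocuda', action='store_false', dest='cuda', default=True)

print(parser.parse_args([]).cuda)            # True  (CUDA enabled by default)
print(parser.parse_args(['--nocuda']).cuda)  # False (explicitly disabled)
```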

@wlhgtc
- `def _rel_shift` does correspond to Appendix B (a sketch of the idea follows below).
- You are right. `def _shift` is unused.
- Relative positional encodings are defined on word pairs rather than a single...
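For reference, a minimal NumPy sketch of the relative-shift idea behind `_rel_shift`; this is an illustration on a 2D score matrix, not the repo's exact code, which operates on batched multi-head attention tensors:

```python
import numpy as np

def rel_shift(x):
    """Re-index scores computed against relative position embeddings so that
    entry (i, j) lines up with the relative distance between query i and key j."""
    qlen, klen = x.shape
    zero_pad = np.zeros((qlen, 1), dtype=x.dtype)
    x_padded = np.concatenate([zero_pad, x], axis=1)  # [qlen, klen + 1]
    x_padded = x_padded.reshape(klen + 1, qlen)       # "shift" via reshape
    # Drop the padded row and reshape back; entries shifted past the boundary
    # become junk and are masked out in the actual attention computation.
    return x_padded[1:].reshape(qlen, klen)

scores = np.arange(12, dtype=np.float32).reshape(3, 4)
print(rel_shift(scores))
```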

We decided not to include our PTB code in this repo because we believe PTB, being super small, is mainly a regularization game and is somewhat misleading for the development...

Could you please provide more information about how you ran the code? If you use PyTorch 0.4.1 and run it as is, it should be ok.

I think it's almost impossible to get competitive results with one GPU, unfortunately.