Zhilin Yang

39 comments by Zhilin Yang

This is simple. Just increase the `max_sequence_length` to 640. I would suggest using XLNet as it's bidirectional.
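In case it helps, here is a minimal sketch of why that limit matters; `pad_or_truncate`, the token IDs, and the pad value are made up for illustration and are not the repo's code:

```python
# Hypothetical helper: anything longer than max_sequence_length is cut off,
# so raising the limit (e.g. from 512 to 640) keeps longer inputs intact.
def pad_or_truncate(token_ids, max_sequence_length, pad_id=0):
    if len(token_ids) > max_sequence_length:
        return token_ids[:max_sequence_length]   # tail of the input is lost
    return token_ids + [pad_id] * (max_sequence_length - len(token_ids))

example = list(range(600))                  # a 600-token example (dummy IDs)
print(len(pad_or_truncate(example, 512)))   # 512 -> truncated
print(len(pad_or_truncate(example, 640)))   # 640 -> the whole example fits
```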

It's equivalent but could be faster.

XLNet-Large has the same number of parameters as BERT-Large, while XLNet-Base has the same number of parameters as BERT-Base. I haven't looked at your code, though.

Nice results! Just added a [pointer](https://github.com/zihangdai/xlnet/commit/b4e33739b7df17af6f37a89af9a769a987711587) to your repo.

@genggui001 That would take 85x more machines, which makes it almost impossible to train. Also, given 85x more machines, simply scaling up XLNet will probably be better due to better data...

How about `parser.add_argument('--nocuda', action='store_false', dest='cuda', default=True)`?
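For anyone reading along, a quick self-contained check of how that flag behaves (the snippet below is just an illustration):

```python
import argparse

parser = argparse.ArgumentParser()
# store_false: passing the flag sets dest to False; omitting it leaves the default True.
parser.add_argument('--nocuda', action='store_false', dest='cuda', default=True)

print(parser.parse_args([]).cuda)            # True  (CUDA enabled by default)
print(parser.parse_args(['--nocuda']).cuda)  # False (explicitly disabled)
```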

@wlhgtc
- `def _rel_shift` does correspond to Appendix B (a sketch of the idea follows below).
- You are right. `def _shift` is unused.
- Relative positional encodings are defined on word pairs rather than a single...
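For reference, a minimal NumPy sketch of the relative-shift idea behind `_rel_shift`; this is an illustration on a 2D score matrix, not the repo's exact code, which operates on batched multi-head attention tensors:

```python
import numpy as np

def rel_shift(x):
    """Re-index scores computed against relative position embeddings so that
    entry (i, j) lines up with the relative distance between query i and key j."""
    qlen, klen = x.shape
    zero_pad = np.zeros((qlen, 1), dtype=x.dtype)
    x_padded = np.concatenate([zero_pad, x], axis=1)  # [qlen, klen + 1]
    x_padded = x_padded.reshape(klen + 1, qlen)       # "shift" via reshape
    # Drop the padded row and reshape back; entries shifted past the boundary
    # become junk and are masked out in the actual attention computation.
    return x_padded[1:].reshape(qlen, klen)

scores = np.arange(12, dtype=np.float32).reshape(3, 4)
print(rel_shift(scores))
```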

We decided not to include our PTB code in this repo because we believe PTB, being super small, is mainly a regularization game and is somewhat misleading for the development...

Could you please provide more information about how you ran the code? If you use PyTorch 0.4.1 and run it as is, it should be ok.

I think it's almost impossible to get competitive results with one GPU, unfortunately.