Why the max_seq_length = 512 for XLNet?
Hi,
Just a conceptual question: the paper mentions that XLNet derives parts of its architecture from Transformer-XL, which isn't limited to a fixed context, yet the hyperparameters section says the max length is 512.
Can you please help me better understand this?
Thanks!
I was having the same question. @zihangdai could you please help us with this?
Or maybe @kimiyoung?
Assuming you are familiar with Transformer-XL, max_seq_length is the length of each training segment within which you can back-propagate (the gradient does not flow into the cached memory in Transformer-XL).
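A minimal PyTorch-style sketch of that training loop (the model interface, `model(tokens, mems) -> (logits, new_mems)`, is an assumption for illustration, not the repo's actual API):

```python
def train_on_segments(model, segments, optimizer, loss_fn):
    """Transformer-XL-style segment recurrence (illustrative sketch).

    Each element of `segments` is a (tokens, targets) pair of at most
    max_seq_length tokens. Back-prop stays within the current segment;
    the memory carried over from earlier segments is detached, so no
    gradient ever crosses a segment boundary.
    """
    mems = None
    for tokens, targets in segments:
        logits, new_mems = model(tokens, mems)
        loss = loss_fn(logits, targets)
        optimizer.zero_grad()
        loss.backward()   # gradients flow only within this segment
        optimizer.step()
        # Reuse the hidden states as context, but cut the gradient path.
        mems = [m.detach() for m in new_mems]
```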
Then, why the value 512?

(1) Longer sequences require more pretraining time.
(2) Most of the tasks considered at the time did not really require handling long sequences: GLUE -> 128, SQuAD -> 512. RACE performance can be improved slightly if you also increase max_seq_length during finetuning.

Technically, you can increase the sequence length during finetuning if you want. But if it's too long, generalization may not be good, as such long sequences are not seen during pretraining.
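To connect this back to the original question: the 512 cap applies to each back-prop segment, not to the total visible context. A hypothetical sketch (same assumed `model(segment, mems)` interface as above) of how the recurrent memory lets segments of at most 512 tokens still see earlier context at inference time:

```python
def encode_long_document(model, token_segments):
    """Illustrative inference-time sketch, not the repo's API.

    Even with each segment capped at max_seq_length = 512, earlier
    context stays visible through the recurrent memory, which (unlike
    gradients) does cross segment boundaries.
    """
    mems = None
    outputs = []
    for segment in token_segments:        # each segment <= 512 tokens
        out, mems = model(segment, mems)  # mems carries prior context
        outputs.append(out)
    return outputs
```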
@zihangdai thank you for your fast reply!