xlnet
[Question]: Like BERT, XLNet also has a max_len of 512 tokens; what would be a good way to process longer text?
I want to compare docvecs obtained from BERT with XLNet. In the case of BERT I take the average of the last 4 layers to obtain a docvec, but the max sequence length is limited to 512 tokens. As I read, in XLNet the seq-len is also limited to 512 tokens. Is there an easier/more intuitive way to process longer text? Since XLNet considers all permutations, I can understand the memory overhead with large text. But is there a way? Any suggestions are welcome.
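For reference, here is a minimal sketch of the last-four-layers averaging described above. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (neither appears in this thread), with input truncated to BERT's 512-token limit:

```python
# Sketch only: average BERT's last 4 hidden layers into a single doc vector.
# Assumes the Hugging Face `transformers` library and bert-base-uncased.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def doc_vector(text):
    # Truncate to BERT's absolute-position limit of 512 tokens.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: embedding layer + 12 transformer layers, each [1, seq_len, 768]
    last_four = torch.stack(outputs.hidden_states[-4:])   # [4, 1, seq_len, 768]
    token_vecs = last_four.mean(dim=0).squeeze(0)          # [seq_len, 768]
    return token_vecs.mean(dim=0)                          # [768] document vector
```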
There is no memory overhead, because during inference there is no permutation. In fact, due to the use of relative positional encodings, you can increase the seqlen to be larger than the one used during training, which is not possible for BERT. For example, for RACE, we use a seqlen of 640 during finetuning.
Another way to do it is to use it similarly to Transformer-XL, where you can "unroll" the sequence: each time you process one segment and use the last segment as the memory.
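A sketch of that segment-recurrence idea, assuming the Hugging Face transformers port of XLNet (its `mems` argument and `use_mems` flag are that library's API, not this repo's), might look like:

```python
# Sketch only: process a long document segment by segment, Transformer-XL style,
# carrying the cached hidden states (`mems`) from one segment to the next.
# Assumes a recent Hugging Face `transformers` version and xlnet-base-cased.
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased", mem_len=512)
model.eval()

def doc_vector_with_memory(text, segment_len=512):
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    mems, segment_vecs = None, []
    with torch.no_grad():
        for start in range(0, len(ids), segment_len):
            segment = ids[start:start + segment_len].unsqueeze(0)
            out = model(input_ids=segment, mems=mems, use_mems=True)
            mems = out.mems                                        # memory for the next segment
            segment_vecs.append(out.last_hidden_state.mean(dim=1)) # mean-pool the segment
    return torch.cat(segment_vecs, dim=0).mean(dim=0)              # average over all segments
```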
@kimiyoung Thanks for the insight. Can you point me to any resource/code on how you increased seq_len for RACE, or can we directly increase it in the config (while running inference)? It would be a great help. Also, in your experience, which one would provide better doc vectors for large text (600-900 tokens), Transformer-XL or XLNet? Thanks for the help.
This is simple. Just increase the max_sequence_length to 640. I would suggest using XLNet as it's bidirectional.
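To illustrate the point about relative positional encodings (again using the Hugging Face transformers port rather than this repo's scripts, so treat the exact API as an assumption), a single XLNet forward pass will accept an input longer than the 512 tokens used in pretraining:

```python
# Sketch only: XLNet has no absolute position-embedding table, so a forward pass
# over e.g. 640 tokens works out of the box (BERT would fail past 512).
# Assumes the Hugging Face `transformers` library and xlnet-base-cased.
import torch
from transformers import XLNetModel

model = XLNetModel.from_pretrained("xlnet-base-cased")
model.eval()

input_ids = torch.randint(0, 32000, (1, 640))  # 640 dummy token ids
with torch.no_grad():
    out = model(input_ids=input_ids)
print(out.last_hidden_state.shape)  # torch.Size([1, 640, 768])
```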
Thanks! I am comparing doc vecs generated by BERT and Doc2Vec with XLNet and will share the results; you might be interested.
@kapilkd13 I am also interested in trying out those 2 options. Just to clarify, are you planning to extract contextual embeddings (CSL) from either BERT or XLNet and then train a doc2vec using the aforementioned embeddings?
If so, have you got interesting results? Do you recommend following that direction?
@katsou55 That is my second option. First I want to try averaging the last four layers' vector embeddings and consider that as the doc vector (I have truncated the text to the first 512 chars in the case of BERT). On BERT, the observation was that k-means clustering didn't perform that well compared to a normal docvec. But interestingly, if you look for the k nearest neighbours of an item, the BERT vectors made more sense, though these BERT vectors are not a true representation of the articles. Next, I am thinking of obtaining word vecs from BERT and using an RNN-based encoder to obtain docvecs. Training Doc2Vec from CSL is also an interesting proposition that I hadn't thought of; might try that first.
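A small sketch of the clustering / nearest-neighbour comparison mentioned above, assuming scikit-learn and a `doc_vecs` array holding one vector per document (placeholder data here):

```python
# Sketch only: compare doc vectors via k-means clusters and cosine nearest neighbours.
# `doc_vecs` stands in for real BERT / XLNet / Doc2Vec document vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

doc_vecs = np.random.rand(1000, 768)  # placeholder: (n_docs, dim)

kmeans = KMeans(n_clusters=20, random_state=0).fit(doc_vecs)
cluster_labels = kmeans.labels_  # one cluster id per document

nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(doc_vecs)
distances, indices = nn.kneighbors(doc_vecs[:1])  # 5 nearest neighbours of doc 0
```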
Has anyone done tests of how XLNet performs with progressively larger numbers of tokens? I'm working on a classification task where the median document length is around 1,700 tokens.
@Shane-Neeley Hey, did you end up with an alternative way to do the task?