
[Question]: Like BERT, XLNet also has a max_len of 512 tokens; what would be a good way to process longer text?

Open kapilkd13 opened this issue 5 years ago • 8 comments

I want to compare doc vectors obtained from BERT with XLNet. In the case of BERT I take the average of the last 4 layers to obtain a doc vector, but the max sequence length is limited to 512 tokens. From what I read, XLNet's sequence length is also limited to 512 tokens. Is there an easier/more intuitive way to process longer text? Since XLNet considers all permutations, I can understand the memory overhead with long text. But is there a way? Any suggestions are welcome.

kapilkd13 avatar Jun 28 '19 09:06 kapilkd13

There is no memory overhead, because during inference there is no permutation. In fact, due to the use of relative positional encodings, you can increase the sequence length beyond the one used during training, which is not possible for BERT. For example, for RACE we use a seqlen of 640 during finetuning. Another way is to use it like Transformer-XL, where you "unroll" the sequence: each time you process one segment and use the last segment as the memory.

kimiyoung avatar Jun 28 '19 18:06 kimiyoung
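A minimal sketch of the "unrolling" idea described above, not code from this repo: it assumes the Hugging Face `transformers` XLNetModel, which exposes a Transformer-XL-style `mems` cache. The model name, segment size, and helper name are illustrative.

```python
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased")
model.eval()

def encode_long_text(text, seg_len=512):
    """Process a long document segment by segment, carrying memory forward."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    mems, seg_outputs = None, []
    with torch.no_grad():
        for start in range(0, ids.size(0), seg_len):
            segment = ids[start:start + seg_len].unsqueeze(0)
            out = model(input_ids=segment, mems=mems, use_mems=True)
            mems = out.mems                       # memory from previous segments
            seg_outputs.append(out.last_hidden_state)
    # Concatenate per-segment hidden states back into one long sequence.
    return torch.cat(seg_outputs, dim=1)
```

The hidden states for later segments are conditioned on the cached memory of earlier segments, so the whole document is covered without ever feeding more than `seg_len` tokens at once.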

@kimiyoung Thanks for the insight. Can you point me to any resource/code on how you increased seq_len for RACE, or can we directly increase it in the config (during inference)? It would be a great help. Also, in your experience, which one would give better doc vectors for long text (600-900 tokens), Transformer-XL or XLNet? Thanks for the help.

kapilkd13 avatar Jul 01 '19 07:07 kapilkd13

This is simple. Just increase the max_sequence_length to 640. I would suggest using XLNet as it's bidirectional.

kimiyoung avatar Jul 01 '19 17:07 kimiyoung
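A hedged sketch of the suggestion above: because XLNet uses relative positional encodings, you can feed sequences longer than the 512 used in pretraining without changing the model. Shown here with the Hugging Face `transformers` API; the exact flag name in the original repo's run scripts may differ.

```python
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased")
model.eval()

long_document = "..."            # placeholder for a document of up to ~640 tokens
inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=640,              # longer than the 512 used during pretraining
    return_tensors="pt",
)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
```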

Thanks! I am comparing doc vectors generated by BERT and doc2vec with XLNet and will share the results. You might be interested.

kapilkd13 avatar Jul 02 '19 05:07 kapilkd13

@kapilkd13 I am also interested in trying out those 2 options. Just to clarify, are you planning to extract contextual embeddings (CSL) from either BERT or XLNet and then train a doc2vec using the aforementioned embeddings?

If so, have you got interesting results? Do you recommend following that direction?

katsou55 avatar Jul 09 '19 22:07 katsou55

@katsou55 That is my second option. First I want to try averaging the last four layers' embeddings and treating that as the doc vector (I have stripped the text to the first 512 chars in the case of BERT). With BERT, the observation was that k-means clustering didn't perform as well as plain doc2vec, but interestingly, if you look for the k nearest neighbours of an item, the BERT vectors made more sense. Still, these BERT vectors are not a true representation of the articles. Next, I am thinking of obtaining word vectors from BERT and using some RNN-based encoder to obtain doc vectors. Training doc2vec from the contextual embeddings is also an interesting proposition that I hadn't thought of; I might try that first.

kapilkd13 avatar Jul 12 '19 10:07 kapilkd13
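A rough sketch of the "average the last four layers" doc-vector approach mentioned above, assuming the Hugging Face `transformers` API. The model name, pooling choice, and truncation length are illustrative, not necessarily the poster's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "xlnet-base-cased"        # or "bert-base-uncased" for the BERT baseline
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()

def doc_vector(text, max_length=512):
    inputs = tokenizer(text, truncation=True, max_length=max_length,
                       return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states   # embeddings + one entry per layer
    # Average the last four layers, then mean-pool over tokens.
    last_four = torch.stack(hidden_states[-4:]).mean(dim=0)   # (1, seq_len, hidden)
    return last_four.mean(dim=1).squeeze(0)                   # (hidden,)
```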

Has anyone done tests of how XLNet performs with progressively larger numbers of tokens? I'm working on a classification task where the median document length is around 1,700 tokens.

Shane-Neeley avatar Jul 15 '20 18:07 Shane-Neeley

@Shane-Neeley Hey, did you end up finding an alternative way to do the task?

bloodteller123 avatar Feb 21 '21 18:02 bloodteller123