xlnet
[Question]: Like BERT, XLNet also has a max_len of 512 tokens; what would be a good way to process longer text?
I want to compare docvecs obtained from BERT with XLNet. In the case of BERT I take the average of the last 4 layers to obtain a docvec, but the max sequence length is limited to 512 tokens. As I read, in XLNet the seq-len is also limited to 512 tokens. Is there an easier/more intuitive way to process longer text? Since XLNet considers all permutations, I can understand the memory overhead with large text. But is there a way? Any suggestions are welcome.
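For reference, here is a minimal sketch of the last-four-layers averaging described above. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (neither appears in this thread), with input truncated to BERT's 512-token limit:

```python
# Sketch only: average BERT's last 4 hidden layers into a single doc vector.
# Assumes the Hugging Face `transformers` library and bert-base-uncased.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def doc_vector(text):
    # Truncate to BERT's absolute-position limit of 512 tokens.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: embedding layer + 12 transformer layers, each [1, seq_len, 768]
    last_four = torch.stack(outputs.hidden_states[-4:])   # [4, 1, seq_len, 768]
    token_vecs = last_four.mean(dim=0).squeeze(0)          # [seq_len, 768]
    return token_vecs.mean(dim=0)                          # [768] document vector
```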
There is no memory overhead, because during inference there is no permutation. In fact, due to the use of relative positional encodings, you can increase the seqlen to be larger than the one used during training, which is not possible for BERT. For example, for RACE, we use a seqlen of 640 during finetuning.
Another way to do it is to use it similarly to Transformer-XL, where you can "unroll" the sequence: each time you process one segment and use the last segment as the memory.
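A sketch of that segment-recurrence idea, assuming the Hugging Face transformers port of XLNet (its `mems` argument and `use_mems` flag are that library's API, not this repo's), might look like:

```python
# Sketch only: process a long document segment by segment, Transformer-XL style,
# carrying the cached hidden states (`mems`) from one segment to the next.
# Assumes a recent Hugging Face `transformers` version and xlnet-base-cased.
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased", mem_len=512)
model.eval()

def doc_vector_with_memory(text, segment_len=512):
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    mems, segment_vecs = None, []
    with torch.no_grad():
        for start in range(0, len(ids), segment_len):
            segment = ids[start:start + segment_len].unsqueeze(0)
            out = model(input_ids=segment, mems=mems, use_mems=True)
            mems = out.mems                                        # memory for the next segment
            segment_vecs.append(out.last_hidden_state.mean(dim=1)) # mean-pool the segment
    return torch.cat(segment_vecs, dim=0).mean(dim=0)              # average over all segments
```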
@kimiyoung Thanks for the insight. Can you point me to any resource/code on how you increased seq_len for RACE, or can we directly increase it in the config (while running inference)? It would be a great help. Also, in your experience, which one would provide better doc vectors for large text (600-900 tokens), Transformer-XL or XLNet? Thanks for the help.
This is simple. Just increase the max_sequence_length to 640. I would suggest using XLNet as it's bidirectional.
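To illustrate the point about relative positional encodings (again using the Hugging Face transformers port rather than this repo's scripts, so treat the exact API as an assumption), a single XLNet forward pass will accept an input longer than the 512 tokens used in pretraining:

```python
# Sketch only: XLNet has no absolute position-embedding table, so a forward pass
# over e.g. 640 tokens works out of the box (BERT would fail past 512).
# Assumes the Hugging Face `transformers` library and xlnet-base-cased.
import torch
from transformers import XLNetModel

model = XLNetModel.from_pretrained("xlnet-base-cased")
model.eval()

input_ids = torch.randint(0, 32000, (1, 640))  # 640 dummy token ids
with torch.no_grad():
    out = model(input_ids=input_ids)
print(out.last_hidden_state.shape)  # torch.Size([1, 640, 768])
```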
Thanks! I am comparing doc vecs generated by BERT and Doc2Vec with XLNet and will share the results; you might be interested.
@kapilkd13 I am also interested in trying out those 2 options. Just to clarify, are you planning to extract contextual embeddings (CSL) from either BERT or XLNet and then train a doc2vec using the aforementioned embeddings?
If so, have you got interesting results? Do you recommend following that direction?
@katsou55 That is my second option. First I want to try averaging the last four layers' vector embeddings and consider that as the doc vector (I have truncated the text to the first 512 chars in the case of BERT). On BERT, the observation was that k-means clustering didn't perform that well compared to a normal docvec. But interestingly, if you look for the k nearest neighbours of an item, the BERT vectors made more sense, though these BERT vectors are not a true representation of the articles. Next, I am thinking of obtaining word vecs from BERT and using an RNN-based encoder to obtain docvecs. Training Doc2Vec from CSL is also an interesting proposition that I hadn't thought of; might try that first.
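A small sketch of the clustering / nearest-neighbour comparison mentioned above, assuming scikit-learn and a `doc_vecs` array holding one vector per document (placeholder data here):

```python
# Sketch only: compare doc vectors via k-means clusters and cosine nearest neighbours.
# `doc_vecs` stands in for real BERT / XLNet / Doc2Vec document vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

doc_vecs = np.random.rand(1000, 768)  # placeholder: (n_docs, dim)

kmeans = KMeans(n_clusters=20, random_state=0).fit(doc_vecs)
cluster_labels = kmeans.labels_  # one cluster id per document

nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(doc_vecs)
distances, indices = nn.kneighbors(doc_vecs[:1])  # 5 nearest neighbours of doc 0
```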
Has anyone done tests of how XLNet performs with progressively larger numbers of tokens? I'm working on a classification task where the median document length is around 1,700 tokens.
@Shane-Neeley Hey, did you end up with an alternative way to do the task?