
RoBERTa with long text instances

Open miwieg opened this issue 2 years ago • 8 comments

I am using FLAIR for text classification ("TextClassifier") with RoBERTa. My dataset contains only about 2000 instances, but the text instances themselves are fairly long, i.e. longer than 512 tokens. I understand that, in principle, such transformers cannot process text instances of this length, so I am wondering how FLAIR handles this. (My code using FLAIR runs and produces reasonable results.) Does FLAIR cut off the text after 512 tokens, or does it pursue a sliding-window approach?

Thank you very much.

miwieg avatar Jun 29 '22 10:06 miwieg

Hi @miwieg, TransformerEmbeddings provide a parameter allow_long_sentences. If that parameter is set to True, long texts are split into overlapping windows and the token embeddings are computed over those windows. (E.g. "This is a very very very long sentence", with a maximum window length of 6 tokens, would be split into "This is a very very very" and "very very long sentence", and both windows get embedded.)

For TextClassification, you can use this by setting the cls_pooling parameter to either max or mean, which gathers the context of all tokens. Note that the default cls won't be sufficient, as it only uses the first sub-sentence.
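A minimal sketch of this setup (assuming a flair version that supports both parameters; `roberta-base` is a placeholder model name):

```python
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# allow_long_sentences splits texts beyond the 512-subtoken limit into
# overlapping windows; cls_pooling="mean" pools over all tokens instead of
# using only the first window's CLS token.
document_embeddings = TransformerDocumentEmbeddings(
    "roberta-base",             # placeholder model name
    allow_long_sentences=True,
    cls_pooling="mean",
)

sentence = Sentence("a very long text " * 200)  # well beyond 512 subtokens
document_embeddings.embed(sentence)
print(sentence.embedding.shape)  # one fixed-size document vector
```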

helpmefindaname avatar Jun 29 '22 10:06 helpmefindaname

Thank you very much for your reply.

What is the default setting of TextClassifier? Does it simply strip off any tokens following the 512th token? My instances actually comprise more than one sentence, so I guess cls is not a good choice?

miwieg avatar Jun 29 '22 10:06 miwieg

Just to clarify whether I understood you correctly:

If I follow the typical text classification example, i.e. "Training a Text Classification Model" in: https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md

Do I just have to add the following lines after initializing the TransformerDocumentEmbeddings object:

 document_embeddings.allow_long_sentences = True
 document_embeddings.cls_pooling = "mean"

It would also be good to know how long texts are processed if everything is left at the defaults.

Thank you.

miwieg avatar Jul 02 '22 13:07 miwieg

I don't know if that works; I would rather pass it to the constructor: `document_embeddings = TransformerDocumentEmbeddings(..., allow_long_sentences=True, cls_pooling="mean")`
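Following the tutorial linked above, the full setup would then look roughly like this (a sketch using the tutorial's TREC_6 corpus as a stand-in; hyperparameters omitted):

```python
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = TREC_6()
label_type = "question_class"
label_dict = corpus.make_label_dictionary(label_type=label_type)

# pass both parameters to the constructor instead of setting attributes afterwards
document_embeddings = TransformerDocumentEmbeddings(
    "roberta-base",
    fine_tune=True,
    allow_long_sentences=True,
    cls_pooling="mean",
)

classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)
trainer = ModelTrainer(classifier, corpus)
trainer.fine_tune("resources/taggers/long-text-classifier")
```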

helpmefindaname avatar Jul 03 '22 00:07 helpmefindaname

This is what I originally tried, but the constructor did not accept these parameters. So I ran the code as I suggested above. It ran on my data without any error messages. Can I conclude from that that the pooling was applied as requested?

miwieg avatar Jul 03 '22 06:07 miwieg

You'll never receive error messages from setting attributes like that, whether they existed before or not, so the absence of errors doesn't tell you anything.
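For illustration, this is plain Python behavior, not anything specific to flair: assigning to an attribute that no class defines silently creates a new, unused attribute:

```python
class Embeddings:
    def __init__(self):
        self.allow_long_sentences = False

emb = Embeddings()
emb.allow_long_sentence = True  # typo / unknown attribute: no error is raised,
                                # the value is stored but never read by the class
```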

Are you sure that you are on the latest version (flair==0.11.4)? If not, you need to update.

helpmefindaname avatar Jul 03 '22 13:07 helpmefindaname

Thank you for the hint about the version number. I've updated it. However, the newest version that seems to be available is 0.11.3. Is 0.11.3 already outdated, or is it fine?

miwieg avatar Jul 04 '22 12:07 miwieg

Yes, sorry, I meant 0.11.3.

helpmefindaname avatar Jul 04 '22 22:07 helpmefindaname

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 13 '22 08:11 stale[bot]