MatchSum
Long documents
What is the maximum possible document length when using MatchSum?
It seems that when passing a document directly into a transformer such as BERT, it can't handle more than 512 or 1024 tokens (depending on the model size):
https://github.com/huggingface/transformers/issues/4332
What were some of the longest documents you handled when testing MatchSum?
Do you think it would make sense to train a https://github.com/allenai/longformer instead?
Thanks!
We truncate each document to 512 tokens before feeding it to MatchSum because the pre-trained models (BERT, RoBERTa) have a maximum input length limit. One possible solution is to train position embeddings beyond that length (as DiscoBERT does). Of course, if your data contains a large number of long documents, using Longformer may be a better choice.
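A minimal sketch of the truncation step described above, in plain Python (the real pipeline would use the BERT/RoBERTa tokenizer from the `transformers` library; `truncate_for_bert` is a hypothetical helper, not part of the MatchSum code):

```python
def truncate_for_bert(tokens, max_len=512):
    """Keep at most max_len - 2 document tokens, reserving two
    slots for the special [CLS] and [SEP] tokens BERT adds."""
    budget = max_len - 2
    return tokens[:budget]

# A long "document" of 2000 placeholder tokens gets cut to 510,
# so that [CLS] + 510 tokens + [SEP] fits the 512-token limit.
doc = ["tok%d" % i for i in range(2000)]
truncated = truncate_for_bert(doc)
print(len(truncated))  # 510
```

With the Hugging Face tokenizers you would get the same effect by passing `truncation=True, max_length=512` when encoding, rather than slicing token lists by hand.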
@timsuchanek, could you please guide me on integrating MatchSum with Longformer? Any documentation, notebook, etc. would help.
@ShoubhikBanerjee Have you found any way to summarize long documents?