
Long documents

Open timsuchanek opened this issue 4 years ago • 3 comments

What is the maximum possible document length when using MatchSum? It seems that when passing a document directly into a transformer such as BERT, it can't handle more than 512 or 1024 tokens (depending on the model) https://github.com/huggingface/transformers/issues/4332

What were some of the longest documents you handled when testing MatchSum?

Do you think it would make sense to train a https://github.com/allenai/longformer instead?

Thanks!

timsuchanek avatar Jun 02 '20 11:06 timsuchanek

We truncate each document to 512 tokens before feeding it to MatchSum because the pre-trained models (BERT, RoBERTa) have a maximum length limit. One possible solution is to train position embeddings beyond that length (as DiscoBERT does); of course, if your data contains a large number of long documents, using Longformer may be a better choice.
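The truncation step described above can be sketched as follows. This is a minimal illustration, not MatchSum's actual preprocessing code: a real pipeline would use BERT's WordPiece subword tokenizer, whereas whitespace splitting here is only a stand-in, and the `reserve_special` parameter is an assumption to leave room for special tokens like [CLS] and [SEP].

```python
MAX_LEN = 512  # maximum sequence length accepted by BERT/RoBERTa


def truncate_tokens(tokens, max_len=MAX_LEN, reserve_special=2):
    """Keep only the first tokens so the sequence fits the model limit.

    reserve_special leaves room for special tokens ([CLS], [SEP])
    that the tokenizer adds around the document.
    """
    return tokens[: max_len - reserve_special]


# Illustrative document of 1000 "tokens" (whitespace split as a stand-in
# for a real subword tokenizer).
doc = "word " * 1000
tokens = doc.split()
truncated = truncate_tokens(tokens)
print(len(tokens), len(truncated))  # 1000 510
```

Anything beyond the first ~510 tokens is simply discarded, which is why this approach loses information on genuinely long documents and why Longformer-style models can help.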

maszhongming avatar Jun 03 '20 01:06 maszhongming

@timsuchanek, could you please guide me on integrating MatchSum with Longformer? Any documentation, notebooks, etc. would help.

ShoubhikBanerjee avatar Jun 13 '20 05:06 ShoubhikBanerjee

@ShoubhikBanerjee Have you found any way to summarize long documents?

tanmayag78 avatar Oct 28 '20 15:10 tanmayag78