DebateSum

Training the transformer

JonasxRie opened this issue on Aug 12, 2021 · 2 comments

Hi @Hellisotherpeople,

as you wrote in your paper, you trained several transformer models, including BERT-large and Longformer-base. You also mentioned using the simple-transformers library. Could you share a short code snippet showing how you trained the model for extractive summarization, please?

Thanks in advance!

JonasxRie · Aug 12 '21

@JonasxRie

Yes, I promise I will get to doing this (let's hope before Christmas!). The main difficulty is converting the dataset from its current form into a token-classification-style format. I actually lost the script for this in a recent losing battle with my local Manjaro install, which has since been formatted, so I will have to rewrite it. That is not difficult, just a bit tedious.
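In the meantime, here is a minimal sketch of that conversion, assuming the dataset is a CSV with `Full-Document` and `Extract` columns (the filename, those column names, and the greedy word-matching heuristic are all illustrative assumptions on my part, not the lost script):

```python
# Sketch: turn each (document, extract) pair into word-level binary labels,
# i.e. a token classification dataset where 1 = "highlighted" word.
import pandas as pd

def label_tokens(document: str, extract: str):
    """Greedily mark each document word 1 if it is consumed, in order,
    by the extract, else 0 (a highlighter-style labeling)."""
    doc_words = document.split()
    ext_words = extract.split()
    labels = []
    j = 0  # current position in the extract
    for word in doc_words:
        if j < len(ext_words) and word == ext_words[j]:
            labels.append(1)
            j += 1
        else:
            labels.append(0)
    return doc_words, labels

df = pd.read_csv("debatesum.csv")  # hypothetical filename
examples = []
for _, row in df.iterrows():
    words, labels = label_tokens(row["Full-Document"], row["Extract"])
    examples.append({"words": words, "labels": labels})
```

This only works cleanly when the extract really is an in-order subsequence of the document; anything fuzzier (re-tokenization, punctuation mismatches) needs a more forgiving alignment.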

You can currently try word-level extractive summarization by formatting it as a sequence-to-sequence task, but my experience is that none of these pre-trained language models deduce that the output sequence must keep the original words in the original order, or else it stops being "extractive" in the sense that I and competitive debaters are looking for (like an actual highlighter). I've tried putting tags around the labels, but the models are still too stupid to figure it out. I would honestly love help from anyone in the community who might have insights on how to fix this issue.
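For what it's worth, once the data is in the token classification format described above, the simpletransformers training call that the original question asked about is short. A hedged sketch follows; the model choice, label names, and hyperparameters are illustrative, not the exact settings from the paper:

```python
# Sketch: word-level extractive summarization as token classification
# with simpletransformers' NERModel. Toy data shown; in practice the
# word/label pairs come from the conversion step sketched above.
import pandas as pd
from simpletransformers.ner import NERModel, NERArgs

# simpletransformers expects one word per row, grouped by sentence_id.
train_df = pd.DataFrame({
    "sentence_id": [0, 0, 0, 0, 0],
    "words": ["Rising", "debt", "guarantees", "economic", "collapse"],
    "labels": ["O", "HIGHLIGHT", "O", "HIGHLIGHT", "HIGHLIGHT"],
})

args = NERArgs()
args.max_seq_length = 512   # a Longformer variant allows much longer inputs
args.num_train_epochs = 1

model = NERModel(
    "bert", "bert-large-cased",
    labels=["O", "HIGHLIGHT"],
    args=args,
    use_cuda=False,  # flip to True on a GPU machine
)
model.train_model(train_df)
```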

As a side note.

I do warn anyone trying to play the benchmark-chasing game that the original evaluation done in the paper is hilariously bad, because the default settings in pyrouge only look at the first 100 tokens of the summary. That limit seems to exist for performance reasons: when I realized this (after the paper was published) and tried running the evaluation with no limit, it pretty much always eventually crashed.

I think that ROUGE is now built in as an evaluator in many frameworks (such as Hugging Face's), which should solve this problem.
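For example, a minimal sketch with the Hugging Face `evaluate` wrapper, which applies no 100-token truncation by default (the strings here are made-up stand-ins for model extracts and gold extracts):

```python
# Sketch: ROUGE without pyrouge's default length limit.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["rising debt guarantees economic collapse"]        # model output
references = ["economic collapse is guaranteed by rising debt"]   # gold extract
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum F-measures
```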

As such, I want it on the record that future authors should discount my reported benchmark numbers and link to my GitHub comment here to explain why. I would prefer that people not play the benchmark-chasing game (at least with ROUGE), because summarization is an inherently subjective task. The space of potential "good" summarizations explodes when you do it at the word level on longish documents, which is exactly what this dataset consists of.

Proper evaluation would almost certainly report significantly different results from those found in the paper. Future authors should instead re-evaluate the models I reported scores for properly, and note the error made in this paper.

Hellisotherpeople · Dec 08 '21

And I will also make an effort to get trained models/weights posted and hosted on huggingface - but this may take some time.

Hellisotherpeople · Dec 08 '21