
Summary Redundancy Error: " the the collection collection collection the the the seven hitler hitler hitler the the older older older the hitler hitler "

chandanrao007 opened this issue 4 years ago · 10 comments

Whenever I use the pre-trained CNN/DM BertExtAbs model (bertsumextabs_cnndm_final_model.zip, 1.8G) for abstractive summarization, I get redundancy in my summary: the sentence below is repeated 9-10 times.

" the the collection collection collection the the the seven hitler hitler hitler the the older older older the hitler hitler "

I am using the dev-branch code and the pre-trained models that are present in the master branch.

Also, when I use the pre-trained CNN/DM TransformerAbs model, the same sentences are repeated many times in the summary. What should I do?

chandanrao007 avatar Apr 04 '20 16:04 chandanrao007

I'm also facing this issue; all generated summaries are the same.

GaneshDoosa avatar Apr 05 '20 06:04 GaneshDoosa

I have the exact same issue. I am using a different dataset, but I only get sentences that are the same, no matter which checkpoint I check.

SebastianVeile avatar Apr 05 '20 14:04 SebastianVeile

How many steps did you train for, and what is your dataset size?

GaneshDoosa avatar Apr 05 '20 14:04 GaneshDoosa

> How many steps did you train for, and what is your dataset size?

I trained for 200,000 steps, and my dataset is about half the size of CNN/DM.

SebastianVeile avatar Apr 05 '20 15:04 SebastianVeile

The author answered this question in an earlier issue: https://github.com/nlpyang/PreSumm/issues/44. When training on only 1 GPU, changing accum_count to a number larger than 5 seems to fix my problem.
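For reference, a single-GPU run along those lines would look roughly like the README's BertAbs training command with -visible_gpus reduced to one GPU and -accum_count raised. The value 20 below is only illustrative, and BERT_DATA_PATH / MODEL_PATH are placeholders:

```
python train.py -task abs -mode train \
  -bert_data_path BERT_DATA_PATH -model_path MODEL_PATH \
  -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -dec_dropout 0.2 \
  -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 \
  -use_bert_emb true -use_interval true -max_pos 512 \
  -warmup_steps_bert 20000 -warmup_steps_dec 10000 -report_every 50 \
  -visible_gpus 0 -accum_count 20 \
  -log_file ../logs/abs_bert_cnndm
```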

SebastianVeile avatar Apr 17 '20 18:04 SebastianVeile

@chandanrao007 seems to be using the already-trained model, and so am I, but I still face this issue. What I know is that the models released by the author were trained on 4 GPUs. Has anyone verified that matching that effective batch size, either by accumulating gradients on fewer GPUs or by using more GPUs, fixes this repetition behavior?

elkd avatar Jun 08 '20 15:06 elkd

I have solved this error. I see a lot of open PRs with nobody even commenting on them. It is a one-line fix here: inside the for loop, add `if not x.strip(): continue` (you can put this in the else clause too).

Here, x is a line (paragraph) from the source text. When there are blank lines between paragraphs, an empty string is passed as a line and the model is forced to output a summarized sentence based on that empty string. The code above skips all empty lines.
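For anyone reading along, a minimal sketch of what that fix looks like (the function and variable names here are illustrative, not the exact ones in the linked file):

```python
# Illustrative sketch of the fix: skip blank lines when iterating over the
# raw source text, so the model is never asked to summarize an empty string.
def iter_source_lines(raw_text):
    for x in raw_text.split('\n'):   # x is one line (paragraph) of the source
        if not x.strip():            # blank / whitespace-only line
            continue                 # skip it instead of passing '' downstream
        yield x
```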

elkd avatar Jun 09 '20 22:06 elkd

The code line and problem concern a much earlier commit; has anyone already managed to solve this issue with the current version? Thanks in advance!

progsi avatar Dec 09 '20 10:12 progsi

> The author answered this question in an earlier issue: #44. When training on only 1 GPU, changing accum_count to a number larger than 5 seems to fix my problem.

Hi, could you please tell me which particular accum_count value you used?

oldaandozerskaya avatar Apr 08 '21 11:04 oldaandozerskaya

This seems to happen when you run the model on blank text (I'm using the pre-trained models), so perhaps it's a default output of some kind?

brandonrobertz avatar Dec 15 '21 05:12 brandonrobertz