PreSumm
Summary Redundancy Error: " the the collection collection collection the the the seven hitler hitler hitler the the older older older the hitler hitler "
Whenever I use the pre-trained CNN/DM BertExtAbs model (bertsumextabs_cnndm_final_model.zip, 1.8G) for abstractive summarization, I get redundancy in my summary: the sentence below is repeated 9-10 times.
" the the collection collection collection the the the seven hitler hitler hitler the the older older older the hitler hitler "
I am using the dev-branch code and the pre-trained models that are present in the master branch.
Also, when I use the pre-trained CNN/DM TransformerAbs model, the same sentences are repeated many times in the summary. What should I do?
I'm also facing this issue; all generated summaries are the same.
I have the exact same issue. I am using a different dataset, but the generated sentences are all the same, no matter which checkpoint I check.
How many steps did you train for, and what is your dataset size?
I trained for 200,000 steps, and my dataset is about half the size of CNN/DM.
The author answered this question in an earlier issue: https://github.com/nlpyang/PreSumm/issues/44. When training on only 1 GPU, changing accum_count to a number larger than 5 seems to fix my problem.
@chandanrao007 seems to be using the already-trained model, and so am I, yet I still face this issue. What I know is that the models released by the author were trained on 4 GPUs. Has anyone verified that matching that effective training setup, either by accumulating gradients over more steps on few GPUs or by using many GPUs, fixes this repetition behavior?
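For what it's worth, here is a back-of-the-envelope sketch of why accum_count would need to grow on fewer GPUs. This is my own reasoning, not taken from the PreSumm code, and the batch_size value is only illustrative: the released models were reportedly trained on 4 GPUs with accum_count 5, so a single-GPU run needs roughly 4x the accumulation to see a comparable amount of data per optimizer update.

```python
# Assumption (not from the PreSumm code): in the usual data-parallel setup,
# each optimizer update consumes roughly batch_size * accum_count * num_gpus
# worth of data.

def effective_batch(batch_size: int, accum_count: int, num_gpus: int) -> int:
    """Approximate amount of data seen per optimizer update."""
    return batch_size * accum_count * num_gpus

# Reported setup for the released models: 4 GPUs with accum_count 5.
reference = effective_batch(batch_size=140, accum_count=5, num_gpus=4)    # 2800

# On a single GPU, accum_count has to grow ~4x to keep updates comparable.
single_gpu = effective_batch(batch_size=140, accum_count=20, num_gpus=1)  # 2800

print(reference, single_gpu)
```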
I have solved this error. I see a lot of open PRs with no one even commenting on them. It is a one-line fix here: inside the for loop, add: if not x.strip(): continue (you can add this in the else clause too). Here x is a line (paragraph) from the source text. When there are blank lines between paragraphs, an empty string is passed as a line and the model is forced to generate a summary sentence from that empty string. The line above skips all empty lines, as in the sketch below.
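In case it helps, a minimal self-contained sketch of what that fix amounts to. Only the empty-line check comes from the comment above; the file name and the load_paragraphs helper are hypothetical, standing in for wherever the loop over source lines lives in your version of the code.

```python
# Sketch of the fix: skip blank lines when splitting the raw source text into
# paragraphs, so the model never receives an empty "document".

def load_paragraphs(path: str):
    with open(path, encoding="utf-8") as f:
        for x in f:                # x is one line (paragraph) of source text
            if not x.strip():      # the one-line fix: ignore empty lines
                continue
            yield x.strip()

paragraphs = list(load_paragraphs("input.txt"))
print(len(paragraphs))
```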
That code line and problem concern a much earlier commit; has anyone already managed to solve this issue with the current version? Thanks in advance!
The author answered this question in an earlier issue: #44. When training on only 1 GPU, changing accum_count to a number larger than 5 seems to fix my problem.
Hi, could you please tell me which particular accum_count value you used?
This seems to happen when you run the model against blank text (I'm using the pre-trained models), so perhaps it's a default output of some kind?