I also got a slightly lower ROUGE score for the same code
Hello! I appreciate and am inspired by your great work on extractive summarization. I ran the script from your GitHub repository and got the following ROUGE scores:
For the Transformer model:
On paper: ROUGE-F(1/2/L): 43.25/20.24/39.63
My best score when I ran it: ROUGE-F(1/2/L): 43.04/20.19/39.48

The values above are the best scores I got from the model following the instructions, while the paper reports an average over three checkpoints, which would make my numbers even lower. This part is the same as issue #100.
What I have done to reproduce the results:
- The CNN/DM data came from https://drive.google.com/file/d/1DN7ClZCCXsk2KegmC6t4ClBwtAf5galI/view , i.e. the preprocessed version you provide, without any preprocessing of my own.
- The ROUGE test passed successfully (see the sketch after this list).
- Both the training and validation settings are the same as in https://github.com/ShehabMMohamed/PreSumm#readme .

I used one NVIDIA 1080 GPU.
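
For reference, this is roughly how I verified the ROUGE installation. A minimal sketch, assuming pyrouge is installed and the ROUGE-1.5.5 path is already configured (e.g. via `pyrouge_set_rouge_path`); the file names and patterns here are only illustrative:

```python
# Sanity check for the ROUGE setup: scoring a candidate summary against an
# identical reference should give F-scores of ~1.0 for ROUGE-1/2/L.
import os
import tempfile

from pyrouge import Rouge155

with tempfile.TemporaryDirectory() as tmp:
    sys_dir = os.path.join(tmp, "system")
    ref_dir = os.path.join(tmp, "reference")
    os.makedirs(sys_dir)
    os.makedirs(ref_dir)

    text = "the quick brown fox jumps over the lazy dog .\n"
    with open(os.path.join(sys_dir, "doc.001.txt"), "w") as f:
        f.write(text)
    with open(os.path.join(ref_dir, "doc.A.001.txt"), "w") as f:
        f.write(text)

    r = Rouge155()  # assumes ROUGE-1.5.5 is already configured
    r.system_dir = sys_dir
    r.model_dir = ref_dir
    r.system_filename_pattern = r"doc.(\d+).txt"
    r.model_filename_pattern = "doc.[A-Z].#ID#.txt"

    print(r.convert_and_evaluate())  # all F-scores should be close to 1.0
```

Alternatively, pyrouge ships its own test suite, runnable with `python -m pyrouge.test`.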
But if I instead take the upper bound of each ROUGE score's 95% confidence interval for the three checkpoints and average them:

- ROUGE-F1: (43.278 + 43.058 + 43.187) / 3 = 43.174
- ROUGE-F2: (20.441 + 20.414 + 20.297) / 3 = 20.384
- ROUGE-FL: (39.709 + 39.662 + 39.541) / 3 = 39.637

this averaged score seems more acceptable than the earlier 43.04/20.19/39.48, which is the result of a single model checkpoint without averaging, and it is also much closer to the score in the paper.
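
To make the arithmetic explicit, here is a minimal Python sketch of that averaging; the values are the confidence-interval upper bounds from my three checkpoints, grouped per metric exactly as written above (the checkpoint-to-value pairing is my assumption):

```python
# Average the 95%-confidence-interval upper bounds over three checkpoints.
upper_bounds = {
    "ROUGE-F1": [43.278, 43.058, 43.187],
    "ROUGE-F2": [20.441, 20.414, 20.297],
    "ROUGE-FL": [39.709, 39.662, 39.541],
}

for metric, scores in upper_bounds.items():
    print(f"{metric}: {sum(scores) / len(scores):.4f}")

# Output:
# ROUGE-F1: 43.1743
# ROUGE-F2: 20.3840
# ROUGE-FL: 39.6373
```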
So, I have several questions:
Q1: Can the training settings in the README reach the same score as the paper? If not, may I ask for the settings needed to reproduce a better score?
Q2: May I ask which three model checkpoints you selected for the testing phase?
Q3: Should the score be calculated as I discussed above, i.e. using the upper bound of each ROUGE score's confidence interval?
Hope to get your response! Thanks.