
About ROUGE scores

Open • pltrdy opened this issue on Mar 20 '17 • 5 comments

Hi,

I used seq2seq/metrics/rouge.py in my repo and added some features.

To check my results, I compared my script with yours, and with pyrouge (a wrapper around the official ROUGE script) and pythonrouge (not the latest commit; also a Perl wrapper).

It turns out that (pltrdy.rouge == seq2seq.metrics.rouge) != (pythonrouge == pyrouge). Below I show how to compare seq2seq.metrics.rouge with pyrouge.


Setup

  1. For seq2seq/metrics/rouge.py, I just added:
if __name__ == "__main__":
  import sys 
  import json
  hyp = sys.argv[1]
  ref = sys.argv[2]

  print(json.dumps(rouge([hyp], [ref]), indent=4))
  2. pyrouge (see pyrouge on PyPI), which wraps the official ROUGE-1.5.5 Perl script.
sudo pip install pyrouge
pyrouge_set_rouge_path /<absolute_path_to_ROUGE>/RELEASE-1.5.5/
mkdir hyp
mkdir ref
echo "$HYP" > ./hyp/hyp.001.txt
echo "$REF" > ./ref/ref.A.001.txt

eval_pyrouge.py:

from pyrouge import Rouge155

r = Rouge155()
r.system_dir = './hyp'    # system (hypothesis) summaries
r.model_dir = './ref'     # model (reference) summaries
r.system_filename_pattern = r'hyp.(\d+).txt'
r.model_filename_pattern = 'ref.[A-Z].#ID#.txt'

output = r.convert_and_evaluate()
print(output)
output_dict = r.output_to_dict(output)
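
To line the numbers up against the JSON printed by rouge.py, it can also help to look at the parsed dictionary. A small sketch, assuming only that output_dict is a flat dict of scores; the exact key names (e.g. rouge_1_f_score) may vary between pyrouge versions, so printing everything is the safe option:

# Sketch: print every parsed score so ROUGE-1/2/L precision, recall and F
# can be compared side by side with the seq2seq JSON output below.
for key in sorted(output_dict):
    print(key, output_dict[key])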

Run

Values:

HYP="Tokyo is the one of the biggest city in the world."
REF="The capital of Japan, Tokyo, is the center of Japanese economy."
  • python rouge.py "$HYP" "$REF":
{
    "rouge_l/r_score": 0.27272727272727271, 
    "rouge_l/p_score": 0.27272727272727271, 
    "rouge_1/r_score": 0.29999999999999999, 
    "rouge_1/p_score": 0.33333333333333331, 
    "rouge_l/f_score": 0.27272727272677272, 
    "rouge_1/f_score": 0.31578946869806096, 
    "rouge_2/f_score": 0.099999995000000272, 
    "rouge_2/p_score": 0.10000000000000001, 
    "rouge_2/r_score": 0.10000000000000001
}
  • python eval_pyrouge.py:
1 ROUGE-1 Average_R: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
1 ROUGE-1 Average_P: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
1 ROUGE-1 Average_F: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
---------------------------------------------
1 ROUGE-2 Average_R: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
1 ROUGE-2 Average_P: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
1 ROUGE-2 Average_F: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
---------------------------------------------
[...]
---------------------------------------------
1 ROUGE-L Average_R: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
1 ROUGE-L Average_P: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
1 ROUGE-L Average_F: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
---------------------------------------------
[...]

Any idea?

Thx

pltrdy

pltrdy avatar Mar 20 '17 18:03 pltrdy

Yes, that's expected. The "official" ROUGE script does a bunch of stemming, tokenization, and other preprocessing before calculating the score. The ROUGE metric in here doesn't do any of this, but it's a good enough proxy to use during training to get a sense of what the score will be. As the amount of data increases and sentences become more similar, it should be relatively close (at least in my experiments).
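
For intuition, here is a rough sketch of that difference on the example pair above. It is neither the official Perl pipeline nor the code in this repo, just an illustration of how lowercasing/punctuation stripping and counting matches with multiplicity move the numbers (the helper names are made up):

# Rough illustration only; not the official ROUGE-1.5.5 pipeline and not
# this repo's implementation. It shows how tokenization and counting
# matches with multiplicity change ROUGE-1 recall on the pair above.
import re
from collections import Counter

def clean_tokens(text):
    # Lowercase and keep alphanumeric tokens, roughly what ROUGE-1.5.5
    # does before matching (it can also apply Porter stemming when enabled).
    return re.findall(r"[a-z0-9]+", text.lower())

def unigram_recall(hyp_tokens, ref_tokens):
    # Each reference token counts as matched at most as often as it
    # appears in the hypothesis (clipped counts), as in the ROUGE paper.
    hyp_counts = Counter(hyp_tokens)
    ref_counts = Counter(ref_tokens)
    overlap = sum(min(c, hyp_counts[tok]) for tok, c in ref_counts.items())
    return overlap / len(ref_tokens)

HYP = "Tokyo is the one of the biggest city in the world."
REF = "The capital of Japan, Tokyo, is the center of Japanese economy."

# Naive whitespace tokens keep punctuation attached ("Tokyo," != "Tokyo").
print(unigram_recall(HYP.split(), REF.split()))
# Cleaned tokens give 5/11 ~ 0.4545 on this pair, in line with the ROUGE-1
# recall reported by the official script in the output above.
print(unigram_recall(clean_tokens(HYP), clean_tokens(REF)))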

So the recommended thing to do is to still run the official ROUGE script on the final model if you want to compare to published results.

I don't want to use pyrouge, or some other kind of wrapper around the ROUGE script, because it's:

  1. A real pain to install and get working on various machines
  2. Not openly available, at least not "officially"

I'd love to make the internal score behave more like the official one, but I'm not sure that's really worth the effort.

dennybritz avatar Mar 21 '17 17:03 dennybritz

OK, it makes sense.

I found some results where the difference was around 9 ROUGE points (on 11.5k sentences), which is not close at all. Maybe I made a mistake somewhere, or, as you said, I just can't use it for anything other than training. I wanted to use it to score my predictions, which is impossible with the current variance.

Anyway thx for the clarification

pltrdy avatar Mar 22 '17 11:03 pltrdy

Hm, that's interesting. It would be great to look more into it. When I trained on Gigaword the scores were relatively close.

dennybritz avatar Mar 22 '17 16:03 dennybritz

Another possible reason: perhaps your code has some bugs in calculating the ROUGE scores. https://github.com/pltrdy/rouge/blob/master/rouge/rouge.py#L71 Lines 71 to 76 have more leading whitespace than expected. (Try not to use copy and paste in Python, lol.)

But it seems to have little influence on your example here.

KaiQiangSong avatar Jun 05 '17 18:06 KaiQiangSong

https://github.com/google/seq2seq/blob/7f485894d412e8d81ce0e07977831865e44309ce/seq2seq/metrics/rouge.py#L46 Oh, why use a set? If the same n-gram occurs multiple times, the result will be wrong, won't it? As far as I can see, the paper uses counts rather than the number of unique n-grams. Please tell me what I'm missing; this is driving me crazy.
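
For reference, the difference between a set-based variant and the count-based matching described in the paper can be sketched like this (an illustration with made-up helper names, not the linked code):

# Illustration of set-based vs count-based (clipped) n-gram matching.
# The ROUGE paper counts each matching n-gram up to its frequency in the
# reference; a set collapses repeats to a single match and also shrinks
# the denominator to the number of unique reference n-grams.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def set_recall(hyp, ref, n):
    hyp_set, ref_set = set(ngrams(hyp, n)), set(ngrams(ref, n))
    return len(hyp_set & ref_set) / len(ref_set)

def count_recall(hyp, ref, n):
    hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
    overlap = sum(min(c, hyp_counts[g]) for g, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

ref = "the cat sat on the mat".split()
hyp = "a cat sat on a mat".split()
# The repeated "the" in the reference makes the two conventions disagree:
print(set_recall(hyp, ref, 1))    # 4/5 = 0.8
print(count_recall(hyp, ref, 1))  # 4/6 ~ 0.667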

fseasy avatar Jan 11 '18 12:01 fseasy