Different rouge scores in 1.11.1 and main
Following the Rouge updates, there are changes in rouge scores. In most cases the diff is less than 1 point, but it can be 1-2 points in extreme cases. It is not clear why the new implementation should cause any diff.
Code:
from unitxt import get_logger
from unitxt.api import evaluate, load_dataset
from unitxt.blocks import TaskCard
from unitxt.collections_operators import Wrap
from unitxt.inference import (
    HFPipelineBasedInferenceEngine,
)
from unitxt.loaders import LoadFromDictionary
from unitxt.text_utils import print_dict

logger = get_logger()

dataset = load_dataset(card="cards.xsum", template_card_index=0, loader_limit=10)
test_dataset = dataset["test"]

# Infer using flan t5 base using HF API
model_name = "google/flan-t5-base"
inference_model = HFPipelineBasedInferenceEngine(
    model_name=model_name, max_new_tokens=32
)
predictions = inference_model.infer(test_dataset)

evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "source",
            "prediction",
            "processed_prediction",
            "references",
            "score",
        ],
    )
1.11.1:
global:
    rouge1 (float64): 0.287309678405052
    rouge2 (float64): 0.08183079296761228
    rougeL (float64): 0.20875488491798488
    rougeLsum (float64): 0.20666857055062154
    score (float64): 0.20875488491798488
    score_name (str): rougeL
instance:
    rouge1 (float64): 0.19047619047619052
    rouge2 (float64): 0.05
    rougeL (float64): 0.19047619047619052
    rougeLsum (float64): 0.19047619047619052
    score (float64): 0.19047619047619052
    score_name (str): rougeL

main:
global:
    rougeL (float64): 0.20879996542849094
    score (float64): 0.20879996542849094
    score_name (str): rougeL
    rouge1 (float64): 0.28802664396739114
    rouge2 (float64): 0.08172129913073843
    rougeLsum (float64): 0.20879996542849094
    rougeL_ci_low (float64): 0.15700219128325088
    rougeL_ci_high (float64): 0.2718109259051072
    score_ci_low (float64): 0.15700219128325088
    score_ci_high (float64): 0.2718109259051072
    rouge1_ci_low (float64): 0.23669188547490233
    rouge1_ci_high (float64): 0.34410005760392737
    rouge2_ci_low (float64): 0.04442823342518798
    rouge2_ci_high (float64): 0.13301823219319187
    rougeLsum_ci_low (float64): 0.15700219128325088
    rougeLsum_ci_high (float64): 0.2718109259051072
instance:
    rouge1 (float): 0.19047619047619052
    rouge2 (float): 0.05
    rougeL (float): 0.19047619047619052
    rougeLsum (float): 0.19047619047619052
    score (float): 0.19047619047619052
    score_name (str): rougeL
@dafnapension - If possible, please give this priority, because we want to make a new release this week.
Hi @yoavkatz, will gladly do. A difference is expected, since the older version (the global HF metric) with use_aggregator=True returned a bootstrapped score, whereas now we return a simple average of the instance scores. But I will verify this and try to understand why the diff is so big.
Hi Dafna. Thanks. Instead of calling the HF inference engine, you can just copy the "target" of instance i to the prediction of instance i+1.
This will simulate a model and should solve your problem.
See the example here:
https://github.com/IBM/unitxt/blob/c2fc7ab4caeac1e48d523a34cc34a0cdcc597d16/examples/evaluate_llm_as_judge.py#L43
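Something like this, roughly (a sketch, assuming each processed instance exposes its reference text under "target"; continuing the script from the issue description):

```python
# Rough sketch: build "predictions" by shifting the gold targets by one
# instance, so no inference engine is needed (field name "target" assumed).
test_instances = list(test_dataset)
predictions = [
    test_instances[i - 1]["target"]  # i == 0 wraps around to the last target
    for i in range(len(test_instances))
]
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)
```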
Thanks, @yoavkatz, I did manage to run something and at least came out with some references and predictions:
predictions = ['Prisoners in Wales are facing a "desperate need" for one-bedroom flats, a charity has said.', 'A man has been charged with armed robbery after a man was arrested in Edinburgh.', 'Four teenagers have been charged with hate crimes after a white man was beaten and beaten in a Chicago court.', 'West Bromwich Albion have appointed former Arsenal goalkeeper Mark Hughes as their new director of football.', 'A fasting diet that mimics famine and famine has been shown to reverse the symptoms of type 1 and type 2 diabetes.', 'The merger between two major European manufacturers of spectacle frames and lenses is a big deal.', 'Wendy Houvenaghel has said she felt "vindicated" by British Cycling\'s failures in the World Class Programme.', 'The success of comedy clubs in the US is largely due to the fact that people are willing to laugh.', 'BT\'s shares were up 3% on Thursday after the company\'s chief executive, a former Ofcom executive, said the company was "not', 'Brendan Rodgers says he is looking forward to his first Old Firm derby with Celtic on Saturday.']
references = [['There is a "chronic" need for more housing for prison leavers in Wales, according to a charity.'], ['A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.'], ['Four people accused of kidnapping and torturing a mentally disabled man in a "racially motivated" attack streamed on Facebook have been denied bail.'], ['West Brom have appointed Nicky Hammond as technical director, ending his 20-year association with Reading.'], ['The pancreas can be triggered to regenerate itself through a type of fasting diet, say US researchers.'], ['Since their impending merger was announced in January, there has been remarkably little comment about the huge proposed deal to combine Essilor and Luxottica.'], ['A "medal at any cost" approach created a "culture of fear" at British Cycling, says former rider Wendy Houvenaghel.'], ['Have you heard the one about the computer programmer who bought a failing comedy club in Texas and turned it into a million dollar a year business?'], ["The reaction from BT's investors told us much about media regulator Ofcom's ruling on the fate of Openreach, the BT subsidiary that provides much of the UK's broadband infrastructure."], ["Manager Brendan Rodgers is sure Celtic can exploit the wide open spaces of Hampden when they meet Rangers in Sunday's League Cup semi-final."]]
I then copied the Rouge metric from 1.11.1 and called it OldRouge for this examination:
# Imports added for completeness; in the experiment the class lived inside
# unitxt.metrics, where these are already available.
from typing import Dict, List

from unitxt.metrics import HuggingfaceMetric


class OldRouge(HuggingfaceMetric):
    hf_metric_name = "rouge"
    main_score = "rougeL"
    scale = 1.0
    prediction_type = "str"
    single_reference_per_prediction = False  # multiple references allowed

    use_aggregator: bool = True
    rouge_types: List[str] = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    sent_split_newline: bool = True

    _requirements_list: List[str] = ["nltk", "rouge_score"]

    def prepare(self):
        super().prepare()
        self.hf_compute_args.update(
            {"use_aggregator": self.use_aggregator, "rouge_types": self.rouge_types}
        )
        import nltk

        nltk.download("punkt")
        self.sent_tokenize = nltk.sent_tokenize

    def compute(self, references, predictions, task_data: List[Dict]):
        if self.sent_split_newline:
            predictions = [
                "\n".join(self.sent_tokenize(prediction.strip()))
                for prediction in predictions
            ]
            references = [
                ["\n".join(self.sent_tokenize(r.strip())) for r in reference]
                for reference in references
            ]
        return super().compute(references, predictions, task_data)
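As a side note on sent_split_newline, an illustration only (with made-up strings): rougeLsum treats "\n" as a sentence boundary, so the text is sentence-tokenized and re-joined with newlines before scoring.

```python
# Illustration of what sent_split_newline prepares: rougeLsum expects
# newline-separated sentences, so multi-sentence text is re-joined with "\n".
import nltk

nltk.download("punkt")
text = "West Brom have appointed a technical director. He previously spent 20 years at Reading."
print("\n".join(nltk.sent_tokenize(text)))
# West Brom have appointed a technical director.
# He previously spent 20 years at Reading.
```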
and then easily produced both scores:
for metric in [OldRouge(), Rouge()]:
    print(type(metric))
    outputs = apply_metric(metric, predictions, references)
    print_dict(outputs[0]["score"])
    print("\n")
and received:
<class 'unitxt.metrics.OldRouge'>
global:
rouge1 (float64):
0.2873483299646091
rouge2 (float64):
0.08167020624584168
rougeL (float64):
0.20884075796928347
rougeLsum (float64):
0.20809322745400183
score (float64):
0.20884075796928347
score_name (str):
rougeL
score_ci_low (float64):
0.1643165941694017
score_ci_high (float64):
0.2660200915187788
rougeL_ci_low (float64):
0.1643165941694017
rougeL_ci_high (float64):
0.2660200915187788
instance:
rouge1 (float64):
0.42424242424242425
rouge2 (float64):
0.19354838709677422
rougeL (float64):
0.30303030303030304
rougeLsum (float64):
0.30303030303030304
score (float64):
0.30303030303030304
score_name (str):
rougeL
<class 'unitxt.metrics.Rouge'>
global:
rouge1 (float64):
0.28802664396739114
rouge2 (float64):
0.08172129913073843
rougeLsum (float64):
0.20879996542849094
rougeL (float64):
0.20879996542849094
score (float64):
0.20879996542849094
score_name (str):
rougeL
rouge1_ci_low (float64):
0.23669188547490233
rouge1_ci_high (float64):
0.34410005760392737
rouge2_ci_low (float64):
0.04442823342518798
rouge2_ci_high (float64):
0.13301823219319187
rougeLsum_ci_low (float64):
0.15700219128325088
rougeLsum_ci_high (float64):
0.2718109259051072
rougeL_ci_low (float64):
0.15700219128325088
rougeL_ci_high (float64):
0.2718109259051072
score_ci_low (float64):
0.15700219128325088
score_ci_high (float64):
0.2718109259051072
instance:
rouge1 (float):
0.42424242424242425
rouge2 (float):
0.19354838709677422
rougeL (float):
0.30303030303030304
rougeLsum (float):
0.30303030303030304
score (float):
0.30303030303030304
score_name (str):
rougeL
For the current implementation of Rouge I got results identical to yours. For the HF global metric they are slightly different, perhaps because of a different seed for the randomization in their bootstrap. In the current implementation, rougeL and rougeLsum come out identical; with HF they are very close but not identical (for both metrics, sent_split_newline=True). I think this has to do with their bootstrapping, which we avoid.
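(A minimal check of the rougeL/rougeLsum point, assuming the rouge_score package, with strings shortened from instance 0 above: for a single-sentence prediction and reference the two coincide, since rougeLsum only departs from rougeL once the text contains newline-separated sentences.)

```python
# Minimal check (assumes the rouge_score package): for single-sentence texts,
# rougeL and rougeLsum give the same value.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"])
scores = scorer.score(
    "There is a chronic need for more housing for prison leavers in Wales.",
    "Prisoners in Wales are facing a desperate need for one-bedroom flats.",
)
print(scores["rougeL"].fmeasure == scores["rougeLsum"].fmeasure)  # True
```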
Put side by side (screenshots of the two global-score dumps, not reproduced here):
The differences do not look bigger than we expected. The instance scores are all identical (as expected, just a sanity check), and the global scores are not too surprising, I think. Do you see something exceptional?
Hi Dafna. Can you look at all the instance scores and not only the first? Perhaps there is one instance with a big difference that affects the whole average. As I mentioned, in most runs the diff is small, but even in the example above rouge1 has a 0.7 point diff.
1.11.1 global:
    rouge1 (float64): 0.287309678405052
    rouge2 (float64): 0.08183079296761228
    rougeL (float64): 0.20875488491798488
    rougeLsum (float64): 0.20666857055062154

main global:
    rouge1 (float64): 0.28802664396739114
    rouge2 (float64): 0.08172129913073843
    rougeLsum (float64): 0.20879996542849094
Hi @yoavkatz, of course! Here is the small script I used (over the predictions and references given above):
outputs_rouge = apply_metric(Rouge(), predictions, references)
outputs_old_rouge = apply_metric(OldRouge(), predictions, references)

print("\tCurrent Rouge\tHF Rouge")
for i, (current, old) in enumerate(zip(outputs_rouge, outputs_old_rouge)):
    print(f"instance {i}")
    for score_name in ["rouge1", "rouge2", "rougeL", "rougeLsum", "score", "score_name"]:
        cu_score = current["score"]["instance"][score_name]
        ol_score = old["score"]["instance"][score_name]
        print(f"{score_name};{cu_score};{ol_score}")
and got this table (pasted from Excel). It seems that all instances have identical scores:
| | Current Rouge | HF Rouge |
|---|---|---|
| instance 0 | | |
| rouge1 | 0.424242424 | 0.424242424 |
| rouge2 | 0.193548387 | 0.193548387 |
| rougeL | 0.303030303 | 0.303030303 |
| rougeLsum | 0.303030303 | 0.303030303 |
| score | 0.303030303 | 0.303030303 |
| score_name | rougeL | rougeL |
| instance 1 | | |
| rouge1 | 0.375 | 0.375 |
| rouge2 | 0.2 | 0.2 |
| rougeL | 0.375 | 0.375 |
| rougeLsum | 0.375 | 0.375 |
| score | 0.375 | 0.375 |
| score_name | rougeL | rougeL |
| instance 2 | | |
| rouge1 | 0.372093023 | 0.372093023 |
| rouge2 | 0.097560976 | 0.097560976 |
| rougeL | 0.23255814 | 0.23255814 |
| rougeLsum | 0.23255814 | 0.23255814 |
| score | 0.23255814 | 0.23255814 |
| score_name | rougeL | rougeL |
| instance 3 | | |
| rouge1 | 0.3125 | 0.3125 |
| rouge2 | 0.066666667 | 0.066666667 |
| rougeL | 0.3125 | 0.3125 |
| rougeLsum | 0.3125 | 0.3125 |
| score | 0.3125 | 0.3125 |
| score_name | rougeL | rougeL |
| instance 4 | | |
| rouge1 | 0.358974359 | 0.358974359 |
| rouge2 | 0.054054054 | 0.054054054 |
| rougeL | 0.153846154 | 0.153846154 |
| rougeLsum | 0.153846154 | 0.153846154 |
| score | 0.153846154 | 0.153846154 |
| score_name | rougeL | rougeL |
| instance 5 | | |
| rouge1 | 0.2 | 0.2 |
| rouge2 | 0 | 0 |
| rougeL | 0.1 | 0.1 |
| rougeLsum | 0.1 | 0.1 |
| score | 0.1 | 0.1 |
| score_name | rougeL | rougeL |
| instance 6 | | |
| rouge1 | 0.222222222 | 0.222222222 |
| rouge2 | 0.117647059 | 0.117647059 |
| rougeL | 0.111111111 | 0.111111111 |
| rougeLsum | 0.111111111 | 0.111111111 |
| score | 0.111111111 | 0.111111111 |
| score_name | rougeL | rougeL |
| instance 7 | | |
| rouge1 | 0.170212766 | 0.170212766 |
| rouge2 | 0 | 0 |
| rougeL | 0.127659574 | 0.127659574 |
| rougeLsum | 0.127659574 | 0.127659574 |
| score | 0.127659574 | 0.127659574 |
| score_name | rougeL | rougeL |
| instance 8 | | |
| rouge1 | 0.254545455 | 0.254545455 |
| rouge2 | 0.037735849 | 0.037735849 |
| rougeL | 0.181818182 | 0.181818182 |
| rougeLsum | 0.181818182 | 0.181818182 |
| score | 0.181818182 | 0.181818182 |
| score_name | rougeL | rougeL |
| instance 9 | | |
| rouge1 | 0.19047619 | 0.19047619 |
| rouge2 | 0.05 | 0.05 |
| rougeL | 0.19047619 | 0.19047619 |
| rougeLsum | 0.19047619 | 0.19047619 |
| score | 0.19047619 | 0.19047619 |
| score_name | rougeL | rougeL |
The difference you are looking at (which you bolded in https://github.com/IBM/unitxt/issues/1078#issuecomment-2257871344) is not 0.7, it is 0.0007.
You are right (I meant 0.07 points, which is 0.0007 in absolute terms).
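(For the record, plugging in the rouge1 values from the two dumps above:)

```python
# rouge1 global scores copied from the 1.11.1 and main dumps above
old, new = 0.287309678405052, 0.28802664396739114
print(new - old)           # ~0.000717 in absolute terms
print((new - old) * 100)   # ~0.07 "points" on a 0-100 scale
```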
The fact that all the instance scores are the same, but the aggregation is different, is something to consider.
The reason is that in the OldRouge:
- Instance results - each instance was passed to the HF metric on its own
- Global results - were calculated by passing all the predictions and references to the HF metric
In the new code:
- Instance results - each instance is calculated on its own
- Global result is the average of instance results.
So it seems there is some difference in the global result between the two approaches.
This is the rouge code:
https://huggingface.co/spaces/evaluate-metric/rouge/blob/e2671c0764b07f287918af2338dfbd162c14cd07/rouge.py#L121
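For reference, a minimal sketch of the two aggregation paths against the HF metric directly (assuming the evaluate and rouge_score packages are installed, and the predictions/references listed earlier in the thread):

```python
# Sketch only: old path = one corpus-level call with use_aggregator=True
# (bootstrapped mid estimate); new path = per-instance scores, then a mean.
import evaluate
import numpy as np

rouge = evaluate.load("rouge")
refs = [r[0] for r in references]  # a single reference per prediction here

old_global = rouge.compute(predictions=predictions, references=refs, use_aggregator=True)

per_instance = rouge.compute(predictions=predictions, references=refs, use_aggregator=False)
new_global = {k: float(np.mean(v)) for k, v in per_instance.items()}

print(old_global["rougeL"], new_global["rougeL"])  # close, but not identical
```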
Hi @yoavkatz, yes, in our implementation we average the instance scores to get the global result. HF, when use_aggregator=True, bootstraps the instance scores: they resample from the list of instance scores many times, average each resample to get that resample's global score, and return to us the median of these resampled global scores. So I think we can anticipate some difference.
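Roughly, the aggregation on the HF side looks like this (a toy sketch with numpy, not the actual rouge_score BootstrapAggregator code; the instance scores are the rougeL values from the table above, rounded, and the seed is arbitrary):

```python
# Toy sketch of bootstrap aggregation vs. a plain average (illustrative only).
import numpy as np

def bootstrap_mid(instance_scores, n_resamples=1000, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(instance_scores)
    resample_means = [
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_resamples)
    ]
    return float(np.median(resample_means))

instance_rougeL = [0.303, 0.375, 0.233, 0.3125, 0.154, 0.1, 0.111, 0.128, 0.182, 0.190]
print(np.mean(instance_rougeL))        # simple average (current unitxt global)
print(bootstrap_mid(instance_rougeL))  # median of resampled means (old HF global)
```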
I will make another experiment: I will use the OldRouge, but with use_aggregator=False, and will average the returned list (of instance scores) by myself. Coming up.
Now I understand. Thank you. I also talked with Elron. Since people are used to the HF Rouge score, we need to be comparable with it. One way to do it is to actually run the same code which uses their bootstrapping and not ours. This would require changing the Rouge code back to a GlobalMetric.
Another option is to use our bootstrapping, but allow overwriting the score with the median of the bootstrap, like they do (if a flag is set, not by default).
Thanks, @yoavkatz, just to complete the "proof": indeed, averaging the list returned from HF when use_aggregator=False yields the same results as we get with our implementation:
| | Current Rouge | HF Rouge |
|---|---|---|
| rouge1 | 0.288026644 | 0.288026644 |
| rougeLsum | 0.208799965 | 0.208799965 |
| rougeL | 0.208799965 | 0.208799965 |
| score | 0.208799965 | 0.208799965 |
| score_name | rougeL | rougeL |
| rouge2 | 0.081721299 | 0.081721299 |
| rouge1_ci_low | 0.236691885 | not_computed |
| rouge1_ci_high | 0.344100058 | not_computed |
| rougeLsum_ci_low | 0.157002191 | not_computed |
| rougeLsum_ci_high | 0.271810926 | not_computed |
| rougeL_ci_low | 0.157002191 | not_computed |
| rougeL_ci_high | 0.271810926 | not_computed |
| score_ci_low | 0.157002191 | not_computed |
| score_ci_high | 0.271810926 | not_computed |
| rouge2_ci_low | 0.044428233 | not_computed |
| rouge2_ci_high | 0.133018232 | not_computed |
(generated by this piece of code, which uses the OldRouge, now with use_aggregator=False and n_resamples=0, since we cannot compute confidence intervals over lists of scores):
import numpy as np

outputs_rouge = apply_metric(Rouge(), predictions, references)
outputs_old_rouge = apply_metric(OldRouge(), predictions, references)

print("*** global score of old_rouge:*****")
old_global = outputs_old_rouge[0]["score"]["global"]
print_dict(old_global)

print("*** averaging the list of scores using np.nanmean()*** ")
for score_name in old_global:
    if score_name == "score_name":
        continue
    old_global[score_name] = np.nanmean(old_global[score_name])
print_dict(old_global)

print("*** comparing averaged_old_global against current_global*****")
current_global = outputs_rouge[0]["score"]["global"]
for score_name in current_global:
    cu_score = current_global[score_name]
    if score_name in old_global:
        ol_score = old_global[score_name]
    else:
        ol_score = 'not_computed'
    print(f"{score_name};{cu_score};{ol_score}")
The second option may be simpler:
Just change the code here:
What do you think?
Hi @yoavkatz, I think we can also offer all the options we have, explain the pros and cons, and let the user choose whatever they want.
We want to make it simple and backward compatible; later we can change it. So we suggest:
1) Have a flag in the metric, override_score_with_ci_mid, which for now will only be set to true in Rouge.
2) Change the above code to:

result[f"{full_score_name}_ci_low"] = ci.low
result[f"{full_score_name}_ci_high"] = ci.high
if self.override_score_with_ci_mid:
    result[full_score_name] = ci.mid
if score_name == self.main_score:
    result["score_ci_low"] = ci.low
    result["score_ci_high"] = ci.high
    if self.override_score_with_ci_mid:
        result["score"] = ci.mid

3) Set n_resamples to 1000 in Rouge.
Coming up. Set n_resamples to 1000, to match what HF uses?
Yes. That's the default there.
Hi @yoav, I am pushing a PR for you to see. I am rerunning the inference I used during the day to compare. Still not the same scores. Looking into it.
Addressed in PR #1084, and concluded with a decision to maintain the existing implementation: https://github.com/IBM/unitxt/pull/1084#issuecomment-2267440729