Different rouge scores in 1.11.1 and main
Following the Rouge updates, there are changes in rouge scores. In most cases the diff is less than 1 point, but it can be 1-2 points in extreme cases. It is not clear why the new implementation should cause any diff.
Code:
from unitxt import get_logger
from unitxt.api import evaluate, load_dataset
from unitxt.blocks import TaskCard
from unitxt.collections_operators import Wrap
from unitxt.inference import (
    HFPipelineBasedInferenceEngine,
)
from unitxt.loaders import LoadFromDictionary
from unitxt.text_utils import print_dict

logger = get_logger()

dataset = load_dataset(card="cards.xsum", template_card_index=0, loader_limit=10)
test_dataset = dataset["test"]

# Infer using flan t5 base using HF API
model_name = "google/flan-t5-base"
inference_model = HFPipelineBasedInferenceEngine(
    model_name=model_name, max_new_tokens=32
)
predictions = inference_model.infer(test_dataset)

evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "source",
            "prediction",
            "processed_prediction",
            "references",
            "score",
        ],
    )
1.11.1:
global:
    rouge1 (float64): 0.287309678405052
    rouge2 (float64): 0.08183079296761228
    rougeL (float64): 0.20875488491798488
    rougeLsum (float64): 0.20666857055062154
    score (float64): 0.20875488491798488
    score_name (str): rougeL
instance:
    rouge1 (float64): 0.19047619047619052
    rouge2 (float64): 0.05
    rougeL (float64): 0.19047619047619052
    rougeLsum (float64): 0.19047619047619052
    score (float64): 0.19047619047619052
    score_name (str): rougeL

main:
global:
    rougeL (float64): 0.20879996542849094
    score (float64): 0.20879996542849094
    score_name (str): rougeL
    rouge1 (float64): 0.28802664396739114
    rouge2 (float64): 0.08172129913073843
    rougeLsum (float64): 0.20879996542849094
    rougeL_ci_low (float64): 0.15700219128325088
    rougeL_ci_high (float64): 0.2718109259051072
    score_ci_low (float64): 0.15700219128325088
    score_ci_high (float64): 0.2718109259051072
    rouge1_ci_low (float64): 0.23669188547490233
    rouge1_ci_high (float64): 0.34410005760392737
    rouge2_ci_low (float64): 0.04442823342518798
    rouge2_ci_high (float64): 0.13301823219319187
    rougeLsum_ci_low (float64): 0.15700219128325088
    rougeLsum_ci_high (float64): 0.2718109259051072
instance:
    rouge1 (float): 0.19047619047619052
    rouge2 (float): 0.05
    rougeL (float): 0.19047619047619052
    rougeLsum (float): 0.19047619047619052
    score (float): 0.19047619047619052
    score_name (str): rougeL
@dafnapension - If possible, please give this priority, because we want to make a new release this week.
Hi @yoavkatz, will gladly do. A difference is expected, since the older version (the global HF metric) with use_aggregator=True returned a bootstrapped score, whereas now we return a simple average of the instance scores. But I will verify this and try to understand why the diff is so big.
Hi Dafna. Thanks. Instead of calling the HF inference engine, you can just copy the "target" of instance i to the prediction of instance i+1.
This will simulate a model and should solve your problem.
See the example here:
https://github.com/IBM/unitxt/blob/c2fc7ab4caeac1e48d523a34cc34a0cdcc597d16/examples/evaluate_llm_as_judge.py#L43
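Something like this, roughly (a sketch, assuming each processed instance exposes its reference text under "target"; continuing the script from the issue description):

```python
# Rough sketch: build "predictions" by shifting the gold targets by one
# instance, so no inference engine is needed (field name "target" assumed).
test_instances = list(test_dataset)
predictions = [
    test_instances[i - 1]["target"]  # i == 0 wraps around to the last target
    for i in range(len(test_instances))
]
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)
```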
Thanks, @yoavkatz, I did manage to run something and at least came out with some references and predictions:
predictions = ['Prisoners in Wales are facing a "desperate need" for one-bedroom flats, a charity has said.', 'A man has been charged with armed robbery after a man was arrested in Edinburgh.', 'Four teenagers have been charged with hate crimes after a white man was beaten and beaten in a Chicago court.', 'West Bromwich Albion have appointed former Arsenal goalkeeper Mark Hughes as their new director of football.', 'A fasting diet that mimics famine and famine has been shown to reverse the symptoms of type 1 and type 2 diabetes.', 'The merger between two major European manufacturers of spectacle frames and lenses is a big deal.', 'Wendy Houvenaghel has said she felt "vindicated" by British Cycling\'s failures in the World Class Programme.', 'The success of comedy clubs in the US is largely due to the fact that people are willing to laugh.', 'BT\'s shares were up 3% on Thursday after the company\'s chief executive, a former Ofcom executive, said the company was "not', 'Brendan Rodgers says he is looking forward to his first Old Firm derby with Celtic on Saturday.']
references = [['There is a "chronic" need for more housing for prison leavers in Wales, according to a charity.'], ['A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.'], ['Four people accused of kidnapping and torturing a mentally disabled man in a "racially motivated" attack streamed on Facebook have been denied bail.'], ['West Brom have appointed Nicky Hammond as technical director, ending his 20-year association with Reading.'], ['The pancreas can be triggered to regenerate itself through a type of fasting diet, say US researchers.'], ['Since their impending merger was announced in January, there has been remarkably little comment about the huge proposed deal to combine Essilor and Luxottica.'], ['A "medal at any cost" approach created a "culture of fear" at British Cycling, says former rider Wendy Houvenaghel.'], ['Have you heard the one about the computer programmer who bought a failing comedy club in Texas and turned it into a million dollar a year business?'], ["The reaction from BT's investors told us much about media regulator Ofcom's ruling on the fate of Openreach, the BT subsidiary that provides much of the UK's broadband infrastructure."], ["Manager Brendan Rodgers is sure Celtic can exploit the wide open spaces of Hampden when they meet Rangers in Sunday's League Cup semi-final."]]
I then copied the Rouge metric from 1.11.1 and called it OldRouge for this examination:
# Imports added for completeness; in the experiment the class lived inside
# unitxt.metrics, where these are already available.
from typing import Dict, List

from unitxt.metrics import HuggingfaceMetric


class OldRouge(HuggingfaceMetric):
    hf_metric_name = "rouge"
    main_score = "rougeL"
    scale = 1.0
    prediction_type = "str"
    single_reference_per_prediction = False  # multiple references allowed

    use_aggregator: bool = True
    rouge_types: List[str] = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    sent_split_newline: bool = True

    _requirements_list: List[str] = ["nltk", "rouge_score"]

    def prepare(self):
        super().prepare()
        self.hf_compute_args.update(
            {"use_aggregator": self.use_aggregator, "rouge_types": self.rouge_types}
        )
        import nltk

        nltk.download("punkt")
        self.sent_tokenize = nltk.sent_tokenize

    def compute(self, references, predictions, task_data: List[Dict]):
        if self.sent_split_newline:
            predictions = [
                "\n".join(self.sent_tokenize(prediction.strip()))
                for prediction in predictions
            ]
            references = [
                ["\n".join(self.sent_tokenize(r.strip())) for r in reference]
                for reference in references
            ]
        return super().compute(references, predictions, task_data)
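As a side note on sent_split_newline, an illustration only (with made-up strings): rougeLsum treats "\n" as a sentence boundary, so the text is sentence-tokenized and re-joined with newlines before scoring.

```python
# Illustration of what sent_split_newline prepares: rougeLsum expects
# newline-separated sentences, so multi-sentence text is re-joined with "\n".
import nltk

nltk.download("punkt")
text = "West Brom have appointed a technical director. He previously spent 20 years at Reading."
print("\n".join(nltk.sent_tokenize(text)))
# West Brom have appointed a technical director.
# He previously spent 20 years at Reading.
```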
and then easily produced both scores:
for metric in [OldRouge(), Rouge()]:
    print(type(metric))
    outputs = apply_metric(metric, predictions, references)
    print_dict(outputs[0]["score"])
    print("\n")
and received:
<class 'unitxt.metrics.OldRouge'>
global:
rouge1 (float64):
0.2873483299646091
rouge2 (float64):
0.08167020624584168
rougeL (float64):
0.20884075796928347
rougeLsum (float64):
0.20809322745400183
score (float64):
0.20884075796928347
score_name (str):
rougeL
score_ci_low (float64):
0.1643165941694017
score_ci_high (float64):
0.2660200915187788
rougeL_ci_low (float64):
0.1643165941694017
rougeL_ci_high (float64):
0.2660200915187788
instance:
rouge1 (float64):
0.42424242424242425
rouge2 (float64):
0.19354838709677422
rougeL (float64):
0.30303030303030304
rougeLsum (float64):
0.30303030303030304
score (float64):
0.30303030303030304
score_name (str):
rougeL
<class 'unitxt.metrics.Rouge'>
global:
rouge1 (float64):
0.28802664396739114
rouge2 (float64):
0.08172129913073843
rougeLsum (float64):
0.20879996542849094
rougeL (float64):
0.20879996542849094
score (float64):
0.20879996542849094
score_name (str):
rougeL
rouge1_ci_low (float64):
0.23669188547490233
rouge1_ci_high (float64):
0.34410005760392737
rouge2_ci_low (float64):
0.04442823342518798
rouge2_ci_high (float64):
0.13301823219319187
rougeLsum_ci_low (float64):
0.15700219128325088
rougeLsum_ci_high (float64):
0.2718109259051072
rougeL_ci_low (float64):
0.15700219128325088
rougeL_ci_high (float64):
0.2718109259051072
score_ci_low (float64):
0.15700219128325088
score_ci_high (float64):
0.2718109259051072
instance:
rouge1 (float):
0.42424242424242425
rouge2 (float):
0.19354838709677422
rougeL (float):
0.30303030303030304
rougeLsum (float):
0.30303030303030304
score (float):
0.30303030303030304
score_name (str):
rougeL
For the current implementation of Rouge I got results identical to yours. For the HF global metric they are slightly different, perhaps because of a different seed for the randomization in their bootstrap. In the current implementation, rougeL and rougeLsum come out identical; with HF they are very close but not identical (for both metrics, sent_split_newline=True). I think this has to do with their bootstrapping, which we avoid.
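(A minimal check of the rougeL/rougeLsum point, assuming the rouge_score package, with strings shortened from instance 0 above: for a single-sentence prediction and reference the two coincide, since rougeLsum only departs from rougeL once the text contains newline-separated sentences.)

```python
# Minimal check (assumes the rouge_score package): for single-sentence texts,
# rougeL and rougeLsum give the same value.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"])
scores = scorer.score(
    "There is a chronic need for more housing for prison leavers in Wales.",
    "Prisoners in Wales are facing a desperate need for one-bedroom flats.",
)
print(scores["rougeL"].fmeasure == scores["rougeLsum"].fmeasure)  # True
```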
Put side by side (screenshots of the two global-score dumps, not reproduced here):
The differences do not look bigger than we expected. The instance scores are all identical (as expected, just a sanity check), and the global scores are not too surprising, I think. Do you see something exceptional?
Hi Dafna. Can you look at all the instance scores and not only the first? Perhaps there is one instance with a big difference that affects the whole average. As I mentioned, in most runs the diff is small, but even in the example above rouge1 has a 0.7 point diff.
1.11.1 global:
    rouge1 (float64): 0.287309678405052
    rouge2 (float64): 0.08183079296761228
    rougeL (float64): 0.20875488491798488
    rougeLsum (float64): 0.20666857055062154

main global:
    rouge1 (float64): 0.28802664396739114
    rouge2 (float64): 0.08172129913073843
    rougeLsum (float64): 0.20879996542849094
Hi @yoavkatz, of course! Here is the small script I used (over the predictions and references given above):
outputs_rouge = apply_metric(Rouge(), predictions, references)
outputs_old_rouge = apply_metric(OldRouge(), predictions, references)

print("\tCurrent Rouge\tHF Rouge")
for i, (current, old) in enumerate(zip(outputs_rouge, outputs_old_rouge)):
    print(f"instance {i}")
    for score_name in ["rouge1", "rouge2", "rougeL", "rougeLsum", "score", "score_name"]:
        cu_score = current["score"]["instance"][score_name]
        ol_score = old["score"]["instance"][score_name]
        print(f"{score_name};{cu_score};{ol_score}")
and got this table (pasted from Excel). It seems that all instances have identical scores:
| | Current Rouge | HF Rouge |
|---|---|---|
| instance 0 | | |
| rouge1 | 0.424242424 | 0.424242424 |
| rouge2 | 0.193548387 | 0.193548387 |
| rougeL | 0.303030303 | 0.303030303 |
| rougeLsum | 0.303030303 | 0.303030303 |
| score | 0.303030303 | 0.303030303 |
| score_name | rougeL | rougeL |
| instance 1 | | |
| rouge1 | 0.375 | 0.375 |
| rouge2 | 0.2 | 0.2 |
| rougeL | 0.375 | 0.375 |
| rougeLsum | 0.375 | 0.375 |
| score | 0.375 | 0.375 |
| score_name | rougeL | rougeL |
| instance 2 | | |
| rouge1 | 0.372093023 | 0.372093023 |
| rouge2 | 0.097560976 | 0.097560976 |
| rougeL | 0.23255814 | 0.23255814 |
| rougeLsum | 0.23255814 | 0.23255814 |
| score | 0.23255814 | 0.23255814 |
| score_name | rougeL | rougeL |
| instance 3 | | |
| rouge1 | 0.3125 | 0.3125 |
| rouge2 | 0.066666667 | 0.066666667 |
| rougeL | 0.3125 | 0.3125 |
| rougeLsum | 0.3125 | 0.3125 |
| score | 0.3125 | 0.3125 |
| score_name | rougeL | rougeL |
| instance 4 | | |
| rouge1 | 0.358974359 | 0.358974359 |
| rouge2 | 0.054054054 | 0.054054054 |
| rougeL | 0.153846154 | 0.153846154 |
| rougeLsum | 0.153846154 | 0.153846154 |
| score | 0.153846154 | 0.153846154 |
| score_name | rougeL | rougeL |
| instance 5 | | |
| rouge1 | 0.2 | 0.2 |
| rouge2 | 0 | 0 |
| rougeL | 0.1 | 0.1 |
| rougeLsum | 0.1 | 0.1 |
| score | 0.1 | 0.1 |
| score_name | rougeL | rougeL |
| instance 6 | | |
| rouge1 | 0.222222222 | 0.222222222 |
| rouge2 | 0.117647059 | 0.117647059 |
| rougeL | 0.111111111 | 0.111111111 |
| rougeLsum | 0.111111111 | 0.111111111 |
| score | 0.111111111 | 0.111111111 |
| score_name | rougeL | rougeL |
| instance 7 | | |
| rouge1 | 0.170212766 | 0.170212766 |
| rouge2 | 0 | 0 |
| rougeL | 0.127659574 | 0.127659574 |
| rougeLsum | 0.127659574 | 0.127659574 |
| score | 0.127659574 | 0.127659574 |
| score_name | rougeL | rougeL |
| instance 8 | | |
| rouge1 | 0.254545455 | 0.254545455 |
| rouge2 | 0.037735849 | 0.037735849 |
| rougeL | 0.181818182 | 0.181818182 |
| rougeLsum | 0.181818182 | 0.181818182 |
| score | 0.181818182 | 0.181818182 |
| score_name | rougeL | rougeL |
| instance 9 | | |
| rouge1 | 0.19047619 | 0.19047619 |
| rouge2 | 0.05 | 0.05 |
| rougeL | 0.19047619 | 0.19047619 |
| rougeLsum | 0.19047619 | 0.19047619 |
| score | 0.19047619 | 0.19047619 |
| score_name | rougeL | rougeL |
The difference you are looking at (which you bolded in https://github.com/IBM/unitxt/issues/1078#issuecomment-2257871344) is not 0.7, it is 0.0007.
You are right (I meant 0.07 points, which is 0.0007 in absolute terms).
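(For the record, plugging in the rouge1 values from the two dumps above:)

```python
# rouge1 global scores copied from the 1.11.1 and main dumps above
old, new = 0.287309678405052, 0.28802664396739114
print(new - old)           # ~0.000717 in absolute terms
print((new - old) * 100)   # ~0.07 "points" on a 0-100 scale
```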
The fact that all the instance scores are the same, but the aggregation is different, is something to consider.
The reason is that in the OldRouge:
- Instance results - each instance was passed to the HF metric on its own
- Global results - were calculated by passing all the predictions and references to the HF metric
In the new code:
- Instance results - each instance is calculated on its own
- Global result is the average of instance results.
So it seems there is some difference in the global result between the two approaches.
This is the rouge code:
https://huggingface.co/spaces/evaluate-metric/rouge/blob/e2671c0764b07f287918af2338dfbd162c14cd07/rouge.py#L121
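For reference, a minimal sketch of the two aggregation paths against the HF metric directly (assuming the evaluate and rouge_score packages are installed, and the predictions/references listed earlier in the thread):

```python
# Sketch only: old path = one corpus-level call with use_aggregator=True
# (bootstrapped mid estimate); new path = per-instance scores, then a mean.
import evaluate
import numpy as np

rouge = evaluate.load("rouge")
refs = [r[0] for r in references]  # a single reference per prediction here

old_global = rouge.compute(predictions=predictions, references=refs, use_aggregator=True)

per_instance = rouge.compute(predictions=predictions, references=refs, use_aggregator=False)
new_global = {k: float(np.mean(v)) for k, v in per_instance.items()}

print(old_global["rougeL"], new_global["rougeL"])  # close, but not identical
```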
Hi @yoavkatz, yes, in our implementation we average the instance scores to get the global result. HF, when use_aggregator=True, bootstraps the instance scores: they resample from the list of instance scores many times, average each resample to get that resample's global score, and return to us the median of these resampled global scores. So I think we can anticipate some difference.
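Roughly, the aggregation on the HF side looks like this (a toy sketch with numpy, not the actual rouge_score BootstrapAggregator code; the instance scores are the rougeL values from the table above, rounded, and the seed is arbitrary):

```python
# Toy sketch of bootstrap aggregation vs. a plain average (illustrative only).
import numpy as np

def bootstrap_mid(instance_scores, n_resamples=1000, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(instance_scores)
    resample_means = [
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_resamples)
    ]
    return float(np.median(resample_means))

instance_rougeL = [0.303, 0.375, 0.233, 0.3125, 0.154, 0.1, 0.111, 0.128, 0.182, 0.190]
print(np.mean(instance_rougeL))        # simple average (current unitxt global)
print(bootstrap_mid(instance_rougeL))  # median of resampled means (old HF global)
```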
I will make another experiment: I will use the OldRouge, but with use_aggregator=False, and will average the returned list (of instance scores) by myself. Coming up.
Now I understand. Thank you. I also talked with Elron. Since people are used to the HF Rouge score, we need to be comparable with it. One way to do it is to actually run the same code which uses their bootstrapping and not ours. This would require changing the Rouge code back to a GlobalMetric.
Another option is to use our bootstrapping, but allow overwriting the score with the median of the bootstrap, like they do (if a flag is set, not by default).
Thanks, @yoavkatz, just to complete the "proof": indeed, averaging the list returned from HF when use_aggregator=False yields the same results as we get with our implementation:
| | Current Rouge | HF Rouge |
|---|---|---|
| rouge1 | 0.288026644 | 0.288026644 |
| rougeLsum | 0.208799965 | 0.208799965 |
| rougeL | 0.208799965 | 0.208799965 |
| score | 0.208799965 | 0.208799965 |
| score_name | rougeL | rougeL |
| rouge2 | 0.081721299 | 0.081721299 |
| rouge1_ci_low | 0.236691885 | not_computed |
| rouge1_ci_high | 0.344100058 | not_computed |
| rougeLsum_ci_low | 0.157002191 | not_computed |
| rougeLsum_ci_high | 0.271810926 | not_computed |
| rougeL_ci_low | 0.157002191 | not_computed |
| rougeL_ci_high | 0.271810926 | not_computed |
| score_ci_low | 0.157002191 | not_computed |
| score_ci_high | 0.271810926 | not_computed |
| rouge2_ci_low | 0.044428233 | not_computed |
| rouge2_ci_high | 0.133018232 | not_computed |
(generated by this piece of code, which uses the OldRouge, now with use_aggregator=False and n_resamples=0, since we cannot compute confidence intervals over lists of scores):
import numpy as np

outputs_rouge = apply_metric(Rouge(), predictions, references)
outputs_old_rouge = apply_metric(OldRouge(), predictions, references)

print("*** global score of old_rouge:*****")
old_global = outputs_old_rouge[0]["score"]["global"]
print_dict(old_global)

print("*** averaging the list of scores using np.nanmean()*** ")
for score_name in old_global:
    if score_name == "score_name":
        continue
    old_global[score_name] = np.nanmean(old_global[score_name])
print_dict(old_global)

print("*** comparing averaged_old_global against current_global*****")
current_global = outputs_rouge[0]["score"]["global"]
for score_name in current_global:
    cu_score = current_global[score_name]
    if score_name in old_global:
        ol_score = old_global[score_name]
    else:
        ol_score = 'not_computed'
    print(f"{score_name};{cu_score};{ol_score}")
The second option may be simpler:
Just change the code here:
What do you think?
Hi @yoavkatz, I think we can also offer all the options we have, explain the pros and cons, and let the user choose whatever they want.
We want to make it simple and backward compatible; later we can change it. So we suggest:
1) Have a flag in the metric, override_score_with_ci_mid, which for now will only be set to true in Rouge.
2) Change the above code to:

result[f"{full_score_name}_ci_low"] = ci.low
result[f"{full_score_name}_ci_high"] = ci.high
if self.override_score_with_ci_mid:
    result[full_score_name] = ci.mid
if score_name == self.main_score:
    result["score_ci_low"] = ci.low
    result["score_ci_high"] = ci.high
    if self.override_score_with_ci_mid:
        result["score"] = ci.mid

3) Set n_resamples to 1000 in Rouge.
Coming up. Set n_resamples to 1000, to match what HF uses?
Yes. That's the default there.
Hi @yoav, I am pushing a PR for you to see. I am rerunning the inference I used during the day to compare. Still not the same scores. Looking into it.
Addressed in PR #1084, and concluded with a decision to maintain the existing implementation: https://github.com/IBM/unitxt/pull/1084#issuecomment-2267440729