
Rouge score not backward compatible as recall and precision are no longer returned

Open AndreaSottana opened this issue 2 years ago • 2 comments

Hello

I have seen that in this PR https://github.com/huggingface/evaluate/pull/158 you have removed recall and precision from the ROUGE score calculation, which now only returns the F1 score. May I ask why this decision was made, and why there doesn't seem to be an option to keep recall and precision in the returned output?

This is also a breaking change (in the sense that code written for evaluate==0.1.2 will no longer work with evaluate==0.2.2). Shouldn't a backward-incompatible change require a major version bump according to https://semver.org?
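For context on what was dropped: ROUGE-N recall, precision, and F1 are all derived from the same n-gram overlap count, so returning only F1 discards information that is essentially free to compute. A simplified ROUGE-1 sketch (whitespace tokenization only; the underlying `rouge_score` package that `evaluate` wraps also handles stemming and longer n-grams):

```python
from collections import Counter

def rouge1(prediction: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap with whitespace tokenization.

    Illustrative only -- shows that precision, recall, and F1 all come
    from a single overlap count between prediction and reference.
    """
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each unigram counts at most as often as it
    # appears in the other text.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f1}

scores = rouge1("the cat is on the mat", "the cat sat on the mat")
# Both texts have six tokens and five overlap, so here
# precision == recall == fmeasure == 5/6
```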

Thanks for the clarification

AndreaSottana avatar Aug 18 '22 16:08 AndreaSottana

Hi @AndreaSottana

Yes, this was a breaking change - we planned to do it before the initial release but it slipped through. There are a number of advantages to moving from the previously returned RougeScore object to a pure Python dict. If you find recall and precision useful, we could add an option (e.g. detailed=True) to the compute call to return those as well.

We haven't had a full major release yet, so there might be some breaking changes here and there, but none are planned for the core of the metrics, and we really want to avoid them.

Sorry for the inconvenience!

lvwerra avatar Aug 18 '22 17:08 lvwerra

Thanks @lvwerra for your quick reply.

I definitely agree a pure Python dictionary is much better; however, it would be possible to include recall and precision in a Python dict without reverting to the old RougeScore object. Most summarization papers report ROUGE scores based on F1, but some also report recall (for example, for content selection), so researchers would find it valuable to have an option to see recall and precision (perhaps still in a pure Python dict). I'm happy to use the older version now that I've realised the issue, but if there is more demand for this detailed=True feature, it would be worth considering in the future.
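To make the request concrete, the hypothetical detailed=True option discussed in this thread could keep the pure-dict return while nesting the three values per ROUGE variant. A sketch with made-up placeholder numbers (this is not an existing `evaluate` API):

```python
# Current (evaluate 0.2.x) output: one F-measure per ROUGE variant.
current_output = {"rouge1": 0.83, "rouge2": 0.57, "rougeL": 0.77}

# Hypothetical detailed=True output: still a plain dict, but each
# variant maps to its precision/recall/F1 triple. Values are
# illustrative placeholders, chosen so F1 is consistent with P and R.
detailed_output = {
    "rouge1": {"precision": 0.83, "recall": 0.83, "fmeasure": 0.83},
    "rouge2": {"precision": 0.55, "recall": 0.60, "fmeasure": 0.57},
    "rougeL": {"precision": 0.75, "recall": 0.80, "fmeasure": 0.77},
}

# Researchers interested in content selection could then read recall directly:
rouge1_recall = detailed_output["rouge1"]["recall"]
```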

Thanks again

AndreaSottana avatar Aug 18 '22 17:08 AndreaSottana

I was wondering whether this thread was ever taken into consideration, because with evaluate the ROUGE metric still reports only a single score rather than precision/recall/F-score.

Thanks!

hanane-djeddal avatar Sep 20 '23 13:09 hanane-djeddal