Rouge score not backward compatible as recall and precision are no longer returned
Hello
I have seen that in this PR https://github.com/huggingface/evaluate/pull/158 you have removed recall and precision from the ROUGE score calculation, which now only returns the F1 score. May I ask why this decision was made, and why there doesn't seem to be an option to keep recall and precision in the returned output?
This is also a breaking change (in the sense that if I have some code written for evaluate==0.1.2, it will no longer work in evaluate==0.2.2).
Shouldn't a backward-incompatible change require a major version bump according to https://semver.org?
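To make the difference concrete, this is roughly the behaviour I am seeing (the output values below are only illustrative):

```python
import evaluate

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat was sitting on the mat"],
)
print(results)
# evaluate==0.2.2: plain floats, F1 only (illustrative values):
# {'rouge1': 0.77, 'rouge2': 0.36, 'rougeL': 0.77, 'rougeLsum': 0.77}

# evaluate==0.1.2: aggregated score objects that also exposed precision
# and recall, e.g. results['rouge1'].mid.precision, results['rouge1'].mid.recall
```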
Thanks for the clarification
Hi @AndreaSottana
Yes, this was a breaking change. We planned to do it before the initial release, but it slipped through. There are a number of advantages to moving from the RougeScore object that was returned to a pure python dict. If you find recall and precision useful, we could add an option (e.g. detailed=True) to the compute call to return those as well.
We haven't had a full major release yet, so there might be some breaking changes here and there, but none are planned for the core of the metrics and we really want to avoid them.
Sorry for the inconvenience!
Thanks @lvwerra for your quick reply.
I definitely agree a pure python dictionary is much better; however, I believe it would be possible to include recall and precision in a python dict without necessarily using the old RougeScore object.
Overall, many summarization papers seem to report ROUGE scores based on F1, but some also use other scores such as recall (for example for content selection), so I believe it would be valuable for researchers to have an option to see recall and precision (maybe still in a pure python dict).
I'm happy to use the older version now that I've realised the issue, but if there is more demand for this detailed=True feature then it would be worth considering for the future.
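In the meantime, one possible workaround is to call the rouge_score package (which, as far as I can tell, the evaluate rouge metric wraps) directly, since its scorer reports precision and recall as well; a minimal sketch:

```python
# Workaround sketch: use the rouge_score package directly; its Score
# objects expose precision, recall and fmeasure for each ROUGE type.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat was sitting on the mat",
    prediction="the cat sat on the mat",
)
for rouge_type, score in scores.items():
    print(rouge_type, score.precision, score.recall, score.fmeasure)
```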
Thanks again
I was wondering if this thread was taken into consideration? With evaluate, the ROUGE metric still only reports a single score and not precision/recall/F-score.
Thanks!