Support for a tokenizer parameter in metrics

Open assaftibm opened this issue 2 years ago • 1 comments

Some metrics, such as ROUGE, accept a tokenizer parameter to better support non-English languages. It would be helpful to expose this option in the metric definitions.

https://discuss.huggingface.co/t/which-tokenizer-does-rouge-metric-uses-under-the-hood/19903

https://github.com/google-research/google-research/blob/e3d00617cb28064b6e96ab4e2485079f0ca5a763/rouge/rouge_scorer.py#L60
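To illustrate why the tokenizer matters, here is a minimal, stdlib-only sketch. It assumes the default tokenization keeps only lowercase ASCII letters and digits, which is how I read the linked `rouge_scorer` code. `AsciiTokenizer`, `WhitespaceTokenizer`, and `rouge1_f` are hypothetical names made up for this example, and `rouge1_f` is a simplified unigram-overlap F-measure, not the library's implementation.

```python
import re
from collections import Counter


class AsciiTokenizer:
    """Hypothetical stand-in for an ASCII-only default: keeps [a-z0-9] runs."""

    def tokenize(self, text):
        return re.findall(r"[a-z0-9]+", text.lower())


class WhitespaceTokenizer:
    """Hypothetical language-agnostic tokenizer: lowercases, splits on whitespace."""

    def tokenize(self, text):
        return text.lower().split()


def rouge1_f(reference, prediction, tokenizer):
    """Simplified ROUGE-1: unigram-overlap F-measure with a pluggable tokenizer."""
    ref = Counter(tokenizer.tokenize(reference))
    pred = Counter(tokenizer.tokenize(prediction))
    overlap = sum((ref & pred).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Russian example: three of the four words match.
reference = "кошка сидит на ковре"
prediction = "кошка сидит на полу"

ascii_score = rouge1_f(reference, prediction, AsciiTokenizer())
whitespace_score = rouge1_f(reference, prediction, WhitespaceTokenizer())
print(ascii_score, whitespace_score)
```

With the ASCII-only tokenizer every Cyrillic token is dropped, so the score collapses to zero even though most words match. As far as I can tell, an object with a `tokenize(text)` method like the ones above is the kind of thing `RougeScorer` takes as its `tokenizer` argument, so exposing the parameter in the metric would let callers plug in a language-appropriate tokenizer.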

cc: @perlitz @yoavkatz @gitMichal

assaftibm, Nov 01 '23 12:11

I also came across this implementation from the authors of XL-Sum:

https://github.com/csebuetnlp/xl-sum/tree/master/multilingual_rouge_scoring

Also, in the meeting with Hans' team, they said that we can use ROUGE as is (with the tokenizer) and that stemming is not needed. The results will be lower, but we only care about the comparison (not the absolute values), so it should be fine.

gitMichal, Nov 02 '23 10:11