Support for a tokenizer parameter in metrics

Open assaftibm opened this issue 2 years ago • 1 comments

Some metrics, such as Rouge, accept a tokenizer parameter that improves support for non-English languages. It would be helpful to expose this option.

https://discuss.huggingface.co/t/which-tokenizer-does-rouge-metric-uses-under-the-hood/19903

https://github.com/google-research/google-research/blob/e3d00617cb28064b6e96ab4e2485079f0ca5a763/rouge/rouge_scorer.py#L60
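To illustrate why this matters, here is a minimal, self-contained sketch of a ROUGE-1 F-score with a pluggable tokenizer. It is not the library code: `rouge1_f`, `ascii_tokenizer` and `unicode_tokenizer` are hypothetical names, and the tokenizer is passed as a plain callable for brevity. `ascii_tokenizer` only approximates the default behaviour (lowercase, keep `[a-z0-9]`), which as far as I can tell drops every token in non-Latin text.

```python
import re
from collections import Counter
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def ascii_tokenizer(text: str) -> List[str]:
    # Approximation of the default behaviour (assumption): lowercase and
    # treat everything outside [a-z0-9] as a separator.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def unicode_tokenizer(text: str) -> List[str]:
    # Hypothetical replacement: keep runs of Unicode word characters.
    return re.findall(r"\w+", text.lower())


def rouge1_f(reference: str, prediction: str,
             tokenizer: Tokenizer = ascii_tokenizer) -> float:
    # Unigram overlap F-measure; no stemming is applied.
    ref = Counter(tokenizer(reference))
    pred = Counter(tokenizer(prediction))
    overlap = sum((ref & pred).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "кошка сидит на коврике"
prediction = "кошка лежит на коврике"
print(rouge1_f(reference, prediction))                     # ASCII-only tokenizer
print(rouge1_f(reference, prediction, unicode_tokenizer))  # Unicode-aware tokenizer
```

With the ASCII-only tokenizer both strings tokenize to an empty list, so the score collapses to zero regardless of how similar the texts are. Exposing the tokenizer option would let users swap in something language-appropriate.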

cc: @perlitz @yoavkatz @gitMichal

Nov 01 '23 12:11 assaftibm

I also came across this implementation from the authors of XL-Sum:

https://github.com/csebuetnlp/xl-sum/tree/master/multilingual_rouge_scoring

Also, in the meeting with Hans' team, they said we can use Rouge as is (with the tokenizer), with no need for stemming. The results will be lower, but we only care about relative comparison (not absolute values), so that should be fine.

Nov 02 '23 10:11 gitMichal