Add multilingual tokenization for ROUGE
- Adds support for multilingual ROUGE scoring by providing language-specific tokenization via `nltk` (see the sketch after this list).
- Adds a `code_to_pycountry_lang` utility that maps ISO codes to `pycountry.db.Language` objects for robust language name parsing (sketched below).
- Removes `rougeLsum` from the default `rouge_types` arg, since sentences are not separated by newlines, which breaks the `rouge_scorer` assumption (demonstrated below).
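
For reference, a minimal sketch of what the language-specific tokenizer could look like. The class name `NltkWordTokenizer` comes from this PR, but the exact interface (a plain `tokenize` method returning a token list) is an assumption on my part:

```python
import nltk

nltk.download("punkt", quiet=True)  # word_tokenize needs the Punkt models


class NltkWordTokenizer:
    """Word-tokenizes text with nltk using a language-specific model."""

    def __init__(self, language: str = "english"):
        # nltk expects the lowercase English name of the language,
        # e.g. "english", "german", "french".
        self.language = language

    def tokenize(self, text: str) -> list:
        return nltk.tokenize.word_tokenize(text, language=self.language)


tokenizer = NltkWordTokenizer(language="german")
print(tokenizer.tokenize("Das Pferd frisst keinen Gurkensalat."))
# ['Das', 'Pferd', 'frisst', 'keinen', 'Gurkensalat', '.']
```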
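Likewise, a hedged sketch of what `code_to_pycountry_lang` might boil down to, using `pycountry.languages.lookup` (the actual implementation in this PR may differ):

```python
import pycountry


def code_to_pycountry_lang(code: str):
    """Map an ISO 639 code (e.g. "en" or "eng") to a pycountry.db.Language."""
    # lookup() matches alpha-2 and alpha-3 codes as well as full
    # language names, raising LookupError for unknown inputs.
    return pycountry.languages.lookup(code)


lang = code_to_pycountry_lang("de")
print(lang.name, lang.alpha_2, lang.alpha_3)  # German de deu
```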
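And to illustrate the `rougeLsum` newline assumption, a quick demonstration against `rouge-score==0.0.4` showing how the score depends on whether sentences are newline-separated:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeLsum"])

target = "The cat sat. The dog ran."
prediction = "The dog ran. The cat sat."

# Without newlines, each text counts as one "sentence", so rougeLsum
# falls back to a plain LCS over the full token sequences (0.5 here).
print(scorer.score(target, prediction)["rougeLsum"].fmeasure)

# With one sentence per line, the union-LCS is computed over sentence
# pairs as intended, and the reordered summary scores 1.0.
target_nl = "The cat sat.\nThe dog ran."
prediction_nl = "The dog ran.\nThe cat sat."
print(scorer.score(target_nl, prediction_nl)["rougeLsum"].fmeasure)
```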
TODO
- Add sentence-level tokenization (possibly using `nltk.sent_tokenize`? See the sketch after this list). As mentioned above, `rouge-score==0.0.4` (the latest package release) expects sentences to be split by newlines to compute the `rougeLsum` score. The latest version on their master branch contains automatic sentence-splitting support. Unfortunately, that repo is not pip-installable because a module named `tokenize.py` at the project root overrides a module of the same name in pip's `setuptools` dependency, breaking the installation.
- Find a clean abstraction for tagging non-English `PromptSourceTask`s with their language. This tag could then be used to construct the multilingual `NltkWordTokenizer` that gets passed into ROUGE and other metrics that may need multilingual support in the future (one possible shape is sketched below). Possibly use `promptsource`'s language tagging: https://github.com/bigscience-workshop/promptsource/pull/771
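
As a sketch of the first item, the newline assumption can be worked around by splitting with `nltk.sent_tokenize` and re-joining with newlines before scoring (`prepare_for_rouge_lsum` is a hypothetical helper name, not anything in this PR):

```python
import nltk
from rouge_score import rouge_scorer

nltk.download("punkt", quiet=True)


def prepare_for_rouge_lsum(text: str, language: str = "english") -> str:
    """Put one sentence per line so rougeLsum sees sentence boundaries."""
    return "\n".join(nltk.sent_tokenize(text, language=language))


scorer = rouge_scorer.RougeScorer(["rougeLsum"])
score = scorer.score(
    prepare_for_rouge_lsum("The cat sat. The dog ran."),
    prepare_for_rouge_lsum("The dog ran. The cat sat."),
)
print(score["rougeLsum"].fmeasure)  # 1.0 once sentences are split
```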
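And one possible, purely hypothetical shape for the second item, reusing the `NltkWordTokenizer` and `code_to_pycountry_lang` sketches above; nothing here reflects an actual lm-evaluation-harness API:

```python
class MultilingualPromptSourceTask:
    # Hypothetical: tasks carry an ISO 639-1 code that subclasses
    # override, possibly populated from promptsource's language tags.
    LANGUAGE_CODE: str = "en"

    def word_tokenizer(self) -> NltkWordTokenizer:
        lang = code_to_pycountry_lang(self.LANGUAGE_CODE)
        # nltk wants lowercase English language names, e.g. "german".
        return NltkWordTokenizer(language=lang.name.lower())
```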
Can we still use the current ROUGE score in LMEVAL for languages that don't use spaces? It seems to me like PaLM (https://arxiv.org/pdf/2204.02311.pdf) used it for many languages other than English.
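
If I read `rouge-score`'s default tokenizer correctly, it keeps only ASCII `[a-z0-9]` runs, so text in non-Latin scripts tokenizes to an empty list and every score collapses to zero; a quick check (assuming `rouge-score==0.0.4`):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

# Even identical Chinese strings score 0, because the default tokenizer
# discards every non-[a-z0-9] character and no tokens survive.
print(scorer.score("猫坐在垫子上", "猫坐在垫子上"))
# {'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0)}
```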
Also related: ROUGE scores are 0-1 and BLEU scores are 0-100 in LMEVAL, right?