Add multilingual tokenization for ROUGE
- Adds support for multilingual ROUGE scoring by providing language-specific tokenization via `nltk` (see the sketch after this list).
- Adds a `code_to_pycountry_lang` utility that maps ISO codes to `pycountry.db.Language` objects for robust language name parsing (sketched below).
- Removes `rougeLsum` from the default `rouge_types` arg, since sentences are not separated by newlines, which breaks the `rouge_scorer` assumption (demonstrated below).
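
For reference, a minimal sketch of what the language-specific tokenizer could look like. The class name `NltkWordTokenizer` comes from this PR, but the exact interface (a plain `tokenize` method returning a token list) is an assumption on my part:

```python
import nltk

nltk.download("punkt", quiet=True)  # word_tokenize needs the Punkt models


class NltkWordTokenizer:
    """Word-tokenizes text with nltk using a language-specific model."""

    def __init__(self, language: str = "english"):
        # nltk expects the lowercase English name of the language,
        # e.g. "english", "german", "french".
        self.language = language

    def tokenize(self, text: str) -> list:
        return nltk.tokenize.word_tokenize(text, language=self.language)


tokenizer = NltkWordTokenizer(language="german")
print(tokenizer.tokenize("Das Pferd frisst keinen Gurkensalat."))
# ['Das', 'Pferd', 'frisst', 'keinen', 'Gurkensalat', '.']
```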
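Likewise, a hedged sketch of what `code_to_pycountry_lang` might boil down to, using `pycountry.languages.lookup` (the actual implementation in this PR may differ):

```python
import pycountry


def code_to_pycountry_lang(code: str):
    """Map an ISO 639 code (e.g. "en" or "eng") to a pycountry.db.Language."""
    # lookup() matches alpha-2 and alpha-3 codes as well as full
    # language names, raising LookupError for unknown inputs.
    return pycountry.languages.lookup(code)


lang = code_to_pycountry_lang("de")
print(lang.name, lang.alpha_2, lang.alpha_3)  # German de deu
```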
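And to illustrate the `rougeLsum` newline assumption, a quick demonstration against `rouge-score==0.0.4` showing how the score depends on whether sentences are newline-separated:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeLsum"])

target = "The cat sat. The dog ran."
prediction = "The dog ran. The cat sat."

# Without newlines, each text counts as one "sentence", so rougeLsum
# falls back to a plain LCS over the full token sequences (0.5 here).
print(scorer.score(target, prediction)["rougeLsum"].fmeasure)

# With one sentence per line, the union-LCS is computed over sentence
# pairs as intended, and the reordered summary scores 1.0.
target_nl = "The cat sat.\nThe dog ran."
prediction_nl = "The dog ran.\nThe cat sat."
print(scorer.score(target_nl, prediction_nl)["rougeLsum"].fmeasure)
```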
TODO
- Add sentence-level tokenization (possibly using `nltk.sent_tokenize`? See the sketch after this list). As mentioned above, `rouge-score==0.0.4` (the latest package release) expects sentences to be split by newlines to compute the `rougeLsum` score. The latest version on their master branch contains automatic sentence-splitting support. Unfortunately, that repo is not pip-installable because a module named `tokenize.py` at the project root overrides a module of the same name in pip's `setuptools` dependency, breaking the installation.
- Find a clean abstraction for tagging non-English `PromptSourceTask`s with their language. This tag could then be used to construct the multilingual `NltkWordTokenizer` that gets passed into ROUGE and other metrics that may need multilingual support in the future (one possible shape is sketched below). Possibly use `promptsource`'s language tagging: https://github.com/bigscience-workshop/promptsource/pull/771
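
As a sketch of the first item, the newline assumption can be worked around by splitting with `nltk.sent_tokenize` and re-joining with newlines before scoring (`prepare_for_rouge_lsum` is a hypothetical helper name, not anything in this PR):

```python
import nltk
from rouge_score import rouge_scorer

nltk.download("punkt", quiet=True)


def prepare_for_rouge_lsum(text: str, language: str = "english") -> str:
    """Put one sentence per line so rougeLsum sees sentence boundaries."""
    return "\n".join(nltk.sent_tokenize(text, language=language))


scorer = rouge_scorer.RougeScorer(["rougeLsum"])
score = scorer.score(
    prepare_for_rouge_lsum("The cat sat. The dog ran."),
    prepare_for_rouge_lsum("The dog ran. The cat sat."),
)
print(score["rougeLsum"].fmeasure)  # 1.0 once sentences are split
```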
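And one possible, purely hypothetical shape for the second item, reusing the `NltkWordTokenizer` and `code_to_pycountry_lang` sketches above; nothing here reflects an actual lm-evaluation-harness API:

```python
class MultilingualPromptSourceTask:
    # Hypothetical: tasks carry an ISO 639-1 code that subclasses
    # override, possibly populated from promptsource's language tags.
    LANGUAGE_CODE: str = "en"

    def word_tokenizer(self) -> NltkWordTokenizer:
        lang = code_to_pycountry_lang(self.LANGUAGE_CODE)
        # nltk wants lowercase English language names, e.g. "german".
        return NltkWordTokenizer(language=lang.name.lower())
```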
Can we still use the current ROUGE score in LMEVAL for languages that don't use spaces? It seems to me like PaLM (https://arxiv.org/pdf/2204.02311.pdf) used it for many languages other than English.
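
If I read `rouge-score`'s default tokenizer correctly, it keeps only ASCII `[a-z0-9]` runs, so text in non-Latin scripts tokenizes to an empty list and every score collapses to zero; a quick check (assuming `rouge-score==0.0.4`):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

# Even identical Chinese strings score 0, because the default tokenizer
# discards every non-[a-z0-9] character and no tokens survive.
print(scorer.score("猫坐在垫子上", "猫坐在垫子上"))
# {'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0)}
```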
Also related: ROUGE scores are 0-1 and BLEU scores are 0-100 in LMEVAL, right?