lm-evaluation-harness

Add multilingual tokenization for ROUGE

Open • jon-tow opened this issue 2 years ago • 1 comment

  • Adds support for multilingual ROUGE scoring by providing language-specific tokenization via nltk.

  • Adds a code_to_pycountry_lang utility that maps ISO codes to pycountry.db.Language objects for robust language name parsing.

  • Removes rougeLsum from the default rouge_types argument: sentences are not separated by newlines, which breaks the assumption rouge_scorer relies on when computing rougeLsum.
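
To illustrate the language-name parsing step, here is a minimal sketch of turning an ISO code into the lowercase English language name that nltk's tokenizers accept through their `language` argument (for example "german"). It is not the PR's code_to_pycountry_lang, which returns pycountry.db.Language objects; the helper name `iso_code_to_nltk_language` and the small fallback table are assumptions made for this example only.

```python
# Hypothetical helper for illustration; not the PR's code_to_pycountry_lang.
try:
    import pycountry  # optional third-party dependency
except ImportError:
    pycountry = None

# Tiny fallback table, used only when pycountry is not installed.
_FALLBACK_NAMES = {
    "en": "English", "eng": "English",
    "de": "German", "deu": "German",
    "fr": "French", "fra": "French",
    "es": "Spanish", "spa": "Spanish",
}


def iso_code_to_nltk_language(code: str) -> str:
    """Map an ISO 639-1 or 639-3 code to a lowercase English language name."""
    code = code.strip().lower()
    if pycountry is not None:
        # Two-letter codes are ISO 639-1 (alpha_2); three-letter are ISO 639-3 (alpha_3).
        key = "alpha_2" if len(code) == 2 else "alpha_3"
        try:
            lang = pycountry.languages.get(**{key: code})
        except KeyError:  # older pycountry releases raise instead of returning None
            lang = None
        name = lang.name if lang is not None else None
    else:
        name = _FALLBACK_NAMES.get(code)
    if name is None:
        raise ValueError(f"Unknown language code: {code!r}")
    return name.lower()
```

The returned string could then be passed as the `language` argument of nltk.word_tokenize. Note that some pycountry names (such as the one for Modern Greek) do not match nltk's model names directly, so a real implementation would likely need extra normalization.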

TODO

  • Add sentence-level tokenization (possibly with nltk.sent_tokenize?). As mentioned above, rouge-score==0.0.4 (the latest package release) expects sentences to be split by newlines when computing the rougeLsum score. The latest version on their master branch supports automatic sentence splitting. Unfortunately, that repo is not pip-installable: a module named tokenize.py at the project root overrides a module of the same name in pip's setuptools dependency, which breaks the installation.

  • Find a clean abstraction for tagging non-English PromptSourceTasks with their language. This tag could then be used to construct the multilingual NltkWordTokenizer that gets passed into rouge and other metrics that may need multilingual support in the future. Possibly use promptsource's language tagging: https://github.com/bigscience-workshop/promptsource/pull/771
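
As a rough illustration of the first TODO item, the sketch below puts each sentence on its own line so that a newline-based rougeLsum implementation sees the sentence boundaries. The regex splitter is a crude stand-in for nltk.sent_tokenize, and the function names `split_sentences` and `prepare_for_rouge_lsum` are hypothetical, chosen for this example.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace.
# A crude stand-in for nltk.sent_tokenize; it mishandles abbreviations such as "e.g.".
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list:
    """Return the sentences of `text`, dropping empty fragments."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


def prepare_for_rouge_lsum(text: str) -> str:
    """Join sentences with newlines, the separator rougeLsum expects."""
    return "\n".join(split_sentences(text))


if __name__ == "__main__":
    print(prepare_for_rouge_lsum("The cat sat. The dog ran! Did it rain?"))
```

Both the prediction and the reference would need the same preprocessing before scoring. Swapping the regex for nltk.sent_tokenize with a `language` argument would make the split language-aware, which is where the language tag from the second TODO item could be used.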

jon-tow • Jun 01 '22 21:06