rougeL returns 0 score on perfect prediction in some languages

Open · yoavkatz opened this issue 1 year ago · 1 comment

To reproduce: change `prepare/cards/xlsum.py` to run on all languages (remove the `if lang == langs[0]:` guard), then run `python prepare/cards/xlsum.py`. A sketch of the change follows below.
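For reference, a minimal sketch of the edit, assuming the script loops over a `langs` list and builds one card per language (the actual variable names and card construction in xlsum.py may differ):

```python
for lang in langs:
    card = ...  # card construction for this language, as in the script
    # Before: only the first language was exercised.
    # if lang == langs[0]:
    #     test_card(card, debug=False)
    test_card(card, debug=False)  # after: test every language
```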

```
Traceback (most recent call last):
  File "/home/runner/work/unitxt/unitxt/tests/test_preperation.py", line 47, in test_preprations
    import_module_from_file(file)
  File "/home/runner/work/unitxt/unitxt/tests/test_preperation.py", line 27, in import_module_from_file
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/runner/work/unitxt/unitxt/prepare/cards/xlsum.py", line 42, in <module>
    test_card(card, debug=False)
  File "/home/runner/work/unitxt/unitxt/src/unitxt/test_utils/card.py", line 238, in test_card
    test_with_eval(
  File "/home/runner/work/unitxt/unitxt/src/unitxt/test_utils/card.py", line 184, in test_with_eval
    raise AssertionError(error_message)
AssertionError: The results of running the main metric used in the card (rougeL) over simulated predictions that are equal to the references returns a different score than expected. One would expect a perfect score of 1.0 in this case, but the returned metric score was 0.0. This usually indicates an error in the metric or post processors, but can also be an acceptable edge case. In any case, this requires a review. If this is acceptable, set strict=False in the call to test_card(). The predictions passed to the metrics were: ['በታይዋን ከአንዲት ሴት አይን ውስጥ ዶክተሮች አራት ንቦችን አወጡ። በደሴቲቱ እንዲህ አይነት ነገር ታይቶም ተሰምቶም አይታወቅም ሲሉ ተናግረዋል።', 'ከሰሞኑ ባለቤትነታቸው የአረና ትግራይ ፓርቲ አባል ናቸው የተባሉ አስራ ስድስት ፍየሎች የመታሰራቸው ዜና የማህበራዊ ሚዲያ ተጠቃሚዎች መነጋገሪያ ሆኖ ቆይቷል።', 'የአሜሪካው ፕሬዝደንት ዶናልድ ትራምፕ ቲክ ቶክ የተሰኘው የተንቀሳቃሽ ምስሎች መጋሪያ በአሜሪካ ድርጅት ካልተገዛ ሊያግዱት እንደሚችሉ አስጠንቅቀዋል።']
```

yoavkatz · Jan 01 '24 10:01

My 2 cents, having dug in a bit: Rouge runs its default tokenizer as the first step of computing the score, and that tokenizer keeps only ASCII alphanumeric characters. For lang = nepali, for example, no token is identified, neither in the prediction nor in the target, so the score is 0 for all three examples. For lang = marathi (!!) it gladly jumps on a '22' encountered in the string; target and prediction are then both 'tokenized' to ['22'], and that counts as a hit! Since that is one of 3 examples, the final score is 0.3333 for this language.
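To make this concrete, here is a small repro against the rouge_score package directly (the Devanagari sample strings below are made up for illustration; unitxt's rougeL wraps this package, to the best of my understanding):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

# Identical Devanagari prediction and reference: the default tokenizer
# strips everything outside [a-z0-9], so no tokens survive and the
# score is 0.0 despite a perfect match.
nepali = "यो एउटा परीक्षण वाक्य हो"
print(scorer.score(nepali, nepali))  # rougeL fmeasure = 0.0

# The same situation with a stray number: only '22' survives
# tokenization, so this identical pair scores 1.0 on that one token.
marathi_like = "ही एक चाचणी आहे 22"
print(scorer.score(marathi_like, marathi_like))  # rougeL fmeasure = 1.0
```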

I am not sure how to automatically recognize the language (or whether to just use the language that is already known in this case), nor where to pull an adequate tokenizer from. A whitespace tokenizer does some of the work, but that is just a proof of concept (sketch below)..
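A minimal proof-of-concept sketch, assuming a rouge_score version whose RougeScorer accepts a custom `tokenizer` argument (an object with a `tokenize(text)` method; older releases lack this parameter):

```python
from rouge_score import rouge_scorer

class WhitespaceTokenizer:
    """Naive tokenizer: split on whitespace so non-Latin tokens survive."""
    def tokenize(self, text):
        return text.split()

# Assumption: the installed rouge_score exposes the `tokenizer` parameter.
scorer = rouge_scorer.RougeScorer(
    ["rougeL"], use_stemmer=False, tokenizer=WhitespaceTokenizer()
)

# First Amharic prediction from the traceback above, scored against itself.
text = "በታይዋን ከአንዲት ሴት አይን ውስጥ ዶክተሮች አራት ንቦችን አወጡ። በደሴቲቱ እንዲህ አይነት ነገር ታይቶም ተሰምቶም አይታወቅም ሲሉ ተናግረዋል።"
print(scorer.score(text, text))  # rougeL fmeasure is now 1.0, as expected
```

This is only a concept check: punctuation stays glued to tokens, so proper multilingual support would likely need a real per-language tokenizer.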

dafnapension · Jan 07 '24 17:01