Change xlsum.py to run on all languages (remove the if lang == langs[0]: guard)
Run python prepare/cards/xlsum.py
Traceback (most recent call last):
File "/home/runner/work/unitxt/unitxt/tests/test_preperation.py", line 47, in test_preprations
import_module_from_file(file)
File "/home/runner/work/unitxt/unitxt/tests/test_preperation.py", line 27, in import_module_from_file
spec.loader.exec_module(module)
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed
File "/home/runner/work/unitxt/unitxt/prepare/cards/xlsum.py", line 42, in
test_card(card, debug=False)
File "/home/runner/work/unitxt/unitxt/src/unitxt/test_utils/card.py", line 238, in test_card
test_with_eval(
File "/home/runner/work/unitxt/unitxt/src/unitxt/test_utils/card.py", line 184, in test_with_eval
raise AssertionError(error_message)
AssertionError: The results of running the main metric in used in the card (rougeL) over simulated predictions that are equal to the references returns a different score than expected.
One would expect a perfect score of 1.0 in this case, but returned metric score was 0.0.
This usually indicates an error in the metric or post processors, but can be also an acceptable edge case.
In anycase, this requires a review. If this is acceptable, set strict=False in the call to test_card().
The predictions passed to the metrics were:
['…', '…', '…'] (three Ethiopic-script summaries; the text is garbled by a character-encoding problem in this paste)
My 2 cents, having dug some:
Rouge applies its default tokenizer as the first step of computing the score.
When lang = nepali, for example, no tokens are identified at all, neither in the prediction nor in the target, so the score is 0 for all three examples.
When lang = marathi (!!) the tokenizer gladly latches onto a '22' it finds in the string, so target and prediction are both 'tokenized' to ['22'], and that counts as a hit. Being one of 3 examples, that makes the final score 0.3333 for this language.
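To see this outside of unitxt, here is a minimal sketch, assuming the metric bottoms out in Google's rouge_score package (the backend of the HF evaluate rouge metric); the Devanagari sentences are made up for illustration, not taken from XL-Sum:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"])

# Devanagari-only text: the default tokenizer lowercases and keeps only
# [a-z0-9] runs, so both sides tokenize to [] and even an identical
# prediction scores 0.
nepali = "नेपालमा आज ठूलो वर्षा भयो"
print(scorer.score(nepali, nepali)["rougeL"].fmeasure)  # 0 (no tokens survive)

# The same situation with a digit in the text: '22' is the only surviving
# token on both sides, so an identical prediction now scores a "perfect" 1.0.
marathi = "पुण्यात 22 जण जखमी झाले"
print(scorer.score(marathi, marathi)["rougeL"].fmeasure)  # 1.0

One such '22' hit averaged with two zero-token examples gives exactly the 0.3333 above.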
I am not sure how to recognize the language automatically, whether we should just rely on the language we already know in this case, or where to pull an adequate tokenizer from. A whitespace tokenizer does some of the work, but that is only to prove the concept (sketch below).
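For the whitespace proof of concept — a sketch only, assuming a recent rouge_score that accepts a tokenizer argument (any object with a tokenize(text) method); this is not a proposal for the tokenizer unitxt should ultimately use:

from rouge_score import rouge_scorer


class WhitespaceTokenizer:
    # Naive: split on whitespace, no lowercasing, no stemming, no script handling.
    def tokenize(self, text):
        return text.split()


scorer = rouge_scorer.RougeScorer(["rougeL"], tokenizer=WhitespaceTokenizer())

nepali = "नेपालमा आज ठूलो वर्षा भयो"
print(scorer.score(nepali, nepali)["rougeL"].fmeasure)  # 1.0 instead of 0

If the call path goes through HF evaluate, its rouge metric exposes a similar tokenizer hook (a plain callable), though I have not checked how unitxt's wrapper would pass it along.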