Some datasets for languages.

Open x-tabdeveloping opened this issue 10 months ago • 7 comments

I'm gonna practice drums for the rest of the day and probably won't work tomorrow, but for those who are looking to contribute and get some of those juicy points here is some low-hanging fruit in diverse languages:

Slovak:

  • ~~Sentiment: https://huggingface.co/datasets/sepidmnorozy/Slovak_sentiment~~ (as a matter of fact she has loads of Sentiment classification datasets: https://huggingface.co/sepidmnorozy)
  • News Summarization: https://huggingface.co/datasets/kiviki/SlovakSum

Greek:

  • ~~Legal code clustering: https://huggingface.co/datasets/AI-team-UoA/greek_legal_code~~
  • NLI: https://huggingface.co/datasets/Harsit/xnli2.0_greek
  • Medical QA: https://huggingface.co/datasets/ilsp/medical_mcqa_greek

Maltese:

  • News titles: https://huggingface.co/datasets/MLRS/maltese_news_headlines
  • News categories: https://huggingface.co/datasets/MLRS/maltese_news_categories

x-tabdeveloping avatar Apr 18 '24 10:04 x-tabdeveloping

I'm gonna pick up kiviki/SlovakSum if no one is on it yet.

dokato avatar Apr 22 '24 16:04 dokato

On the other hand it seems like the summary task requires:

        human_summaries: list[str]
        machine_summaries: list[str]
        relevance: list[float] (the score of the machine generated summaries)

and kiviki/SlovakSum has neither machine_summaries nor relevance scores.
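
For anyone checking other datasets against the same requirement, here is a minimal sketch that compares a dataset's columns with the three fields listed above. The example column names (`text`, `summary`) are placeholders I made up for illustration, not the actual SlovakSum schema.

```python
# Fields the summarization task expects, as listed above.
REQUIRED_FIELDS = {"human_summaries", "machine_summaries", "relevance"}


def missing_fields(columns):
    """Return the required fields absent from `columns`, in sorted order."""
    return sorted(REQUIRED_FIELDS - set(columns))


# Hypothetical columns of an article/summary dataset (assumed, not verified).
example_columns = ["text", "summary"]
print(missing_fields(example_columns))
```

Note that this only compares names: a dataset can contain human-written summaries under a different column name and still be reported as missing `human_summaries`.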

dokato avatar Apr 24 '24 09:04 dokato

@dokato Try formulating it as a retrieval task instead :))
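
Roughly, that could look like the sketch below: each summary becomes a query, each article becomes a corpus document, and the only relevant document for a summary is the article it was written for. The column names (`text`, `summary`), the id scheme, and the `to_retrieval` helper are all assumptions for illustration. The `queries` / `corpus` / `relevant_docs` layout follows the usual retrieval convention, so check it against the task base class before relying on it.

```python
def to_retrieval(records, text_key="text", summary_key="summary"):
    """Turn (article, summary) pairs into queries, corpus and relevance maps.

    `text_key` and `summary_key` are hypothetical column names.
    """
    queries, corpus, relevant_docs = {}, {}, {}
    for i, rec in enumerate(records):
        qid, did = f"q{i}", f"d{i}"
        queries[qid] = rec[summary_key]        # the summary is the query
        corpus[did] = {"text": rec[text_key]}  # the article is the document
        relevant_docs[qid] = {did: 1}          # one positive per query
    return queries, corpus, relevant_docs


# Toy records standing in for real dataset rows.
toy = [
    {"text": "Long article about the elections ...", "summary": "Election recap"},
    {"text": "Long article about the weather ...", "summary": "Storm warning"},
]
queries, corpus, relevant_docs = to_retrieval(toy)
print(relevant_docs)
```

One design caveat: if two articles share near-identical summaries, a one-positive-per-query setup will penalise retrieving the other article, so deduplicating first may be worth it.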

x-tabdeveloping avatar Apr 24 '24 11:04 x-tabdeveloping

I can start working on the Maltese datasets if no one else is.

wissam-sib avatar May 08 '24 11:05 wissam-sib

@wissam-sib Please verify that no one has added them yet or is working on a PR, otherwise feel free to go ahead :D

x-tabdeveloping avatar May 08 '24 12:05 x-tabdeveloping

News categories is being added so I'm gonna go for the NLI one

wissam-sib avatar May 08 '24 12:05 wissam-sib

I will take care of Greek medical QA: https://huggingface.co/datasets/ilsp/medical_mcqa_greek

mariyahendriksen avatar May 20 '24 16:05 mariyahendriksen

Will close this issue for now. I assume many of these are still relevant to add; if so, we should probably create separate PRs for them.

@mariyahendriksen do you still want to add the Greek medical QA?

KennethEnevoldsen avatar Sep 09 '24 15:09 KennethEnevoldsen