mteb
Some datasets for languages.
I'm gonna practice drums for the rest of the day and probably won't work tomorrow, but for those who are looking to contribute and get some of those juicy points, here is some low-hanging fruit in diverse languages:
Slovak:
- ~~Sentiment: https://huggingface.co/datasets/sepidmnorozy/Slovak_sentiment~~ (as a matter of fact she has loads of Sentiment classification datasets: https://huggingface.co/sepidmnorozy)
- News Summarization: https://huggingface.co/datasets/kiviki/SlovakSum
Greek:
- ~~Legal code clustering: https://huggingface.co/datasets/AI-team-UoA/greek_legal_code~~
- NLI: https://huggingface.co/datasets/Harsit/xnli2.0_greek
- Medical QA: https://huggingface.co/datasets/ilsp/medical_mcqa_greek
Maltese:
- News titles: https://huggingface.co/datasets/MLRS/maltese_news_headlines
- News categories: https://huggingface.co/datasets/MLRS/maltese_news_categories
I'm gonna pick up kiviki/SlovakSum if no one is on it yet.
On the other hand, it seems like the summary task requires:
- human_summaries: list[str]
- machine_summaries: list[str]
- relevance: list[float] (the scores of the machine-generated summaries)

and kiviki/SlovakSum has neither machine_summaries nor relevance scores.
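A quick way to check this for any candidate dataset is to compare its column names against the three fields listed above. This is only a minimal sketch: the field names come from the comment above, while the example column names (`text`, `human_summaries`) are hypothetical and stand in for whatever a news summarization dataset provides after renaming its summary column.

```python
# Fields the summary task appears to require, as listed in this thread.
REQUIRED_FIELDS = ("human_summaries", "machine_summaries", "relevance")


def missing_summary_fields(column_names):
    """Return the required fields that are absent from column_names."""
    present = set(column_names)
    return [field for field in REQUIRED_FIELDS if field not in present]


# Hypothetical columns: an article plus a human-written summary only.
example_columns = ["text", "human_summaries"]
print(missing_summary_fields(example_columns))
# → ['machine_summaries', 'relevance']
```

An empty list would mean the dataset has everything the summary task needs; anything else suggests a different task type is a better fit.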
@dokato Try formulating it as a retrieval task instead :))
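One way to read that suggestion: treat each human-written summary as a query and its source article as the single relevant document. Below is a rough sketch of that conversion using plain dictionaries. The `text` and `summary` keys, the ID scheme, and the corpus/queries/relevance layout are all assumptions for illustration, not necessarily what mteb or the dataset actually use.

```python
def summaries_to_retrieval(rows, text_key="text", summary_key="summary"):
    """Turn (article, summary) rows into a corpus, queries, and relevance map.

    Each summary becomes a query whose only relevant document is the
    article it was written for. Key names are hypothetical defaults.
    """
    corpus, queries, relevant_docs = {}, {}, {}
    for i, row in enumerate(rows):
        doc_id = f"d{i}"
        query_id = f"q{i}"
        corpus[doc_id] = {"text": row[text_key]}
        queries[query_id] = row[summary_key]
        relevant_docs[query_id] = {doc_id: 1}  # 1 marks a relevant pair
    return corpus, queries, relevant_docs


# Toy rows standing in for real articles and summaries.
rows = [
    {"text": "Long article about the elections.", "summary": "Election recap."},
    {"text": "Long article about the flood.", "summary": "Flood report."},
]
corpus, queries, relevant_docs = summaries_to_retrieval(rows)
print(relevant_docs)
# → {'q0': {'d0': 1}, 'q1': {'d1': 1}}
```

This needs no machine summaries or relevance scores, since the relevance judgments come for free from the article-summary pairing.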
I can start working on the Maltese datasets if no one else is.
@wissam-sib Please verify that no one has added them yet or is working on a PR, otherwise feel free to go ahead :D
News categories is being added so I'm gonna go for the NLI one
I will take care of Greek medical QA: https://huggingface.co/datasets/ilsp/medical_mcqa_greek
Will close this issue for now. I assume many of these are still relevant to add; if so, we should probably create separate PRs for them.
@mariyahendriksen do you still want to add the Greek medical QA?