
Adding German to MTEB

Open achibb opened this issue 1 year ago • 12 comments

Hi everybody,

I think it would be great to add German to the Benchmark as well, in a broader manner.

I would like to help; if someone could tell me where support is needed, I would be happy to contribute! Are the present/proposed datasets enough, or should I machine-translate some?

These datasets are already here:

Classification:

  • AmazonCounterfactualClassification
  • AmazonPolarityClassification - missing (3.4 million train, 400k test)
  • AmazonReviewsClassification
  • Banking77Classification - missing (10k rows train)
  • EmotionClassification - missing (16k rows train)
  • ImdbClassification - missing (25k rows train and test each)
  • MassiveIntentClassification
  • MassiveScenarioClassification
  • MTOPDomainClassification
  • MTOPIntentClassification
  • ToxicConversationsClassification - missing (50k train/test each)
  • TweetSentimentExtractionClassification - missing; this could be a replacement

Clustering:

  • XNLI German split (one could use this)
  • TwitterURLCorpus - could be translated
  • MMarcoReranking

Reranking: tba

STS:

  • STS22
  • STS-B - (MT)
  • STS17 - could also be MT for the ~250 rows

Would love to have feedback!

achibb avatar Dec 19 '23 15:12 achibb

Yeah that'd be great. I'd be happy to add an Overall German tab once we have ~30 datasets (https://huggingface.co/spaces/mteb/leaderboard)

Note that for Clustering there are some German datasets already thanks to @slvnwhrl who may also be interested in helping out with this effort.

We should aim to minimize MT and use as many human-written datasets as possible, I think. Some datasets from https://github.com/embeddings-benchmark/mteb/pull/174 may also be available in German

Muennighoff avatar Dec 19 '23 15:12 Muennighoff

Hi,

@Muennighoff thanks for including me. I'd be happy to help. And I agree, there should be enough German open-source datasets out there, at least for some of the tasks. To give some suggestions:

Classification:

Reranking:

These are some German datasets that come to my mind at the moment. I am sure there are more, although for some tasks it might be harder to find good datasets. I also haven't checked all of the licenses of the listed datasets. This repository could also be of help: German-NLP

slvnwhrl avatar Dec 20 '23 07:12 slvnwhrl

Hey there :)... currently working on implementing a retrieval benchmark based on GermanQuAD. Will publish a PR soon and keep you updated here. If you want to chat about it/join the discussion, here's a link to the DiscoResearch discord: https://discord.gg/FBvnqsDS

rasdani avatar Jan 01 '24 13:01 rasdani

Working on it here: https://github.com/DiscoResearch/mteb/tree/germanquad-retrieval

Here are first results for intfloat/multilingual-e5-small on the test split of deepset/germanquad.

INFO:root:MRR@1: 0.8720
INFO:root:MRR@3: 0.9091
INFO:root:MRR@5: 0.9130
INFO:root:MRR@10: 0.9139
INFO:root:MRR@100: 0.9149
INFO:root:MRR@1000: 0.9149
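For reference, MRR@k averages the reciprocal rank of the first relevant document over all queries, counting 0 when no relevant document appears in the top k. A minimal sketch with hypothetical ranked results (not the actual GermanQuAD data):

```python
def mrr_at_k(ranked_relevance, k):
    """Mean reciprocal rank at cutoff k.

    ranked_relevance: one list of booleans per query, ordered by
    retrieval rank (True = relevant document at that position).
    """
    total = 0.0
    for ranking in ranked_relevance:
        for rank, is_relevant in enumerate(ranking[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)

# Three hypothetical queries: relevant doc at rank 1, at rank 3,
# and not retrieved within the top 5 at all.
results = [
    [True, False, False, False, False],
    [False, False, True, False, False],
    [False, False, False, False, False],
]
print(round(mrr_at_k(results, 5), 4))  # (1 + 1/3 + 0) / 3 ≈ 0.4444
```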

Are the scores on the actual HF leaderboard multiplied by 100 or why are they in the range 0-100? 🤔

And I couldn't find the code for the actual HF space. Is it not open source?

rasdani avatar Jan 04 '24 19:01 rasdani

Great! Yes they are multiplied by 100 to be from 0-100 in order to make it more readable :) Everything is open-source - Do you mean this code https://huggingface.co/spaces/mteb/leaderboard/blob/main/app.py ?

Muennighoff avatar Jan 04 '24 20:01 Muennighoff

Ah yes thank you! :) I knew about the "Files" tab in spaces but somehow overlooked it this time 😅

I'm currently testing with intfloat/multilingual-e5-small.

As you might know, these need to be prompted in a specific way for full capability: https://huggingface.co/intfloat/multilingual-e5-large#usage
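For context, the e5 model card asks for "query: " and "passage: " prefixes before encoding. A minimal sketch of a wrapper that adds them; the `E5Wrapper` class and the `encode_queries`/`encode_corpus` method names are illustrative here, not MTEB's actual model interface:

```python
class E5Wrapper:
    """Adds the prefixes the e5 model card asks for before encoding.

    `model` is assumed to expose a SentenceTransformer-style .encode().
    """

    def __init__(self, model):
        self.model = model

    def encode_queries(self, queries, **kwargs):
        return self.model.encode([f"query: {q}" for q in queries], **kwargs)

    def encode_corpus(self, corpus, **kwargs):
        # Corpus entries may be dicts with an optional title, as in BEIR.
        texts = []
        for doc in corpus:
            if isinstance(doc, dict):
                body = f"{doc.get('title', '')} {doc['text']}".strip()
            else:
                body = doc
            texts.append(f"passage: {body}")
        return self.model.encode(texts, **kwargs)
```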

I can't find a corresponding model class in mtebscripts although the e5 embeddings are on the leaderboard.

And how do I contribute a benchmark specifically? Everything runs fine with my run_mteb_german.py.

As I understand the README and the other scripts, that's enough, and you take care of running it for different embeddings?

If so the only things I see left to do are:

  • clean/finish up my fork a bit
  • host the GermanQuAD dataset in BEIR format on HF instead of generating it locally
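For reference, the BEIR layout mentioned above is conventionally a `corpus.jsonl`, a `queries.jsonl`, and a qrels TSV. A minimal sketch of emitting that layout from (question, context) pairs; the `write_beir` helper and the sample data are hypothetical, and the field names follow the usual BEIR convention (`_id`, `title`, `text`, `query-id`/`corpus-id`/`score`):

```python
import json
from pathlib import Path

def write_beir(pairs, out_dir):
    """pairs: list of (question, context) tuples, one relevant context each.

    Note: a real GermanQuAD export would need to deduplicate contexts,
    since several questions can share the same context.
    """
    out = Path(out_dir)
    (out / "qrels").mkdir(parents=True, exist_ok=True)
    with open(out / "corpus.jsonl", "w", encoding="utf-8") as corpus, \
         open(out / "queries.jsonl", "w", encoding="utf-8") as queries, \
         open(out / "qrels" / "test.tsv", "w", encoding="utf-8") as qrels:
        qrels.write("query-id\tcorpus-id\tscore\n")
        for i, (question, context) in enumerate(pairs):
            qid, cid = f"q{i}", f"c{i}"
            queries.write(json.dumps(
                {"_id": qid, "text": question}, ensure_ascii=False) + "\n")
            corpus.write(json.dumps(
                {"_id": cid, "title": "", "text": context},
                ensure_ascii=False) + "\n")
            qrels.write(f"{qid}\t{cid}\t1\n")
```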

I will create and link a draft PR in a minute, so you can compare the changes more easily ;)

rasdani avatar Jan 05 '24 17:01 rasdani

Draft PR: #197

rasdani avatar Jan 05 '24 17:01 rasdani

One more note: deepset/germanquad has only one relevant context per question. Therefore only one matching context can be retrieved from the corpus, so MRR would be the best metric to score this, correct?

rasdani avatar Jan 05 '24 17:01 rasdani

Great work! For running the evaluation, they have a section on their HF hub page: https://huggingface.co/intfloat/multilingual-e5-large#mteb-benchmark-evaluation

Therefore only one matching context can be retrieved from the corpus and MRR would be the best metric to score this, correct?

I think you can still use nDCG but MRR is fine too
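For the single-relevant-document case the two metrics differ only in the rank discount: the reciprocal rank is 1/rank, while nDCG@k gives 1/log2(rank + 1) (the ideal DCG is 1 when there is exactly one relevant document). A small sketch with hypothetical ranks:

```python
import math

def single_rel_metrics(rank, k):
    """Scores for a query whose only relevant doc sits at `rank` (1-based).

    Returns (reciprocal rank, nDCG) at cutoff k; both are 0 if the
    relevant document falls outside the top k.
    """
    if rank > k:
        return 0.0, 0.0
    rr = 1.0 / rank
    ndcg = 1.0 / math.log2(rank + 1)  # ideal DCG is 1 with one relevant doc
    return rr, ndcg

for rank in (1, 2, 4):
    rr, ndcg = single_rel_metrics(rank, 10)
    print(rank, round(rr, 3), round(ndcg, 3))
```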

Muennighoff avatar Jan 05 '24 18:01 Muennighoff

FYI there is this German fork already https://github.com/jina-ai/mteb-de?ref=jina-ai-gmbh.ghost.io

malteos avatar Jan 16 '24 16:01 malteos

FYI there is this German fork already https://github.com/jina-ai/mteb-de?ref=jina-ai-gmbh.ghost.io

Nice, do you plan on opening a PR? Would be great to help 🙌

Muennighoff avatar Jan 16 '24 16:01 Muennighoff

Given the merged PR, it seems like this issue is resolved? Though we might still be missing a German tab on the leaderboard. If that is the case, we can create a separate issue for it.

KennethEnevoldsen avatar Mar 05 '24 07:03 KennethEnevoldsen

Yes, will close this issue then, as there has been lots of good development. Thanks all!

Will open a new one for the German tab and see if/how I can support!

achibb avatar May 04 '24 17:05 achibb