Add MIRACL

Open Muennighoff opened this issue 1 year ago • 14 comments

Jan 09 '24 08:01 Muennighoff

I may be able to do this. Is anyone working on it? @Muennighoff @KennethEnevoldsen

Should I wait until this PR https://github.com/embeddings-benchmark/mteb/pull/233 is merged? I think the PR addresses BEIR specifically, so maybe there is no need to wait until it's merged.

  • Queries and qrels: https://huggingface.co/datasets/miracl/miracl
  • Corpus: https://huggingface.co/datasets/miracl/miracl-corpus

Mar 05 '24 21:03 imenelydiaker

Parts of MIRACL are already in MTEB, I think. Ideally we unify them so that there is just one MIRACL task from which the different datasets can be selected.

Mar 05 '24 23:03 Muennighoff

> Parts of MIRACL are already in MTEB, I think. Ideally we unify them so that there is just one MIRACL task from which the different datasets can be selected.

Yep, the Jina team added some of it (de and es) here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/MIRACLRetrieval.py
And the Korean version is here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/KoMiracl.py

The original dataset is here, with all these languages included:

  • Queries and qrels: https://huggingface.co/datasets/miracl/miracl
  • Corpus: https://huggingface.co/datasets/miracl/miracl-corpus

Maybe we can just use the original dataset with all provided languages? Do you prefer duplicating the data into an MTEB data repository to make sure it will always be available for the benchmark, or should we just load the dataset from its original HF repo and transform it on load?

Mar 06 '24 08:03 imenelydiaker

> Do you prefer duplicating the data into an MTEB data repository to make sure it will always be available for the benchmark, or should we just load the dataset from its original HF repo and transform it on load?

Generally, the first option is more robust, but we currently have multiple datasets that use the second one. The second option is also easier to update (if the dataset is updated).

For MIRACL, though, I would just go for option 1 (permissive license), as long as there are no planned updates.
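
For reference, the "transform on load" step that either option needs could look roughly like the sketch below. It is only an illustration under stated assumptions: that MIRACL topic records carry `query_id`, `query`, and `positive_passages` / `negative_passages` lists whose items have `docid`, `title`, and `text`, and that the target is a BEIR-style triple of dictionaries (corpus, queries, qrels). The function name `to_retrieval_format` and the sample record are hypothetical, and a real task would take the corpus from miracl/miracl-corpus rather than only from the judged passages.

```python
from typing import Dict, Iterable, Tuple

Corpus = Dict[str, Dict[str, str]]   # doc id -> {"title": ..., "text": ...}
Queries = Dict[str, str]             # query id -> query text
Qrels = Dict[str, Dict[str, int]]    # query id -> {doc id: relevance}


def to_retrieval_format(records: Iterable[dict]) -> Tuple[Corpus, Queries, Qrels]:
    """Convert MIRACL-style records (assumed schema) into corpus, queries, qrels."""
    corpus: Corpus = {}
    queries: Queries = {}
    qrels: Qrels = {}
    for rec in records:
        qid = str(rec["query_id"])
        queries[qid] = rec["query"]
        qrels[qid] = {}
        # Judged positives get relevance 1, judged negatives get relevance 0.
        for label, key in ((1, "positive_passages"), (0, "negative_passages")):
            for passage in rec.get(key, []):
                docid = str(passage["docid"])
                corpus[docid] = {
                    "title": passage.get("title", ""),
                    "text": passage["text"],
                }
                qrels[qid][docid] = label
    return corpus, queries, qrels


# Hypothetical sample in the assumed schema, for illustration only.
sample = [
    {
        "query_id": "q1",
        "query": "Wer schrieb Faust?",
        "positive_passages": [
            {"docid": "d1#0", "title": "Faust", "text": "Faust ist ein Drama von Goethe."}
        ],
        "negative_passages": [
            {"docid": "d2#3", "title": "Berlin", "text": "Berlin ist die Hauptstadt."}
        ],
    }
]
corpus, queries, qrels = to_retrieval_format(sample)
print(qrels)
```

With option 1, the output of such a conversion would be uploaded once to an MTEB data repository; with option 2, the same conversion would run each time the task loads the original dataset.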

Mar 06 '24 10:03 KennethEnevoldsen

Hi @KennethEnevoldsen @imenelydiaker @Muennighoff,

@crystina-z and I are co-authors of the MIRACL benchmark.

We just saw the announcement on Twitter/X of building out the multilingual MTEB, and I saw this issue is open.

We would be happy to help you to integrate MIRACL within MMTEB. Please ask us directly if you have any questions.

Thanks, Nandan

Apr 11 '24 17:04 thakur-nandan

Hi @thakur-nandan, very happy to have you guys on board. It seems that MIRACL is partly added in a few different ways (partially for some languages, both as a reranking task and as a retrieval task). You guys might be interested in unifying those and adding the missing languages? If you have the time, of course. If you do not, I will mark this thread with "help wanted".

Apr 11 '24 17:04 KennethEnevoldsen

Hi @KennethEnevoldsen sounds good and we'd love to help! I can take the reranking task and @thakur-nandan will handle the retrieval task. We'll start this week and get back as soon as we can.

Apr 11 '24 17:04 crystina-z

Wonderful @crystina-z and @thakur-nandan.

Apr 11 '24 20:04 KennethEnevoldsen

Are you still interested in adding this? Would be amazing! 🙌 cc @crystina-z @thakur-nandan @imenelydiaker

May 01 '24 15:05 Muennighoff

@crystina-z @thakur-nandan if you haven't started yet I can take it from here 😊

May 01 '24 15:05 imenelydiaker

@imenelydiaker Thanks for your help. @crystina-z and I have already started looking into both the reranking and retrieval tasks and should have the PR soon!

Regards, Nandan Thakur

May 01 '24 22:05 thakur-nandan

Hi all! I just submitted #641 for the reranking part. Let me know what you think!

May 06 '24 22:05 crystina-z

Submitted #642 for the retrieval part. I have not been able to successfully reproduce the mE5-small nDCG@10 numbers.
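
When comparing numbers like these, it can help to pin down the metric itself. Below is a minimal sketch of nDCG@10 for a single query, assuming linear gain and a log2 rank discount; evaluation toolkits may differ in gain function or tie handling, so this is not necessarily what MTEB computes. The function name `ndcg_at_k` and its inputs are hypothetical.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query, with linear gain and log2(rank + 1) discount."""

    def dcg(gains: List[int]) -> float:
        # enumerate() is 0-based, so rank r (1-based) is discounted by log2(r + 1).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    # Gains of the system ranking, truncated at k; unjudged documents count as 0.
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # Ideal ranking: all judged documents sorted by decreasing relevance.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

The reported score is usually the mean of this value over all queries, so differences in the query set or in how unjudged documents are treated can also shift the result.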

May 06 '24 23:05 thakur-nandan

We are currently waiting for #833, which is being worked on by @imenelydiaker, so I will add you to this issue as well.

Jun 05 '24 18:06 KennethEnevoldsen