Add MIRACL
I may be able to do this; is anyone already working on it? @Muennighoff @KennethEnevoldsen?
Should I wait until this PR https://github.com/embeddings-benchmark/mteb/pull/233 is merged? I think that PR addresses BEIR specifically, so there may be no need to wait for it.
- Queries and qrels: https://huggingface.co/datasets/miracl/miracl
- Corpus: https://huggingface.co/datasets/miracl/miracl-corpus
I think parts of MIRACL are already in MTEB. Ideally we would unify them so that there is just one MIRACL task from which the different languages can be selected.
Yep Jina team added some of it (de and es) here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/MIRACLRetrieval.py And the Korean version is here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/KoMiracl.py
The original dataset is here, with all these languages included:
- Queries and qrels: https://huggingface.co/datasets/miracl/miracl
- Corpus: https://huggingface.co/datasets/miracl/miracl-corpus
Maybe we can just use the original dataset with all provided languages? Do you prefer duplicating the data into an MTEB data repository to make sure it will always be available for the benchmark, or should we load the dataset from its original HF repo and transform it on load?
> Do you prefer duplicating the data into an MTEB data repository to make sure it will always be available for the benchmark, or should we load the dataset from its original HF repo and transform it on load?
Generally, the first option is more robust, but we already have multiple datasets that use the second. The second option is also easier to keep current if the dataset is updated.
For MIRACL, though, I would go for option 1 (the license is permissive), as long as there are no planned updates.
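For reference, option 2 (transform on load) amounts to mapping the raw MIRACL rows into the corpus/queries/qrels dictionaries that MTEB retrieval tasks consume. A minimal sketch, assuming the field names (`query_id`, `query`, `positive_passages`, `docid`, `title`, `text`) from the `miracl/miracl` and `miracl/miracl-corpus` dataset cards:

```python
def miracl_to_mteb(query_rows, corpus_rows):
    """Convert raw MIRACL rows into the corpus/queries/qrels dicts used by
    MTEB retrieval tasks:
      corpus:  {doc_id: {"title": ..., "text": ...}}
      queries: {query_id: query_text}
      qrels:   {query_id: {doc_id: relevance}}
    """
    corpus = {
        row["docid"]: {"title": row.get("title", ""), "text": row["text"]}
        for row in corpus_rows
    }
    queries, qrels = {}, {}
    for row in query_rows:
        qid = row["query_id"]
        queries[qid] = row["query"]
        # Judged-positive passages get relevance 1; unjudged docs are omitted.
        qrels[qid] = {p["docid"]: 1 for p in row["positive_passages"]}
    return corpus, queries, qrels
```

The rows themselves would come from `datasets.load_dataset`, e.g. `load_dataset("miracl/miracl", "en", split="dev")` and `load_dataset("miracl/miracl-corpus", "en", split="train")` (config and split names are assumptions and would need checking against the HF repos).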
Hi @KennethEnevoldsen @imenelydiaker @Muennighoff,
@crystina-z and I are co-authors of the MIRACL benchmark.
We just saw the announcement on Twitter/X about building out a multilingual MTEB, and noticed this issue is open.
We would be happy to help you integrate MIRACL into MMTEB. Please feel free to ask us directly if you have any questions.
Thanks, Nandan
Hi @thakur-nandan, very happy to have you on board. It seems MIRACL has been added in a few different, partial ways (only some languages, and both as a reranking task and as a retrieval task). Would you be interested in unifying those and adding the missing languages? If you have the time, of course. If not, I will mark this thread with "help wanted".
Hi @KennethEnevoldsen sounds good and we'd love to help! I can take the reranking task and @thakur-nandan will handle the retrieval task. We'll start this week and get back as soon as we can.
Wonderful @crystina-z and @thakur-nandan.
Are you still interested in adding this? Would be amazing! 🙌 cc @crystina-z @thakur-nandan @imenelydiaker
@crystina-z @thakur-nandan if you haven't started yet I can take it from here 😊
@imenelydiaker Thanks for your help. @crystina-z and I have already started looking into both the reranking and retrieval tasks and should have the PRs soon!
Regards, Nandan Thakur
Hi all! I just submitted #641 for the reranking part. Let me know what you think!
Submitted #642 for the retrieval part. I have not been able to successfully reproduce the mE5-small nDCG@10 numbers.
We are currently waiting for #833, which @imenelydiaker is working on, so I will add you to this issue as well.