mteb
feat: Add MLQA dataset for CrossLingual Retrieval
I have made an attempt to add the MLQA cross-lingual retrieval dataset. I've updated `AbsTaskRetrieval` to handle cross-lingual datasets. I ran the task on one language pair, Arabic and German: the corpus is in Arabic and the questions are in German. You can see the results in the uploaded JSON files for `intfloat/multilingual-e5-small`.
I'm still missing this task:
- [x] Better handle the language mapping and loading (I know how to do it, will do tomorrow)
Checklist for adding MMTEB dataset
Reason for dataset addition:
- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [x] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling() under dataset_transform()`
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
@isaac-chung this PR is ready for review. I ran both models on one language pair, but a complete evaluation is coming (it takes some time for 49 pairs).
I didn't add the named tuple as discussed here, as it would require updating all CrossLingual tasks. I'd prefer to add this in another PR with all the necessary refactorings.
For now, the language is handled this way:
eval_langs = {
    "eng-Latn_fra-Latn": ["eng-Latn", "fra-Latn"],
    ...
}
We can call the task on a language pair (e.g., from English to Spanish) like this:
mteb -t MLQARetrieval -m intfloat/multilingual-e5-small -l eng-Latn_spa-Latn
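For illustration, here is a minimal sketch of how a mapping shaped like `eval_langs` could be generated rather than written by hand. It is not the code in this PR: the helper name `build_eval_langs`, the `sep` parameter, and the exact language codes in `MLQA_LANGS` are assumptions made for the example. It only shows that seven languages yield the 49 ordered pairs mentioned above.

```python
from itertools import product

# Assumed codes for the seven MLQA languages (illustrative, not taken from the PR).
MLQA_LANGS = [
    "ara-Arab",
    "deu-Latn",
    "eng-Latn",
    "spa-Latn",
    "hin-Deva",
    "vie-Latn",
    "zho-Hans",
]


def build_eval_langs(langs, sep="_"):
    """Hypothetical helper: map each ordered pair key to its two language codes."""
    # product(..., repeat=2) yields every ordered pair, including same-language pairs.
    return {f"{a}{sep}{b}": [a, b] for a, b in product(langs, repeat=2)}


eval_langs = build_eval_langs(MLQA_LANGS)
print(len(eval_langs))                  # 49
print(eval_langs["eng-Latn_spa-Latn"])  # ['eng-Latn', 'spa-Latn']
```

Generating the keys from a single list keeps the pair names and their language lists in sync, and the separator can be changed in one place if the key format changes.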
Edit: one test is failing, I'm looking at it
@isaac-chung @Andrian0s I fixed the code to build language pairs in this format: `lang1-lang2` (see screenshot below), and renamed `_eval_monolingual()` in `AbsTaskRetrieval`. I think it's ready for merging 🙂
Looks good. If the tests pass, I would also support merging.
Okay, I may have broken something; the tests all pass locally. I'm checking.