
feat: Add MLQA dataset for CrossLingual Retrieval

Open imenelydiaker opened this issue 10 months ago • 1 comments

I have made an attempt to add the MLQA cross-lingual retrieval dataset. I've updated AbsTaskRetrieval to handle cross-lingual datasets. I ran the task on one language pair (Arabic and German): the corpus is in Arabic and the questions are in German; you can see the results in the uploaded JSON files for intfloat/multilingual-e5-small.

This task is still missing:

  • [x] Better handle the language mapping and loading (I know how to do it, will do tomorrow)

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • [x] I have tested that the dataset runs with the mteb package.
  • [x] I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • [x] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [x] intfloat/multilingual-e5-small
  • [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • [x] If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [x] Run tests locally to make sure nothing is broken using make test.
  • [x] Run the formatter to format the code using make lint.
  • [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

imenelydiaker avatar Apr 24 '24 20:04 imenelydiaker

@isaac-chung this PR is ready for review. I ran both models on one language pair, but a complete evaluation is coming (it takes some time for 49 pairs).

I didn't add the named tuple as discussed here, as it would require updating all CrossLingual tasks. I'd prefer to add it in another PR with all the necessary refactoring.

For now, languages are handled this way:

eval_langs = {
    "eng-Latn_fra-Latn": ["eng-Latn", "fra-Latn"],
    ...
}
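With this mapping, a pair key can be resolved back into its two languages by splitting on the underscore. A minimal, illustrative sketch (the second key and the query/corpus ordering are assumptions, not taken from the PR):

```python
# Illustrative sketch: resolving a cross-lingual pair key into its two
# languages. The "lang-Scrp_lang-Scrp" key format follows the example
# above; which side is the query vs. corpus language is an assumption.
eval_langs = {
    "eng-Latn_fra-Latn": ["eng-Latn", "fra-Latn"],
    "arb-Arab_deu-Latn": ["arb-Arab", "deu-Latn"],  # hypothetical entry
}

def split_pair(pair_key: str) -> tuple[str, str]:
    """Split a key like 'eng-Latn_fra-Latn' into its two languages."""
    first, second = pair_key.split("_")
    return first, second

# Each key round-trips to the language list it maps to.
for key, langs in eval_langs.items():
    assert list(split_pair(key)) == langs
```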

We can call the task on a language pair (e.g., from English to Spanish) like this:

mteb -t MLQARetrieval -m intfloat/multilingual-e5-small -l eng-Latn_spa-Latn
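For context on where the 49 pairs mentioned above come from: MLQA covers 7 languages, and crossing them with themselves yields 7 × 7 = 49 pair keys. A hedged sketch of generating them (the language/script codes below are illustrative and may differ from those used in the PR):

```python
from itertools import product

# MLQA covers 7 languages, so crossing them gives 7 * 7 = 49 pairs,
# matching the "49 pairs" mentioned earlier in the thread. The codes
# below are illustrative, not necessarily those used in the PR.
langs = [
    "eng-Latn", "arb-Arab", "deu-Latn", "spa-Latn",
    "hin-Deva", "vie-Latn", "zho-Hans",
]

eval_langs = {f"{a}_{b}": [a, b] for a, b in product(langs, repeat=2)}

print(len(eval_langs))  # 49
```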

Edit: one test is failing, I'm looking at it

imenelydiaker avatar May 01 '24 21:05 imenelydiaker

@isaac-chung @Andrian0s I fixed the code to build the language pair in this format: lang1-lang2 (see screenshot below), and renamed _eval_monolingual() in AbsTaskRetrieval. I think it's ready for merging 🙂

imenelydiaker avatar May 04 '24 18:05 imenelydiaker

Looks good. If the tests pass, I would also support merging.

Andrian0s avatar May 04 '24 19:05 Andrian0s

Okay, I may have broken something; the tests all pass locally. I'm checking.

imenelydiaker avatar May 04 '24 19:05 imenelydiaker