
Adding MIRACL Retrieval

thakur-nandan opened this issue 9 months ago • 11 comments

I am adding MIRACL Retrieval as discussed in https://github.com/embeddings-benchmark/mteb/issues/198.

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • [x] I have tested that the dataset runs with the mteb package.
  • [x] I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (see the sketch after this checklist).
    • [ ] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [x] intfloat/multilingual-e5-small
  • [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores). See discussion below. I am getting lower eval scores than reported.
  • [ ] If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [x] Run tests locally to make sure nothing is broken using make test.
  • [x] Run the formatter to format the code using make lint.
  • [x] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
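
As a point of reference, here is a minimal sketch of such a run via the mteb Python API; the task name "MIRACLRetrieval" and the dev split are assumptions for illustration, and the mteb run CLI from the checklist is the equivalent.

```python
# Hedged sketch of reproducing the checklist run with the mteb Python API.
# The task name "MIRACLRetrieval" and the "dev" split are assumptions based on
# this PR; the `mteb run -m ... -t ...` CLI from the checklist is the equivalent.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model_name = "intfloat/multilingual-e5-small"
model = SentenceTransformer(model_name)

evaluation = MTEB(tasks=["MIRACLRetrieval"])
evaluation.run(
    model,
    eval_splits=["dev"],
    output_folder=f"results/{model_name}",
)
```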

Thank you for waiting. I have the MIRACL Retrieval nDCG@10 scores ready for the following model: intfloat/multilingual-e5-small. I achieved much lower scores than reported in the E5 paper, Table 6 (https://arxiv.org/abs/2402.05672). I am running the mContriever model (link) and will update the PR once I have all scores compared against the MIRACL 2CR (link).

I was hoping someone could look into the difference in reproduction and find the issue.

| MIRACL Dev | Original (Reported) | MTEB (Repro) |
|------------|---------------------|--------------|
| ar | 0.714 | 0.678 |
| bn | 0.682 | 0.672 |
| de | -     | 0.434 |
| en | 0.480 | 0.425 |
| es | 0.512 | 0.455 |
| fa | 0.533 | 0.467 |
| fi | 0.733 | 0.699 |
| fr | 0.476 | 0.403 |
| hi | 0.552 | 0.510 |
| id | 0.507 | 0.473 |
| ja | 0.636 | 0.590 |
| ko | 0.612 | 0.591 |
| ru | 0.591 | 0.542 |
| sw | 0.684 | 0.652 |
| te | 0.813 | 0.793 |
| th | 0.750 | 0.697 |
| yo | -     | 0.124 |
| zh | 0.459 | 0.375 |

Regards, Nandan

thakur-nandan avatar May 06 '24 23:05 thakur-nandan

I'm not sure whether the languages covered in MIRACL are new enough to be considered for a bonus.

Nevertheless, I have added 2 points for adding the MIRACL dataset.

Hope it helps!

thakur-nandan avatar May 06 '24 23:05 thakur-nandan

@imenelydiaker @thakur-nandan I will soon open an issue about e5 performance reproduction. The issue is (at least through the retrieval evaluator) that we don't append the correct prompt. I have verified this by observing what goes into model.encode.

@thakur-nandan if you have known results for a non-e5 model, can you rerun with that and confirm whether the discrepancy is at least smaller in that case?

Andrian0s avatar May 07 '24 08:05 Andrian0s
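
As context for the prompt point above, here is a minimal, hedged sketch (assumed interface, not mteb's actual evaluator code) of the prefixes the multilingual-e5 models expect: "query: " in front of queries and "passage: " in front of corpus texts. The encode_queries/encode_corpus hook names are assumptions about what a retrieval evaluator might call.

```python
# Minimal sketch (assumed interface, not mteb's actual code) of the prompt
# prefixes the multilingual-e5 models expect. If the retrieval evaluator passes
# raw strings straight to model.encode, these prefixes are missing, which is one
# plausible cause of the reproduction gap discussed in this thread.
from sentence_transformers import SentenceTransformer


class E5WithPrefixes:
    """Wraps a SentenceTransformer and prepends the E5 "query: "/"passage: " prefixes."""

    def __init__(self, model_name: str = "intfloat/multilingual-e5-small"):
        self.model = SentenceTransformer(model_name)

    def encode_queries(self, queries, **kwargs):
        # E5 queries are expected to start with "query: "
        return self.model.encode([f"query: {q}" for q in queries], **kwargs)

    def encode_corpus(self, corpus, **kwargs):
        # E5 passages are expected to start with "passage: "; corpus entries in
        # BEIR-style tasks are usually dicts with optional "title" and "text" fields.
        texts = [
            f"passage: {(doc.get('title', '') + ' ' + doc['text']).strip()}"
            if isinstance(doc, dict)
            else f"passage: {doc}"
            for doc in corpus
        ]
        return self.model.encode(texts, **kwargs)
```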

> @imenelydiaker @thakur-nandan I will soon open an issue about e5 performance reproduction. The issue is (at least through the retrieval evaluator) that we don't append the correct prompt. I have verified this by observing what goes into model.encode.

So you're saying there is an issue with the RetrievalEvaluator?

imenelydiaker avatar May 07 '24 12:05 imenelydiaker

> @imenelydiaker @thakur-nandan I will soon open an issue about e5 performance reproduction. The issue is (at least through the retrieval evaluator) that we don't append the correct prompt. I have verified this by observing what goes into model.encode.
>
> So you're saying there is an issue with the RetrievalEvaluator?

Yes. I plan to make a more elaborate check and put this up in the issues with all the necessary information (I am not very available in the next 1.5 days; if it's urgent, feel free to pick it up).

This is also affecting my PR #645 in the same way: the multilingual-e5 models underperform because of that.

Andrian0s avatar May 07 '24 12:05 Andrian0s

> @imenelydiaker @thakur-nandan I will soon open an issue about e5 performance reproduction. The issue is (at least through the retrieval evaluator) that we don't append the correct prompt. I have verified this by observing what goes into model.encode.
>
> So you're saying there is an issue with the RetrievalEvaluator?
>
> Yes. I plan to make a more elaborate check and put this up in the issues with all the necessary information (I am not very available in the next 1.5 days; if it's urgent, feel free to pick it up).
>
> This is also affecting my PR #645 in the same way: the multilingual-e5 models underperform because of that.

I guess that if we're not appending the correct information for the evaluator, then the issue is not only with E5 but with other models as well? It would be nice if you could open an issue with your observations; I'll take a look at it then and try to fix it.

imenelydiaker avatar May 07 '24 13:05 imenelydiaker

@Muennighoff @KennethEnevoldsen @imenelydiaker I found an issue with the retrieval dataset evaluation: (query_id, doc_id) pairs are always explicitly removed when the two ids are identical. This was introduced in BEIR to avoid self-retrieval in Quora and ArguAna, but it leads to lower performance on MIRACL.

After including the following changes, I'm running mContriever scores on MIRACL retrieval for all languages and checking them. With a quick evaluation on Yoruba, I achieve 0.4182 nDCG@10 with MTEB (originally reported: 0.415).

thakur-nandan avatar May 07 '24 16:05 thakur-nandan
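
To make the behaviour described above concrete, here is a hedged sketch (not the exact mteb/BEIR code) of a top-k trim that drops hits whose doc_id equals the query_id.

```python
# Hedged sketch (not the exact mteb/BEIR code) of the filtering behaviour
# described above: when trimming retrieved results to the top-k, any hit whose
# doc_id equals the query_id is dropped. In Quora/ArguAna this avoids trivial
# self-retrieval; in MIRACL the ids can coincide even though the query and the
# passage are different items, so relevant hits get discarded and nDCG@10 drops.
from typing import Dict


def top_k_skipping_identical_ids(
    results: Dict[str, Dict[str, float]], k: int = 10
) -> Dict[str, Dict[str, float]]:
    """results maps query_id -> {doc_id: score}; returns the top-k hits per query."""
    trimmed: Dict[str, Dict[str, float]] = {}
    for query_id, doc_scores in results.items():
        ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        kept = [(doc_id, score) for doc_id, score in ranked if doc_id != query_id]
        trimmed[query_id] = dict(kept[:k])
    return trimmed
```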

@thakur-nandan it seems like this PR will influence the scores of other tasks, which might be problematic for comparisons. @Muennighoff what is the best approach here?

I see two potential solutions:

  1. Update the scores on Quora and ArguAna to use the new score, or do it for MIRACL (this seems problematic for comparison).
  2. Alternatively, use both scores, nDCG@10 and nDCG@10 (no self-retrieval) (I believe this approach is best).

KennethEnevoldsen avatar May 08 '24 11:05 KennethEnevoldsen

I think @thakur-nandan probably knows best how to reconcile it with Quora & ArguAna as he created them? The 2nd approach sounds good to me.

Muennighoff avatar May 08 '24 15:05 Muennighoff

Thanks for checking this PR.

So the scores will not be affected, as self-retrieval is also double-checked during evaluation here with the flag ignore_identical_ids set to True, which is the desirable way to go.

https://github.com/embeddings-benchmark/mteb/blob/0cf33d73b1f3ff5be1d3689f2aa8abbbe4454c99/mteb/evaluation/evaluators/RetrievalEvaluator.py#L427

Hence, AFAIK we can safely remove the if corpus_id != query_id: line that is included in the PR, @imenelydiaker @KennethEnevoldsen.

I have two suggestions here:

1. Keep the code as is with ignore_identical_ids=True, but inform users to keep the query_ids and document_ids distinct from each other; e.g. for MIRACL I pass ignore_identical_ids=False.
2. Change the default to ignore_identical_ids=False, but make sure to either hard-code it or remind authors to keep setting ignore_identical_ids=True for ArguAna and Quora in BEIR.

Since you are the PR reviewers: The veto power lies with you and I'll let you all decide: @Muennighoff @KennethEnevoldsen @imenelydiaker.

Thanks, Nandan

thakur-nandan avatar May 10 '24 16:05 thakur-nandan

@thakur-nandan I believe option 2 is the desirable option. Though I would not want the user to switch it. Instead, I would a) create two separate scores (one with and one without) or b) allow the argument to be overwritten during dataset construction:

```python
class ArguAna(AbsTaskRetrieval):
    ignore_identical_ids = True

    metadata = TaskMetadata(
        name="ArguAna",
        ...
    )
```

Either approach is fine with me, but I would probably prefer a) (however, if one is easier to implement, go for that one).

KennethEnevoldsen avatar May 11 '24 13:05 KennethEnevoldsen
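
As one possible reading of approach a), here is a hedged sketch that scores the same run twice with pytrec_eval (which mteb's retrieval evaluation builds on), once as-is and once with doc_id == query_id hits removed; the output key names are illustrative only.

```python
# Hedged sketch of approach a): compute nDCG@10 twice from the same run, once
# as-is and once with doc_id == query_id hits removed. Uses pytrec_eval (which
# mteb's retrieval evaluation builds on); the output key names are illustrative.
import pytrec_eval


def ndcg_at_10_both(qrels, results):
    """qrels: {query_id: {doc_id: relevance}}; results: {query_id: {doc_id: score}}."""
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})

    def mean_ndcg(run):
        per_query = evaluator.evaluate(run)
        return sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query)

    # Drop hits where the retrieved doc_id equals the query_id.
    run_without_self = {
        qid: {doc_id: score for doc_id, score in docs.items() if doc_id != qid}
        for qid, docs in results.items()
    }
    return {
        "ndcg_at_10": mean_ndcg(results),
        "ndcg_at_10_no_self_retrieval": mean_ndcg(run_without_self),
    }
```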

@thakur-nandan we'll go for option 2 as @KennethEnevoldsen mentioned; we would love your help on this! 🙂

imenelydiaker avatar May 14 '24 20:05 imenelydiaker

@thakur-nandan I would love to get this PR merged in as soon as possible. Would you have the time to do this?

KennethEnevoldsen avatar May 21 '24 09:05 KennethEnevoldsen

Hi @KennethEnevoldsen @imenelydiaker, thanks for your suggestions on the topic. I'll start with suggestion 2 (a) of keeping separate scores for nDCG@10 with and without self-retrieval. I haven't had time recently to look at the PR, but I will try to get it done by EoD tomorrow.

Regards, Nandan

thakur-nandan avatar May 21 '24 13:05 thakur-nandan

Wonderful to hear, @thakur-nandan! I'll keep an eye out for it so that the review can be resolved quickly.

KennethEnevoldsen avatar May 21 '24 14:05 KennethEnevoldsen

@imenelydiaker any chance you can finish up this PR? I have started finishing up #641.

KennethEnevoldsen avatar May 27 '24 12:05 KennethEnevoldsen

> @imenelydiaker any chance you can finish up this PR? I have started finishing up #641.

Will do yes!

imenelydiaker avatar May 27 '24 13:05 imenelydiaker

@KennethEnevoldsen @imenelydiaker, I just added the nDCG@10 self metric score separately. Feel free to use it and finish the PR. My cycles for this week are limited and I will not be able to finish this PR.

Apologies for the delay!

thakur-nandan avatar May 27 '24 13:05 thakur-nandan

> @KennethEnevoldsen @imenelydiaker, I just added the nDCG@10 self metric score separately. Feel free to use it and finish the PR. My cycles for this week are limited and I will not be able to finish this PR.
>
> Apologies for the delay!

Thank you @thakur-nandan for this great work, we'll finish it up! 🙂

imenelydiaker avatar May 27 '24 14:05 imenelydiaker

Merging as in https://github.com/embeddings-benchmark/mteb/pull/641.

imenelydiaker avatar May 27 '24 14:05 imenelydiaker