Adding MIRACL Retrieval
I am adding MIRACL Retrieval as discussed in https://github.com/embeddings-benchmark/mteb/issues/198.
Checklist for adding MMTEB dataset

Reason for dataset addition:

- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [ ] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores). See the discussion below: I am getting lower eval scores than reported.
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [x] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
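For anyone reproducing the runs through the Python API instead of the CLI, a minimal sketch is below. The task name `MIRACLRetrieval` is an assumption about how this PR registers the task; adjust it if the final name differs.

```python
# Minimal reproduction sketch; the task name "MIRACLRetrieval" is assumed.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")
evaluation = MTEB(tasks=["MIRACLRetrieval"])
evaluation.run(model, output_folder="results/multilingual-e5-small")
```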
Thank you for waiting. I have the MIRACL Retrieval nDCG@10 scores ready for the following model: `intfloat/multilingual-e5-small`. I achieved much lower scores than reported in Table 6 of the E5 paper (https://arxiv.org/abs/2402.05672). I am running the mContriever model (link) and will update the PR once I have all scores compared against the MIRACL 2CR (link).
I was hoping someone could look into the difference in reproduction and find the issue.
| Language (MIRACL Dev) | Original (Reported) | MTEB (Repro) |
|---|---|---|
| ar | 0.714 | 0.678 |
| bn | 0.682 | 0.672 |
| de | - | 0.434 |
| en | 0.480 | 0.425 |
| es | 0.512 | 0.455 |
| fa | 0.533 | 0.467 |
| fi | 0.733 | 0.699 |
| fr | 0.476 | 0.403 |
| hi | 0.552 | 0.510 |
| id | 0.507 | 0.473 |
| ja | 0.636 | 0.590 |
| ko | 0.612 | 0.591 |
| ru | 0.591 | 0.542 |
| sw | 0.684 | 0.652 |
| te | 0.813 | 0.793 |
| th | 0.750 | 0.697 |
| yo | - | 0.124 |
| zh | 0.459 | 0.375 |
Regards, Nandan
I'm not sure whether the languages covered by MIRACL count as new languages for a bonus.
Nevertheless, I have added 2 points for adding the MIRACL dataset.
Hope it helps!
@imenelydiaker @thakur-nandan I am soon opening an issue about e5 performance reproduction. The issue is (at least through the retrieval evaluator) that we don't append the correct prompt. I have verified this by observing the input of what goes into model.encode.
@thakur-nandan if you have known results for a non-e5 model, can you rerun with that and confirm that the discrepancy is at least smaller then?
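For context, the multilingual-e5 models expect instruction-style prefixes on their inputs ("query: " for queries, "passage: " for documents). The sketch below is only illustrative of that expected format; it is not how mteb currently builds the inputs, which is exactly the problem being described.

```python
# Illustrative only: the prefixes multilingual-e5 models expect on their inputs.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

queries = ["query: " + q for q in ["what is the capital of Finland"]]
passages = ["passage: " + p for p in ["Helsinki is the capital of Finland."]]

query_emb = model.encode(queries, normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)
scores = query_emb @ passage_emb.T  # cosine similarity, since embeddings are normalized
```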
So you're saying there is an issue with the `RetrievalEvaluator`?
Yes. I plan to make a more elaborate check and put this up in the issues with all the necessary information (I am not very available in the next 1.5 days; if it's urgent, feel free to pick it up).
This is also affecting my PR #645 in the same way: the multilingual-e5 models underperform because of it.
I guess that if we're not appending the correct information to the evaluator, then the issue is not only with E5 but with other models as well? It would be nice if you could open an issue with your observations; I'll take a look at it then and try to fix it.
@Muennighoff @KennethEnevoldsen @imenelydiaker I found an issue with the retrieval dataset evaluation: results where the `query_id` and `doc_id` are the same are always explicitly removed. This was introduced in BEIR to avoid self-retrieval in Quora and ArguAna, but it is leading to lower performance on MIRACL.
After including the following changes, I'm running mContriever scores on MIRACL retrieval for all languages and checking them. In a quick evaluation on Yoruba I achieve 0.4182 nDCG@10 with MTEB (original reported: 0.415).
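To make the filtering concrete, here is an illustrative sketch (simplified names, not the exact evaluator code): any retrieved document whose id equals the query id is dropped before scoring, which in MIRACL can discard legitimately relevant documents whose ids happen to coincide with query ids.

```python
# Illustrative sketch of the behaviour under discussion (names simplified).
def drop_self_retrieval(results: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    return {
        query_id: {
            doc_id: score
            for doc_id, score in docs.items()
            if doc_id != query_id  # the filter in question
        }
        for query_id, docs in results.items()
    }
```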
@thakur-nandan it seems like this PR will influence the scores of other tasks, which might be problematic for comparisons. @Muennighoff what is the best approach here?
I see two potential solutions:
- Update the scores on Quora and ArguAna to use the new scoring, or do it only for MIRACL (this seems problematic for comparison).
- Alternatively, report both scores: `nDCG@10` and `nDCG@10 (no self-retrieval)` (I believe this approach is best).
I think @thakur-nandan probably knows best how to reconcile it with Quora & ArguAna as he created them? The 2nd approach sounds good to me.
Thanks for checking this PR.
So, the scores will not be affected, as self-retrieval is also double-checked during evaluation here with the flag `ignore_identical_ids` set to `True`, which is the desirable way to go.
https://github.com/embeddings-benchmark/mteb/blob/0cf33d73b1f3ff5be1d3689f2aa8abbbe4454c99/mteb/evaluation/evaluators/RetrievalEvaluator.py#L427
Hence, AFAIK we can safely remove the `if corpus_id != query_id:` line that was included in the PR @imenelydiaker @KennethEnevoldsen.
I have two suggestions here:
(1) Keep the code as is with `ignore_identical_ids=True`, but inform users to keep the `query_id`s and `document_id`s distinct from each other; e.g. for MIRACL I pass `ignore_identical_ids=False`.
(2) Change the default to `ignore_identical_ids=False`, but make sure to either hard-code it or remind authors to keep setting `ignore_identical_ids=True` for ArguAna and Quora in BEIR.
Since you are the PR reviewers, the veto power lies with you and I'll let you all decide: @Muennighoff @KennethEnevoldsen @imenelydiaker.
Thanks, Nandan
@thakur-nandan I believe option 2 is the desirable one, though I would not want the user to have to switch it. Instead, I would either a) create two separate scores (one with and one without self-retrieval) or b) allow the argument to be overwritten during dataset construction:
```python
class ArguAna(AbsTaskRetrieval):
    ignore_identical_ids = True

    metadata = TaskMetadata(
        name="ArguAna",
        ...
    )
```
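For option b), a hypothetical sketch of how the evaluator side could then pick up such a per-task attribute; the helper name is made up here and the actual wiring in `RetrievalEvaluator` may differ.

```python
# Hypothetical helper: read a per-task override while keeping today's
# behaviour (ignore identical ids) as the default. MIRACL would set
# ignore_identical_ids = False on its task class; ArguAna/Quora keep True.
def resolve_ignore_identical_ids(task) -> bool:
    return getattr(task, "ignore_identical_ids", True)
```

The evaluator would then pass the resolved value wherever it currently hard-codes the flag.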
Either approach is fine with me, but I would probably prefer a); however, if one is easier to implement, go for that one.
@thakur-nandan we would go for option 2 as @KennethEnevoldsen suggested; we would love your help on this! 🙂
@thakur-nandan I would love to get this PR merged in as soon as possible. Would you have the time to do this?
Hi @KennethEnevoldsen @imenelydiaker, thanks for your suggestions on the topic. I'll start with suggestion 2 a), keeping separate scores for nDCG@10 with and without self-retrieval. I haven't had time recently to look at the PR; I will try to get it done by EoD tomorrow.
Regards, Nandan
Wonderful to hear, @thakur-nandan! I'll keep an eye out for it so that the review can be resolved quickly.
@imenelydiaker any chance you can finish up this PR? I have started finishing up #641.
Will do yes!
@KennethEnevoldsen @imenelydiaker, I just added the self-retrieval nDCG@10 metric as a separate score. Feel free to use it and finish the PR. My cycles for this week are limited and I will not be able to finish this PR.
Apologies for the delay!
Thank you @thakur-nandan for this great work, we'll finish it up! 🙂
Merging as in https://github.com/embeddings-benchmark/mteb/pull/641.