MIRACL reranking
Adding MIRACL Reranking as discussed in #198
Checklist for adding MMTEB dataset
Reason for dataset addition:
- [X] I have tested that the dataset runs with the `mteb` package.
- [X] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command (see the run example after this list).
  - [X] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [X] `intfloat/multilingual-e5-small`
- [X] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores). (See scores below.)
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [X] I have filled out the metadata object in the dataset file (find documentation on it here).
- [X] Run tests locally to make sure nothing is broken using `make test`. NOTE: `make test` didn't work for me (`pytest: error: unrecognized arguments: -n`), but `pytest --durations=5` (dropping `-n auto`) passes.
- [X] Run the formatter to format the code using `make lint`.
- [x] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
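
For reference, here is a minimal sketch of running the two models above from Python instead of the CLI; the task name `MIRACLReranking` is my assumption for how the new task is registered, so adjust it if the name differs:

```python
# Minimal sketch, assuming the new task is registered as "MIRACLReranking".
from mteb import MTEB
from sentence_transformers import SentenceTransformer

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=["MIRACLReranking"])
    evaluation.run(model, output_folder=f"results/{model_name}")
```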
Thanks for your patience! We finally have the MIRACL reranking part ready. Here are the detailed scores for reranking the BM25 top-100 candidates (BM25, mDPR, and Hybrid are the baselines we had before):
Language | multilingual-e5-small | paraphrase-multilingual-MiniLM-L12-v2 | BM25 | mDPR | Hybrid |
---|---|---|---|---|---|
ar | 0.715 | 0.413 | 0.481 | 0.499 | 0.525 |
bn | 0.690 | 0.080 | 0.508 | 0.443 | 0.501 |
de | 0.456 | 0.361 | 0.226 | 0.490 | 0.408 |
en | 0.531 | 0.479 | 0.351 | 0.394 | 0.364 |
es | 0.576 | 0.459 | 0.319 | 0.478 | 0.418 |
fa | 0.528 | 0.307 | 0.333 | 0.480 | 0.215 |
fi | 0.740 | 0.541 | 0.551 | 0.472 | 0.602 |
fr | 0.459 | 0.357 | 0.183 | 0.435 | 0.314 |
hi | 0.582 | 0.359 | 0.458 | 0.383 | 0.286 |
id | 0.536 | 0.415 | 0.449 | 0.272 | 0.392 |
ja | 0.596 | 0.338 | 0.369 | 0.439 | 0.424 |
ko | 0.766 | 0.427 | 0.419 | 0.419 | 0.483 |
ru | 0.571 | 0.399 | 0.334 | 0.407 | 0.391 |
sw | 0.603 | 0.241 | 0.383 | 0.299 | 0.560 |
te | 0.759 | 0.144 | 0.494 | 0.356 | 0.528 |
th | 0.705 | 0.445 | 0.484 | 0.358 | 0.517 |
yo | 0.541 | 0.408 | 0.406 | 0.396 | 0.415 |
zh | 0.434 | 0.358 | 0.180 | 0.512 | 0.410 |
avg | 0.599 | 0.363 | 0.385 | 0.418 | 0.431 |
I noticed that the current reranking evaluation filters out queries that have no positive documents among the top-k candidates, which inflates the scores and makes reranking results based on different first-stage systems incomparable. The evaluation is also based on sklearn, which IIRC yields different nDCG scores from pytrec_eval. Given these issues, I added another Evaluator just for MIRACL, though I believe it should be compatible with the other reranking tasks as well. Let me know what you think of the new evaluation!
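
To make that concrete, here is a minimal sketch (not the Evaluator added in this PR; the qrels/run below are toy data) of scoring a reranking run with pytrec_eval's nDCG@10 while keeping the queries whose candidate lists contain no positives, so they count as 0 instead of being dropped:

```python
# Minimal sketch: pytrec_eval nDCG@10 over ALL queries, including those whose
# first-stage candidates contain no positive document. Toy qrels/run for illustration.
import pytrec_eval

# query -> {doc_id: graded relevance} for the first-stage candidates
qrels = {
    "q1": {"d1": 1, "d2": 0, "d3": 0},
    "q2": {"d4": 0, "d5": 0},  # no positive among the top-k candidates
}
# query -> {doc_id: reranker score}
run = {
    "q1": {"d1": 0.2, "d2": 0.9, "d3": 0.5},
    "q2": {"d4": 0.7, "d5": 0.1},
}

# Evaluate only the queries that have at least one positive with pytrec_eval ...
has_positive = {q for q, rels in qrels.items() if any(r > 0 for r in rels.values())}
evaluator = pytrec_eval.RelevanceEvaluator(
    {q: qrels[q] for q in has_positive}, {"ndcg_cut.10"}
)
scores = evaluator.evaluate({q: run[q] for q in has_positive})

# ... but average over *all* queries, so queries without positives contribute 0
# instead of silently inflating the mean.
ndcg_10 = sum(scores[q]["ndcg_cut_10"] for q in has_positive) / len(qrels)
print(f"nDCG@10 = {ndcg_10:.4f}")
```

Averaging over all queries is what keeps runs built on different first-stage retrievers comparable: a weaker first stage that misses the positives gets penalised rather than having those queries filtered away.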