
MIRACL reranking

Open · crystina-z opened this issue 9 months ago

Adding MIRACL Reranking as discussed in #198

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • [X] I have tested that the dataset runs with the mteb package.
  • [X] I have run the following models on the task, adding the results to the PR (see the Python sketch after this checklist). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
    • [X] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [X] intfloat/multilingual-e5-small
  • [X] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores). (See scores below)
  • [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`
  • [X] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [X] Run tests locally to make sure nothing is broken using `make test`. NOTE: `make test` didn't work for me, failing with `pytest: error: unrecognized arguments: -n`, but `pytest --durations=5` (dropping `-n auto`) passes.
  • [X] Run the formatter to format the code using `make lint`.
  • [X] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
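
As a side note, here is a minimal sketch of how the two checklist models can be evaluated on the new task via mteb's Python API instead of the `mteb run` CLI. The task name `MIRACLReranking` and the output folder layout are assumptions for illustration, not something this PR confirms:

```python
# Minimal sketch (not the exact commands used for the checklist): runs both
# checklist models on the new task through mteb's Python API. The task name
# "MIRACLReranking" is assumed to match the one registered by this PR.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

for model_name in (
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
):
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=["MIRACLReranking"])
    evaluation.run(model, output_folder=f"results/{model_name}")
```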

Thanks for your patience! We finally have the MIRACL reranking part ready. Here are the detailed scores for reranking the BM25 top-100 candidates (BM25, mDPR, and Hybrid are the baselines we had before):

| Language | multilingual-e5-small | paraphrase-multilingual-MiniLM-L12-v2 | BM25 | mDPR | Hybrid |
|----------|----------------------|---------------------------------------|------|------|--------|
| ar | 0.715 | 0.413 | 0.481 | 0.499 | 0.525 |
| bn | 0.690 | 0.080 | 0.508 | 0.443 | 0.501 |
| de | 0.456 | 0.361 | 0.226 | 0.490 | 0.408 |
| en | 0.531 | 0.479 | 0.351 | 0.394 | 0.364 |
| es | 0.576 | 0.459 | 0.319 | 0.478 | 0.418 |
| fa | 0.528 | 0.307 | 0.333 | 0.480 | 0.215 |
| fi | 0.740 | 0.541 | 0.551 | 0.472 | 0.602 |
| fr | 0.459 | 0.357 | 0.183 | 0.435 | 0.314 |
| hi | 0.582 | 0.359 | 0.458 | 0.383 | 0.286 |
| id | 0.536 | 0.415 | 0.449 | 0.272 | 0.392 |
| ja | 0.596 | 0.338 | 0.369 | 0.439 | 0.424 |
| ko | 0.766 | 0.427 | 0.419 | 0.419 | 0.483 |
| ru | 0.571 | 0.399 | 0.334 | 0.407 | 0.391 |
| sw | 0.603 | 0.241 | 0.383 | 0.299 | 0.560 |
| te | 0.759 | 0.144 | 0.494 | 0.356 | 0.528 |
| th | 0.705 | 0.445 | 0.484 | 0.358 | 0.517 |
| yo | 0.541 | 0.408 | 0.406 | 0.396 | 0.415 |
| zh | 0.434 | 0.358 | 0.180 | 0.512 | 0.410 |
| avg | 0.599 | 0.363 | 0.385 | 0.418 | 0.431 |

I noticed that the current reranking evaluation filters out queries that have no positive documents among the top-k candidates, which inflates the scores and makes reranking results based on different first-stage systems incomparable. Also, the evaluation is based on sklearn, which IIRC yields different nDCG scores from pytrec_eval. Considering these issues, I added another Evaluator just for MIRACL, though I believe it should be compatible with the other reranking tasks. Let me know what you think of the new evaluation!
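
To make the sklearn-vs-pytrec point concrete, here is a toy sketch (made-up scores, not the PR's actual evaluator) of one way the two can disagree when a judged positive is missing from the reranked candidate list: sklearn's `ndcg_score` only normalizes against the candidates it is handed, while `pytrec_eval` normalizes against all positives in the qrels.

```python
# Toy sketch (made-up doc ids and scores, not the PR's evaluator) showing why
# sklearn- and pytrec_eval-based nDCG can differ: sklearn only sees the
# candidate list it is given, pytrec_eval sees the full qrels.
import pytrec_eval
from sklearn.metrics import ndcg_score

# One query with two judged-positive docs (d1, d2), but only d1 made it into
# the first-stage top-k candidate list that the reranker scores.
qrels = {"q1": {"d1": 1, "d2": 1, "d3": 0}}
run = {"q1": {"d1": 0.9, "d3": 0.8}}  # d2 is missing from the candidates

# pytrec_eval: the ideal DCG includes the missing positive d2, so nDCG < 1.0.
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
print(evaluator.evaluate(run)["q1"]["ndcg_cut_10"])

# sklearn: the ideal ranking is taken over the candidate list only, so the
# same ranking looks perfect (nDCG == 1.0).
y_true = [[1, 0]]       # relevance of d1, d3 within the candidate list
y_score = [[0.9, 0.8]]  # reranker scores for d1, d3
print(ndcg_score(y_true, y_score, k=10))
```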

crystina-z · May 06 '24 22:05