Low BM25 baselines?

Open lintool opened this issue 4 years ago • 0 comments

Hi there, thanks for providing this nice resource!

Looking at your paper, I think your BM25 baselines are a bit low. If I'm not mistaken, Table 2 reports 0.218 nDCG@10 on MS MARCO.

With Pyserini (https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) we get the following, and this result has been widely reproduced:

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap -m ndcg_cut.10   collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec
map                   	all	0.1957
recall_1000           	all	0.8573
ndcg_cut_10           	all	0.2340

So, about 1.6 points higher (0.2340 vs. 0.218)?
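
For anyone checking the arithmetic, here is a minimal sketch of the comparison. It assumes the 0.218 figure is read correctly from Table 2, and the variable names are purely illustrative:

```python
# Figures quoted in this issue; 0.218 is the value as read from Table 2
# of the paper (an assumption of this post, not re-verified here).
reported_ndcg10 = 0.218
# ndcg_cut_10 from the trec_eval output of the Pyserini tuned BM25 run.
pyserini_ndcg10 = 0.2340

# Absolute gap, expressed in points (hundredths of nDCG).
gap_points = (pyserini_ndcg10 - reported_ndcg10) * 100
# Relative gain over the reported baseline, in percent.
relative_gain = (pyserini_ndcg10 - reported_ndcg10) / reported_ndcg10 * 100

print(f"absolute gap:  {gap_points:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

The absolute gap looks small, but it is a relative difference of several percent, which matters when the baseline is used to judge other systems.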

I suspect all the BM25 results should be a bit higher, based on our experience: https://arxiv.org/abs/2104.05740

For many of the other datasets with dense labels, a competitive baseline, and one widely acknowledged as such in the IR community, would be something like BM25+RM3.

We would be happy to work with you on building out Pyserini as the competitive baseline for this task. Please reach out!

May 12 '21 14:05 lintool