Low BM25 baselines?
Hi there, thanks for providing this nice resource!
Looking at your paper, I think your BM25 baselines are a bit low. If I'm not mistaken, Table 2 reports 0.218 nDCG@10 on MS MARCO.
With Pyserini (https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) we get the following, which has been widely reproduced:
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap -m ndcg_cut.10 collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec
map all 0.1957
recall_1000 all 0.8573
ndcg_cut_10 all 0.2340
So, 1.6 points higher (0.2340 vs. 0.218)?
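For anyone who wants to check the arithmetic, here is a minimal sketch that only restates the two numbers quoted above (0.218 from Table 2, 0.2340 from the trec_eval output). The `gap` helper and the variable names are just illustrative, not part of Pyserini or trec_eval, and the relative figure is an extra way of looking at the same gap:

```python
# Values quoted in this thread; nothing here is re-measured.
reported_ndcg10 = 0.218    # BM25 nDCG@10 as read from Table 2 of the paper
pyserini_ndcg10 = 0.2340   # ndcg_cut_10 from the trec_eval output above


def gap(reported, reproduced):
    """Return (absolute gap in points, relative gap in percent)."""
    absolute_points = (reproduced - reported) * 100
    relative_percent = (reproduced - reported) / reported * 100
    return absolute_points, relative_percent


absolute_points, relative_percent = gap(reported_ndcg10, pyserini_ndcg10)
print(f"absolute: {absolute_points:.1f} points, relative: {relative_percent:.1f}%")
# prints: absolute: 1.6 points, relative: 7.3%
```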
I suspect all the BM25 results should be a bit higher, based on our experience: https://arxiv.org/abs/2104.05740
For many of the other datasets with dense labels, a competitive baseline, and one widely acknowledged as such in the IR community, would be something like BM25+RM3.
We would be happy to work with you on building out Pyserini as the competitive baseline for this task... Please reach out!