Question about MLDR Evaluation Metrics in ModernBERT Paper
Hi, I'm working with the MLDR dataset and trying to reproduce the results from the ModernBERT paper. In Table 3, they report an MLDR-EN score of 44.0 for their model, but I'm getting different metrics (for MLDR_OOD):
MRR@10: 0.746
NDCG@10: 0.781
Accuracy@1: 0.670
MAP@10: 0.746

This is after training on MS MARCO and evaluating on the MLDR-EN dev set. I'm using the InformationRetrievalEvaluator from sentence-transformers.
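For reference, here's roughly what my evaluation script does. This is just a sketch: the `Shitao/MLDR` dataset name, the `en` config, and the `query_id`/`positive_passages`/`negative_passages` field names are what I'm assuming from the Hub dataset card, and the model path is a placeholder.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Placeholder: my model fine-tuned on MS MARCO
model = SentenceTransformer("path/to/msmarco-finetuned-model")

# Assumed dataset/config/field names for the English MLDR dev split
dev = load_dataset("Shitao/MLDR", "en", split="dev")

queries, corpus, relevant_docs = {}, {}, {}
for row in dev:
    qid = str(row["query_id"])
    queries[qid] = row["query"]
    relevant_docs[qid] = set()
    # Collect the passages attached to each dev query into the corpus
    for p in row["positive_passages"]:
        corpus[str(p["docid"])] = p["text"]
        relevant_docs[qid].add(str(p["docid"]))
    for p in row["negative_passages"]:
        corpus[str(p["docid"])] = p["text"]

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[10],
    accuracy_at_k=[1],
    name="mldr-en-dev",
)
print(evaluator(model))
```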
Could someone clarify:
Which metric was used for the 44.0 score in the paper?
Is there a specific evaluation setup I should be using for MLDR?

Thanks in advance!
Hey, I'm so sorry this took us ages to actually get to, tons of plates spinning!
The metric we used is NDCG@10. I'm going to double-check our scripts; however, I'm wondering if there might be something wrong with your eval scripts. As a sanity check, I looked at common MLDR results, such as the ones reported by BGE-M3:
A score of ~44 (i.e., NDCG@10 of 0.44) with very moderate training is more in line with what we'd expect, while your 0.781 (78.1 on the same scale) would be almost state-of-the-art and better than all dense, specifically trained embedding models!
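One thing that might be worth double-checking on your side (just a guess, not a confirmed diagnosis): whether retrieval runs over the full MLDR-EN corpus rather than only the passages attached to the dev queries, since restricting the corpus will inflate NDCG@10 considerably. A rough sketch, assuming the corpus config is named `corpus-en` with `docid`/`text` fields:

```python
from datasets import load_dataset

# Assumed config/field names; use this corpus dict in the evaluator
# so retrieval runs over all English MLDR documents, not just dev passages.
full_corpus = load_dataset("Shitao/MLDR", "corpus-en", split="corpus")
corpus = {str(row["docid"]): row["text"] for row in full_corpus}
```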