
Strange NDCG@10 for Touche-2020 on the BEIR leaderboard

Open thigm85 opened this issue 3 years ago • 6 comments

I noticed that the NDCG@10 for Touche-2020 on the BEIR leaderboard is around 0.60 for Elasticsearch BM25.

Is it correct to assume that Touche-2020 is represented by the dataset named "webis-touche2020"? If so, I just ran Elasticsearch BM25 on it and got an NDCG@10 of around 0.35, which is similar to what I got with Vespa.

Any thoughts?

thigm85 avatar Feb 25 '22 13:02 thigm85

I recently ran into the same problem. It comes from there being two versions of webis-touche2020: the newer one, which is used now (around 0.35), and the older one, which gives better scores (0.6). In issue #11 it seems the older version was kept, but in issue #40 the new version seems to have taken over as the default, making some numbers in the benchmark obsolete (mostly the reranking page; the sparse and dense pages seem to use the new version).

cadurosar avatar Feb 25 '22 13:02 cadurosar

Got it. Thanks for the reply @cadurosar.

thigm85 avatar Feb 25 '22 13:02 thigm85

Does the same issue happen with msmarco? I just ran Elasticsearch BM25 on msmarco and the NDCG@10 is around 0.45 instead of the 0.22 reported on the BEIR leaderboard. Is that correct @NThakur20?

thigm85 avatar Feb 28 '22 19:02 thigm85

Hi, @thigm85 and @cadurosar,

Yes, the webis-touche authors contacted us about problems in v1 of their dataset, so we kept the scores for v2 (which has no annotation errors). As @cadurosar mentioned, some scores on the leaderboard might not have been updated yet. The leaderboard is being revamped and will soon show the latest scores.

Regarding MSMARCO, I think you evaluated on the test set, which is probably why you get an NDCG@10 of around 0.45. You should evaluate on the dev set instead, where you should get a score identical to the one on the leaderboard.

Kind Regards, Nandan Thakur

thakur-nandan avatar Mar 01 '22 19:03 thakur-nandan

Hi @NThakur20, thanks for the clarification. Does the leaderboard use the dev set for all the datasets or only for MS MARCO?

thigm85 avatar Mar 02 '22 10:03 thigm85

Dev set only for MSMARCO; the rest use the test sets.

thakur-nandan avatar Mar 03 '22 16:03 thakur-nandan
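
For anyone reproducing leaderboard numbers, the split rule from the last reply can be captured in a small helper. This is only a sketch: the function name `evaluation_split` is hypothetical and not part of BEIR, and the dataset identifiers are the ones used in this thread.

```python
# Hypothetical helper (not part of BEIR): picks the qrels split that the
# leaderboard scores are reported on, following the rule stated in this thread.
def evaluation_split(dataset_name: str) -> str:
    """Return "dev" for MS MARCO and "test" for every other dataset."""
    return "dev" if dataset_name.lower() == "msmarco" else "test"


for name in ("msmarco", "webis-touche2020"):
    print(name, "->", evaluation_split(name))
```

The returned string is meant to be passed as the `split` argument when loading the queries and qrels, so that MS MARCO is scored on dev and everything else on test.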