Strange NDCG@10 for Touche-2020 on the BEIR leaderboard
I noticed that the NDCG@10 for Touche-2020 on the BEIR leaderboard is around 0.60 for Elasticsearch BM25.
Is it correct to assume that Touche-2020 is represented by the dataset named "webis-touche2020"? If so, I just ran Elasticsearch BM25 on it and got an NDCG@10 of around 0.35, which is similar to what I got with Vespa.
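For reference, this is roughly the evaluation I ran; a minimal sketch following BEIR's standard BM25 example (the Elasticsearch hostname and index name are placeholders for my local setup):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

# Download and load the webis-touche2020 dataset (test split).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/webis-touche2020.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# BM25 retrieval against a local Elasticsearch instance.
model = BM25(index_name="webis-touche2020", hostname="localhost:9200", initialize=True)
retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)

# NDCG@10 is in the first returned dict, e.g. ndcg["NDCG@10"].
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```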
Any thoughts?
Recently I had the same problem as you. It comes from the fact that there are two versions of webis-touche2020: the newer one is the one used now (around 0.35), while the older one had better scores (around 0.6). In issue #11 it seems that the older version was kept, but then in issue #40 the new version seems to have taken over as the default, making some numbers on the benchmark obsolete (mostly the reranking page; sparse and dense seem to be using the new version).
Got it. Thanks for the reply @cadurosar.
Does the same issue happen with MSMARCO? I just ran Elasticsearch BM25 with MSMARCO and the NDCG@10 is around 0.45 instead of the 0.22 reported on the BEIR leaderboard. Is that correct @NThakur20?
Hi, @thigm85 and @cadurosar,
Yes, the webis-touche authors contacted us about problems in the v1 version of their dataset, so we kept scores on the v2 version (which has no annotation errors). Some scores on the leaderboard might not have been updated yet, as @cadurosar mentioned. The leaderboard is being revamped and will soon have the latest updated scores.
Regarding MSMARCO, I think you evaluated the test set. That is probably why you get an NDCG@10 of around 0.45. You should evaluate the dev set instead, where you should get a score identical to the one on the leaderboard.
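As a minimal sketch of the difference (assuming the BEIR data loader; only the `split` argument changes):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip"
data_path = util.download_and_unzip(url, "datasets")

# The leaderboard evaluates MSMARCO on the dev split;
# loading split="test" yields a different NDCG@10.
corpus, queries, qrels = GenericDataLoader(data_path).load(split="dev")
```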
Kind Regards, Nandan Thakur
Hi @NThakur20, thanks for the clarification. Does the leaderboard use the dev set for all the datasets or only for MS MARCO?
Dev set only for MSMARCO; the rest are the test sets.