Strange NDCG@10 for Touche-2020 on the BEIR leaderboard
I noticed that the NDCG@10 for Touche-2020 on the BEIR leaderboard is around 0.60 for Elasticsearch BM25.
Is it correct to assume that Touche-2020 is represented by the dataset named "webis-touche2020"? If so, I just ran Elasticsearch BM25 on it and got an NDCG@10 of around 0.35, which is similar to what I got with Vespa.
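For reference, this is roughly the evaluation I ran; a minimal sketch following BEIR's standard BM25 example (the Elasticsearch hostname and index name are placeholders for my local setup):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

# Download and load the webis-touche2020 dataset (test split).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/webis-touche2020.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# BM25 retrieval against a local Elasticsearch instance.
model = BM25(index_name="webis-touche2020", hostname="localhost:9200", initialize=True)
retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)

# NDCG@10 is in the first returned dict, e.g. ndcg["NDCG@10"].
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```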
Any thoughts?
Recently I had the same problem as you. It comes from the fact that there are two versions of webis-touche2020: the newer one is the one used now (around 0.35), while the older one had better scores (around 0.6). In issue #11 it seems that the older version was kept, but then in issue #40 the new version seems to have taken over as the default, making some numbers on the benchmark obsolete (mostly the reranking page; sparse and dense seem to be using the new version).
Got it. Thanks for the reply @cadurosar.
Does the same issue happen with MSMARCO? I just ran Elasticsearch BM25 with MSMARCO and the NDCG@10 is around 0.45 instead of the 0.22 reported on the BEIR leaderboard. Is that correct @NThakur20?
Hi, @thigm85 and @cadurosar,
Yes, the webis-touche authors contacted us about problems in the v1 version of their dataset, so we kept scores on the v2 version (which has no annotation errors). Some scores on the leaderboard might not have been updated yet, as @cadurosar mentioned. The leaderboard is being revamped and will soon have the latest updated scores.
Regarding MSMARCO, I think you evaluated the test set. That is probably why you get an NDCG@10 of around 0.45. You should evaluate the dev set instead, where you should get a score identical to the one on the leaderboard.
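As a minimal sketch of the difference (assuming the BEIR data loader; only the `split` argument changes):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip"
data_path = util.download_and_unzip(url, "datasets")

# The leaderboard evaluates MSMARCO on the dev split;
# loading split="test" yields a different NDCG@10.
corpus, queries, qrels = GenericDataLoader(data_path).load(split="dev")
```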
Kind Regards, Nandan Thakur
Hi @NThakur20, thanks for the clarification. Does the leaderboard use the dev set for all the datasets or only for MS MARCO?
Dev set only for MSMARCO; the rest are the test sets.