
Rerank scores lower than vanilla dense IR?

pablogranolabar opened this issue on Nov 30 '21 · 8 comments

Hi,

I've got a dense IR pipeline with reranking running for a search engine application. However, my rerank scores are lower than those from a plain dense IR run.

Bi-encoder: msmarco-distilbert-base-v3
Cross-encoder: ms-marco-electra-base
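
For context, below is a minimal sketch of what such a two-stage BEIR pipeline typically looks like; the dataset path, batch sizes, and score function here are assumptions, not the exact setup used in this issue.

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

# Load a BEIR-format dataset (corpus.jsonl, queries.jsonl, qrels) -- the path is a placeholder.
corpus, queries, qrels = GenericDataLoader("path/to/custom-dataset").load(split="test")

# Stage 1: dense retrieval with the bi-encoder.
dense_model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(dense_model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

# Stage 2: rerank the top-100 hits per query with the cross-encoder.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-electra-base")
reranker = Rerank(cross_encoder, batch_size=128)
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)

# Evaluate both stages against the same qrels.
print(retriever.evaluate(qrels, results, retriever.k_values))
print(retriever.evaluate(qrels, rerank_results, retriever.k_values))
```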

Scores:

| Metric | Dense IR | Dense IR + Re-Rank |
|---|---|---|
| NDCG@1 | 0.3629 | 0.3538 |
| NDCG@3 | 0.5234 | 0.5170 |
| NDCG@5 | 0.5472 | 0.5401 |
| NDCG@10 | 0.5623 | 0.5540 |
| NDCG@100 | 0.5879 | 0.5812 |
| NDCG@1000 | 0.5965 | 0.5812 |
| MAP@1 | 0.3629 | 0.3538 |
| MAP@3 | 0.4844 | 0.4774 |
| MAP@5 | 0.4977 | 0.4903 |
| MAP@10 | 0.5040 | 0.4961 |
| MAP@100 | 0.5090 | 0.5013 |
| MAP@1000 | 0.5093 | 0.5013 |
| Recall@1 | 0.3629 | 0.3538 |
| Recall@3 | 0.6362 | 0.6315 |
| Recall@5 | 0.6932 | 0.6869 |
| Recall@10 | 0.7397 | 0.7297 |
| Recall@100 | 0.8627 | 0.8618 |
| Recall@1000 | 0.9310 | 0.8618 |
| P@1 | 0.3629 | 0.3538 |
| P@3 | 0.2121 | 0.2105 |
| P@5 | 0.1386 | 0.1374 |
| P@10 | 0.0740 | 0.0730 |
| P@100 | 0.0086 | 0.0086 |
| P@1000 | 0.0009 | 0.0009 |

Any thoughts would be greatly appreciated.

pablogranolabar · Nov 30 '21

Hi @pablogranolabar,

This is indeed strange. Thanks for sharing these values.

  1. Could you share which dataset you found these numbers on? Is this a custom dataset of yours?
  2. How many documents did you rerank after retrieving with msmarco-distilbert-base-v3?

Also, could you try the ms-marco-MiniLM-L-6-v2 (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) model? It is a stronger model than ms-marco-electra-base.
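
Swapping the cross-encoder is a one-line change in the reranking step; a sketch, assuming the `Rerank` setup from the earlier example:

```python
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

# Swap in the MiniLM cross-encoder; the rest of the pipeline stays unchanged.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
reranker = Rerank(cross_encoder, batch_size=128)
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)
```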

Kind Regards, Nandan Thakur

thakur-nandan · Nov 30 '21

Hi @NThakur20, thanks for making your work available and for the speedy reply!

Yes, this is a custom dataset: a collection of search-engine queries, such as returning company information for a ticker.

For reranking, I used the default, which I believe is 100 documents.

I will check out MiniLM next, thanks for the help!

pablogranolabar · Dec 01 '21

Hi again @NThakur20, I swapped out the cross-encoder for ms-marco-MiniLM-L-6-v2, but I am still getting lower re-rank scores than dense IR alone. Any thoughts?

pablogranolabar · Dec 03 '21

Hi @pablogranolabar,

Could you manually evaluate the top-k documents (say, k=10) and check whether the results are as expected? One possible reason could be how the test data was annotated.
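
For example, a quick way to spot-check the top-10 reranked hits per query could look like the sketch below, assuming the `queries`, `corpus`, `qrels`, and `rerank_results` structures from the earlier example:

```python
# Print the top-10 reranked documents for a few queries, next to the annotated qrels,
# to sanity-check that the annotations and the reranker's preferences line up.
for qid in list(queries)[:5]:
    print(f"\nQuery: {queries[qid]}")
    print(f"Relevant doc ids (qrels): {list(qrels.get(qid, {}))}")
    top10 = sorted(rerank_results[qid].items(), key=lambda x: x[1], reverse=True)[:10]
    for rank, (doc_id, score) in enumerate(top10, start=1):
        title = corpus[doc_id].get("title", "")
        print(f"{rank:2d}. {score:7.3f}  {doc_id}  {title[:80]}")
```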

Could you also share a snippet of your code or pseudocode so I can check whether everything is working as expected?

Kind Regards, Nandan Thakur

thakur-nandan · Dec 03 '21

Hi @NThakur20, yes, I've experimented with lower k values all the way down to 10, as well as varying batch sizes. The results are pretty much the same: rerank scores are lower across the board than dense IR alone. The dataset is pretty small though, just about 13K search queries and their expected results. Do you think that could be a major factor here?

Also, how important would hyperparameter optimization be in this scenario? I've been thinking about putting together an RL environment for that to increase precision, which is low, while recall and the other scores are consistently high.

pablogranolabar · Dec 04 '21

Hi @pablogranolabar, maybe try Elasticsearch (BM25) as the first-stage retriever and then rerank the top-k with the above-mentioned cross-encoder?

In our publication, we found the lexical retrieval + CE reranking combination to work well.
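
A rough sketch of that BM25 + CE setup with BEIR; the Elasticsearch hostname and index name below are placeholders, and `corpus`, `queries`, `qrels` come from the same dataset loading as before:

```python
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

# Stage 1: lexical retrieval with BM25 via a running Elasticsearch instance.
bm25 = BM25(index_name="custom-dataset", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(bm25)
results = retriever.retrieve(corpus, queries)

# Stage 2: rerank the BM25 top-100 with the cross-encoder.
reranker = Rerank(CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2"), batch_size=128)
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)

# Evaluate BM25 alone vs. BM25 + CE against the same qrels.
print(retriever.evaluate(qrels, results, retriever.k_values))
print(retriever.evaluate(qrels, rerank_results, retriever.k_values))
```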

thakur-nandan · Dec 14 '21

@thakur-nandan I am experimenting with BM25 + CE on TREC-NEWS, TREC-COVID, and NQ. However, for TREC-COVID I am getting lower re-ranking performance than the BM25 scores when using ms-marco-MiniLM-L-6-v2 as a zero-shot re-ranker. Do I have to fine-tune it again? Does the BM25+CE column in your paper's results table report scores after fine-tuning MiniLM, or zero-shot performance?

cramraj8 · May 18 '23

I just realized that after combining title + text into a single multi-field text and re-ranking, I was able to reproduce the scores reported in the paper.
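
For anyone hitting the same issue, a hypothetical sketch of what that title + text combination might look like with a BEIR-style corpus dict (variable names follow the earlier sketches; this is an illustration, not the exact code used):

```python
# Hypothetical: merge the title into the text field so the cross-encoder scores
# the full document content during reranking.
corpus_combined = {
    doc_id: {
        "title": "",
        "text": (doc.get("title", "") + " " + doc.get("text", "")).strip(),
    }
    for doc_id, doc in corpus.items()
}
rerank_results = reranker.rerank(corpus_combined, queries, results, top_k=100)
```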

cramraj8 · May 18 '23