Rerank scores lower than vanilla dense IR?
Hi,
I've got a dense IR pipeline with cross-encoder reranking running for a search engine application. However, my rerank scores come out lower than a plain dense IR run.
- Bi-encoder: msmarco-distilbert-base-v3
- Cross-encoder: ms-marco-electra-base
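For context, the pipeline roughly follows the standard BEIR dense-retrieval-then-rerank recipe, something like the sketch below (just a sketch; the data path and batch sizes are placeholders, not my exact setup):

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

# Load the custom dataset in BEIR format: corpus, queries, qrels
corpus, queries, qrels = GenericDataLoader(data_folder="datasets/my-dataset").load(split="test")

# Stage 1: dense retrieval with the bi-encoder
dense_model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(dense_model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

# Stage 2: cross-encoder reranking of the top-100 hits per query
reranker = Rerank(CrossEncoder("cross-encoder/ms-marco-electra-base"), batch_size=128)
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)

# Evaluate both stages against the same qrels
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
ndcg_rr, map_rr, recall_rr, precision_rr = retriever.evaluate(qrels, rerank_results, retriever.k_values)
```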
Scores (Dense IR run: 2021-11-30 16:48:39; Dense IR + Re-Rank run: 2021-11-30 16:56:16):

| Metric      | Dense IR | Dense IR + Re-Rank |
|-------------|----------|--------------------|
| NDCG@1      | 0.3629   | 0.3538 |
| NDCG@3      | 0.5234   | 0.5170 |
| NDCG@5      | 0.5472   | 0.5401 |
| NDCG@10     | 0.5623   | 0.5540 |
| NDCG@100    | 0.5879   | 0.5812 |
| NDCG@1000   | 0.5965   | 0.5812 |
| MAP@1       | 0.3629   | 0.3538 |
| MAP@3       | 0.4844   | 0.4774 |
| MAP@5       | 0.4977   | 0.4903 |
| MAP@10      | 0.5040   | 0.4961 |
| MAP@100     | 0.5090   | 0.5013 |
| MAP@1000    | 0.5093   | 0.5013 |
| Recall@1    | 0.3629   | 0.3538 |
| Recall@3    | 0.6362   | 0.6315 |
| Recall@5    | 0.6932   | 0.6869 |
| Recall@10   | 0.7397   | 0.7297 |
| Recall@100  | 0.8627   | 0.8618 |
| Recall@1000 | 0.9310   | 0.8618 |
| P@1         | 0.3629   | 0.3538 |
| P@3         | 0.2121   | 0.2105 |
| P@5         | 0.1386   | 0.1374 |
| P@10        | 0.0740   | 0.0730 |
| P@100       | 0.0086   | 0.0086 |
| P@1000      | 0.0009   | 0.0009 |
Any thoughts would be greatly appreciated.
Hi @pablogranolabar,
This is indeed strange. Thanks for sharing these values.
- Could you share which dataset these numbers are from? Is this a custom dataset of yours?
- How many documents did you rerank after retrieving with msmarco-distilbert-base-v3?

Also, could you try the ms-marco-MiniLM-L-6-v2 (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) model? It is a stronger model than ms-marco-electra-base.
Kind Regards, Nandan Thakur
Hi @NThakur20, thanks for making your work available and for the speedy reply!
Yes, this is a custom dataset: a collection of search engine queries, such as returning company information for a ticker symbol.
For reranking, I used the default, which I believe is 100 documents.
I will check out MiniLM next, thanks for the help!
Hi again @NThakur20, I swapped the cross-encoder for ms-marco-MiniLM-L-6-v2, but I am still getting subpar re-rank scores after dense IR. Any thoughts?
Hi @pablogranolabar,
Could you manually evaluate the top-k documents (say, for k=10) and check whether the results look as expected? One possible reason could be how the test data was annotated.
Could you also share a snippet of your code or pseudocode so I can check whether everything is working as expected?
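For a quick spot check, something along these lines would work (a rough sketch, assuming your reranked results map each query id to a `{doc_id: score}` dict, as in the BEIR examples):

```python
# Quick manual inspection: print the top-10 reranked documents for a few
# queries and mark with * the ones that are relevant according to qrels.
for query_id in list(queries)[:5]:
    print(f"\nQuery: {queries[query_id]}")
    relevant = set(qrels.get(query_id, {}))
    top_docs = sorted(rerank_results[query_id].items(), key=lambda x: x[1], reverse=True)[:10]
    for rank, (doc_id, score) in enumerate(top_docs, start=1):
        marker = "*" if doc_id in relevant else " "
        title = corpus[doc_id].get("title", "")
        print(f"{rank:2d}. [{marker}] {score:.4f} {doc_id} {title[:80]}")
```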
Kind Regards, Nandan Thakur
Hi @NThakur20, yes, I've experimented with lower k values all the way down to 10, as well as varying batch sizes. The results are pretty much the same: rerank scores are lower across the board than dense IR. The dataset is quite small, though, just about 13K search queries and their expected results. Do you think that could be a large factor here?
Also, how important would hyperparameter optimization be in this scenario? I've been thinking about putting together an RL environment for that to increase precision, which is low, while recall and the other scores are consistently high.
Hi @pablogranolabar, maybe try Elasticsearch (BM25) as the first-stage retriever and then rerank the top-k with the above-mentioned cross-encoder?
In our publication, we found the lexical retrieval + CE rerank combination to work well.
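Roughly, the lexical first stage would look like this in BEIR (a sketch, assuming a local Elasticsearch instance; the index name is just a placeholder):

```python
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

# Stage 1: BM25 retrieval via a running Elasticsearch instance
bm25 = BM25(index_name="my-custom-index", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(bm25)
results = retriever.retrieve(corpus, queries)

# Stage 2: cross-encoder rerank of the BM25 top-100
reranker = Rerank(CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2"), batch_size=128)
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)
```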
@thakur-nandan I am experimenting with BM25 + CE for TREC-NEWS, TREC-COVID, and NQ. However, for TREC-COVID I am getting lower re-ranking performance than the BM25 scores when using ms-marco-MiniLM-L-6-v2
as a zero-shot re-ranker. Do I have to fine-tune it again? Does the BM25+CE column in your paper's results table report scores after fine-tuning MiniLM, or zero-shot performance?
I just realized that after combining title + text into a single multi-field passage and re-ranking, I was able to reproduce the scores reported in the paper.
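In case it helps others, this is roughly what "combining title + text" amounts to when building the cross-encoder inputs (a hypothetical helper for illustration, not the exact code I ran):

```python
# Sketch: build the cross-encoder passage from both fields instead of text only.
def make_passage(doc: dict) -> str:
    title = doc.get("title", "").strip()
    text = doc.get("text", "").strip()
    return f"{title} {text}".strip() if title else text

# Query-passage pairs for the cross-encoder over the first-stage top-k results
pairs = [[queries[qid], make_passage(corpus[doc_id])]
         for qid in results
         for doc_id in results[qid]]
```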