pyserini
FaissSearcher.batch_search() behaves differently with different k
I was doing retrieval with the prebuilt index wikipedia-dpr-multi-bf, but I found that it returns a different retrieval list for different values of k.
Here is an example:
from pyserini.search.faiss import FaissSearcher
from pyserini.search.faiss.__main__ import init_query_encoder
query_encoder = init_query_encoder('facebook/dpr-question_encoder-multiset-base', None, None, None, None, 'cpu', None)
searcher = FaissSearcher.from_prebuilt_index('wikipedia-dpr-multi-bf', query_encoder)
result1 = searcher.batch_search(['where do they grow hops in the us'], ['26'], 100)
result2 = searcher.batch_search(['where do they grow hops in the us'], ['26'], 1000)
for hits1, hits2 in zip(result1.values(), result2.values()):
    for hit1, hit2 in zip(hits1, hits2[:100]):
        print(hit1.docid, hit2.docid)
The output is:
194495 194495
194497 194497
18163020 18163020
18163021 18163021
194496 194496
18163022 18163022
7754014 7754014
10259614 10259614
7269387 7269387
5323698 5323698
194503 194503
15906587 15906587
19894927 19894927
13703150 13703150
5323755 5323755
20256491 20256484
20256484 20256491
1383191 1383191
194489 194489
11539557 11539557
14431781 14431781
16848464 16848464
3168309 3168309
18607106 18607106
11539559 11539559
2651184 2651184
10244141 10244141
16848469 16848469
13427943 13427943
19959197 19959197
1012327 1012327
18899400 18899400
13427939 13427939
5749300 5749300
20090627 20090627
6009699 6009699
14232011 14232011
14232010 14232010
418178 418178
194519 194519
17206816 17206816
18163023 18163023
13703151 13703151
6009698 6009698
16647718 16647718
14795977 14795977
18163024 18163024
14951761 14951761
18715225 18715225
605137 605137
11765436 11765436
5486058 5486058
19959193 19959193
9064921 9064921
10311065 10311065
13634762 13634762
17086913 17086913
9850631 9850631
8875007 8875007
8875006 8875006
4052682 4052682
12130417 12130417
19959205 19959205
16351970 16351970
12825882 12825882
16848467 16848467
17420517 17420517
19538756 19538756
13211790 13211790
17334511 17334511
19059401 19059401
19249755 19249755
3168324 3168324
11942386 11942386
19059404 19059404
4930311 4930311
14599551 14599551
3710530 3710530
7229996 7229996
17340603 17340603
13728559 13728559
12100851 12100851
15352181 15352181
596729 596729
1213151 1213151
7269407 7269407
17207909 17207909
13427966 13427966
12130427 12130427
17207958 17207958
2795421 2795421
3425201 3425201
16848463 16848463
2249717 2249717
17622766 17622766
17207979 17207979
15052418 15052418
10203497 10203497
5012005 5012005
17207938 17207938
In lines 16-17 of the output, the order of 20256491 and 20256484 is swapped. This is not the only such example.
This causes metric values (e.g., Top20 and Top100) to differ from the expected ones when reproducing experiments.
Could you please explain this behavior?
Hi @zmzhang2000, could you also print out the scores for the results? I'm not sure whether this is due to a tie-breaking issue.
Thanks for your quick reply!
It seems like they share the same scores.
194495 80.10902 194495 80.10902
194497 79.05869 194497 79.05869
18163020 76.142555 18163020 76.142555
18163021 75.90108 18163021 75.90108
194496 75.15427 194496 75.15427
18163022 75.077324 18163022 75.077324
7754014 74.987 7754014 74.987
10259614 74.72838 10259614 74.72838
7269387 74.68956 7269387 74.68956
5323698 74.64845 5323698 74.64845
194503 74.53991 194503 74.53991
15906587 74.383934 15906587 74.383934
19894927 74.26529 19894927 74.26529
13703150 74.14798 13703150 74.14798
5323755 74.00522 5323755 74.00522
20256491 73.97757 20256484 73.97757
20256484 73.97757 20256491 73.97757
1383191 73.96136 1383191 73.96136
194489 73.93675 194489 73.93675
11539557 73.935005 11539557 73.935005
14431781 73.901535 14431781 73.901535
16848464 73.86478 16848464 73.86478
3168309 73.796364 3168309 73.796364
18607106 73.557365 18607106 73.557365
11539559 73.39557 11539559 73.39557
2651184 73.24763 2651184 73.24763
10244141 73.23715 10244141 73.23715
16848469 73.22422 16848469 73.22422
13427943 73.151505 13427943 73.151505
19959197 73.07906 19959197 73.07906
1012327 73.0388 1012327 73.0388
18899400 72.985344 18899400 72.985344
13427939 72.917854 13427939 72.917854
5749300 72.89821 5749300 72.89821
20090627 72.863235 20090627 72.863235
6009699 72.85675 6009699 72.85675
14232011 72.82599 14232011 72.82599
14232010 72.82599 14232010 72.82599
418178 72.728874 418178 72.728874
194519 72.70442 194519 72.70442
17206816 72.675606 17206816 72.675606
18163023 72.64918 18163023 72.64918
13703151 72.624115 13703151 72.624115
6009698 72.60811 6009698 72.60811
16647718 72.566086 16647718 72.566086
14795977 72.56355 14795977 72.56355
18163024 72.56332 18163024 72.56332
14951761 72.5491 14951761 72.5491
18715225 72.54545 18715225 72.54545
605137 72.544685 605137 72.544685
11765436 72.482956 11765436 72.482956
5486058 72.47974 5486058 72.47974
19959193 72.45922 19959193 72.45922
9064921 72.40175 9064921 72.40175
10311065 72.4015 10311065 72.4015
13634762 72.348 13634762 72.348
17086913 72.347115 17086913 72.347115
9850631 72.34234 9850631 72.34234
8875007 72.3328 8875007 72.3328
8875006 72.31591 8875006 72.31591
4052682 72.312225 4052682 72.312225
12130417 72.2637 12130417 72.2637
19959205 72.172966 19959205 72.172966
16351970 72.1553 16351970 72.1553
12825882 72.14318 12825882 72.14318
16848467 72.13913 16848467 72.13913
17420517 72.13715 17420517 72.13715
19538756 72.12948 19538756 72.12948
13211790 72.1025 13211790 72.1025
17334511 72.094055 17334511 72.094055
19059401 72.08648 19059401 72.08648
19249755 72.06686 19249755 72.06686
3168324 72.04875 3168324 72.04875
11942386 72.04475 11942386 72.04475
19059404 72.0204 19059404 72.0204
4930311 72.00751 4930311 72.00751
14599551 71.98241 14599551 71.98241
3710530 71.97727 3710530 71.97727
7229996 71.95979 7229996 71.95979
17340603 71.953964 17340603 71.953964
13728559 71.93792 13728559 71.93792
12100851 71.92848 12100851 71.92848
15352181 71.924774 15352181 71.924774
596729 71.86498 596729 71.86498
1213151 71.84466 1213151 71.84466
7269407 71.83806 7269407 71.83806
17207909 71.82814 17207909 71.82814
13427966 71.80788 13427966 71.80788
12130427 71.80416 12130427 71.80416
17207958 71.80114 17207958 71.80114
2795421 71.757256 2795421 71.757256
3425201 71.73196 3425201 71.73196
16848463 71.718925 16848463 71.718925
2249717 71.71633 2249717 71.71633
17622766 71.69812 17622766 71.69812
17207979 71.65542 17207979 71.65542
15052418 71.55997 15052418 71.55997
10203497 71.54793 10203497 71.54793
5012005 71.54069 5012005 71.54069
17207938 71.51498 17207938 71.51498
Yeah, then I think it's just a tie-breaking issue. The reason a different top-k gives a different order is probably the Faiss implementation. A common way to handle this is to sort by document id when the scores are the same. I'll fix this issue in pyserini too.
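For anyone who needs a workaround in the meantime, here is a minimal sketch of the tie-breaking idea described above. It is not the actual pyserini fix. The Hit class and the break_ties helper are hypothetical stand-ins; the only assumption is that each result exposes docid and score, as in the example code earlier in this thread. The docids and scores are copied from the reported output, trimmed to the entries around the tie.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    # Hypothetical stand-in for a search result; only docid and score are used.
    docid: str
    score: float


def break_ties(hits: List[Hit]) -> List[Hit]:
    # Sort by descending score; among exactly equal scores, by ascending docid.
    return sorted(hits, key=lambda h: (-h.score, h.docid))


# The two orderings from the report (k=100 vs. k=1000), trimmed to the tie.
run_k100 = [Hit('5323755', 74.00522), Hit('20256491', 73.97757),
            Hit('20256484', 73.97757), Hit('1383191', 73.96136)]
run_k1000 = [Hit('5323755', 74.00522), Hit('20256484', 73.97757),
             Hit('20256491', 73.97757), Hit('1383191', 73.96136)]

ids_k100 = [h.docid for h in break_ties(run_k100)]
ids_k1000 = [h.docid for h in break_ties(run_k1000)]
print(ids_k100 == ids_k1000)
```

Note that the docids are compared as strings here, so the tie order is lexicographic rather than numeric. Either choice works as long as it is applied consistently to every run being compared.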