SearchCollection Tie-Breaking

Open manveertamber opened this issue 2 years ago • 0 comments

While building a QA regression on the DPR Wikipedia 100-word splits corpus, I found that Top-K accuracy can differ in the fourth decimal place depending on the format of the ids used in the corpus before indexing and searching. Using numbered ids gives slightly different scores than using ids of the form "doc_id#segment_id", e.g. id:"10" vs. id:"9#1".

| id format | Natural Questions Test: top_20_accuracy |
|---|---|
| numbered | 0.6294 |
| doc_id#segment_id | 0.6296 |

This seems to be caused by how ties are broken in SearchCollection: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/SearchCollection.java#L146. With the example ids above, "10" < "2" < "9#1" lexicographically, but the same passage is assigned either "10" or "9#1" depending on the id format, so its position among equally scored hits changes with the format.
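To make the effect concrete, here is a minimal Python sketch. It is not Anserini's code. It assumes that hits are ordered by score (descending) and that ties are then broken by comparing ids as strings (ascending). The passage names, the ids, and the `rank` helper are made up for illustration.

```python
def rank(hits, k):
    """Return the top-k (docid, score) pairs.

    Assumed ordering: score descending, then docid ascending by
    string comparison (a stand-in for the tie-break discussed above).
    """
    return sorted(hits, key=lambda h: (-h[1], h[0]))[:k]


# The same two passages under two hypothetical id formats.
numbered = {"9": "passage_a", "10": "passage_b"}
segmented = {"2#3": "passage_a", "2#4": "passage_b"}

tied_score = 12.5  # both passages receive exactly the same score

top_numbered = rank([(docid, tied_score) for docid in numbered], k=1)
top_segmented = rank([(docid, tied_score) for docid in segmented], k=1)

# "10" < "9" as strings, so passage_b wins the tie with numbered ids;
# "2#3" < "2#4", so passage_a wins with doc_id#segment_id ids.
print(numbered[top_numbered[0][0]])    # passage_b
print(segmented[top_segmented[0][0]])  # passage_a
```

If only one of the tied passages contains the answer and the tie falls on the top-k cutoff, the two id formats count that question differently, which would be enough to shift top-k accuracy slightly.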

manveertamber, Jul 13 '22 00:07