SearchCollection Tie-Breaking
While building a regression for QA with the DPR Wikipedia 100-word splits corpus, I found that Top-K accuracy can differ in the fourth decimal place depending on the format of the ids used in the corpus before indexing and searching. Using a numbered id gives slightly different scores than using an id of the form "doc_id#segment_id", e.g. id:"10" vs. id:"9#1".
| id format | Natural Questions Test: top_20_accuracy |
|---|---|
| numbered | 0.6294 |
| doc_id#segment_id | 0.6296 |
This seems to be because of how ties are broken in SearchCollection: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/SearchCollection.java#L146. With the above example of ids, "10" < "2" < "9#1" lexicographically, but the same document could be assigned either "10" or "9#1", so its position among documents with tied scores depends on the id format.
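
To illustrate the effect, here is a minimal Python sketch, assuming that tied scores are resolved by comparing ids as strings in ascending order. It is not Anserini's actual code: the `rank` helper, the scores, and the second passage's ids ("2" and "1#2") are all made up for the example.

```python
def rank(hits):
    """Sort (doc_id, score) pairs by descending score, breaking ties by
    ascending lexicographic order of the id string (an assumption)."""
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# The same two passages with identical (made-up) scores under each id scheme.
# Passage A is "10" when numbered and "9#1" as doc_id#segment_id.
# Passage B is hypothetically "2" when numbered and "1#2" as doc_id#segment_id.
numbered = [("10", 1.5), ("2", 1.5)]
segmented = [("9#1", 1.5), ("1#2", 1.5)]

numbered_order = [doc_id for doc_id, _ in rank(numbered)]
segmented_order = [doc_id for doc_id, _ in rank(segmented)]

print(numbered_order)   # passage A ("10") is ranked first
print(segmented_order)  # passage B ("1#2") is ranked first
```

Under these assumptions passage A wins the tie with numbered ids and loses it with doc_id#segment_id ids. If such a tie falls right at the top-20 cutoff, a relevant passage can move in or out of the top 20, which would be enough to shift the accuracy in the fourth decimal place.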