QueryRescorer should be able to use the original sort on ties
Description
Currently the QueryRescorer uses the following logic for sorting the TopDocs after re-scoring:
Comparator<ScoreDoc> sortDocComparator =
(a, b) -> {
// Sort by score descending, then docID ascending:
if (a.score > b.score) {
return -1;
} else if (a.score < b.score) {
return 1;
} else {
// This subtraction can't overflow int
// because docIDs are >= 0:
return a.doc - b.doc;
}
};
The problem here is the else statement. So documents that have the same score will be stably sorted using their docIds, not their original sort order. So if users were using a complex sort (e.g. score, price desc) then re-scoring, the original sort (price desc) will not be used when there are ties in the new score.
This would be fixed by either removing the return a.doc - b.doc. Maybe even just changing the whole thing to (a,b) -> b.score - a.score.
If there is no original sort, then maybe keeping the original logic above makes sense. Both could be enabled by having QueryRescorer just accept an option to keep original order on ties.
@HoustonPutman When the user has provided a sort criteria then why would the score be calculated at the first place. My understanding is query X says give me results from the shards on the basis of Y sort order. Then the results would be in order of Y sort criteria. Scores won't be calculated? Could you point to a specific use case here?
The sort criteria can include the score. And in this case, we are talking about re-scoring. So consider the original sort is score, price asc, then we use the QueryRescorer. If there are ties in the new score, then we want to keep the original sort criteria of price asc. Or even just use the idea that "if the document was sorted higher originally, is should be sorted higher after rescoring given the new scores are equal". This idea works even just using sort=score.
But really, I'm not sure what's the use case for: sorting on docId for ties
Maybe something like this
private static final Comparator<ScoreDoc> SORTING_TIE_BREAKER = (o1, o2) -> {
int scoreComparison = Double.compare(o1.score, o2.score);
if (scoreComparison != 0) {
return scoreComparison;
}
int docIdComparison = Integer.compare(o1.doc, o2.doc);
if (docIdComparison != 0) {
return docIdComparison;
}
// When duplicate result found then both score and doc ID are equal (o1.score == o2.score && o1.doc == o2.doc) then return 1
return 1;
};
- If sort field value of 2 documents is same then pick the one which has high score.
- If sort field value of 2 documents is same and scores are also same then pick the one which has greater docId
- If all parameters are same then essentially both are same objects, there for choose any
Can two different docs have the same id? I don't think so. Then your example is functionally the same as it is now.
In my mind, it would be:
Comparator<ScoreDoc> sortDocComparator =
Comparator.<ScoreDoc>comparingDouble(sd -> sd.score).reversed();
if (!useStableSort) {
sortDocComparator = sortDocComparator.thenComparingInt(sd -> sd.doc);
}
I just don't think the docId comparison needs to be there. And at the very least, it should be an option to use a stableSort instead of sorting on docId for ties.
I just realized that the first part of the method resorts on docId to start.
So we would need to sort the original-order hits from the firstPassTopDocs passed to the method. Something like this:
Comparator<ScoreDoc> sortDocComparator =
Comparator.<ScoreDoc>comparingDouble(sd -> sd.score).reversed();
if (useStableSort) {
hits = firstPassTopDocs.scoreDocs.clone();
}
Since if useStableSort is false, hits is already presorted by docId, so no need to include that in the sort anyways.
We tie-break on docid so that the sort order is well-defined and stable w.r.t. multiple executions of the same query. Maybe there is some other reason like making index-time sorting more able to enable efficient querying.
At first I thought it made no sense to support a query plan where we initially sort by one complex sort criterion and then later re-sort with another complex sort criterion since we could initially use no scoring, and only after in the rescoring phase, apply the combined sort. But then it occurred to me we might take top N by the first sort and then re-sort those.
I see a problem here though which is that this rescoring is done per-segment, and the initial sort order is only every computed within a segment. We don't know what the relative order of two documents is across different segment until we merge the results, but QueryCollectorRescorer does its work per-segment, so I don't see how we can "preserve" an order that has not yet been determined. So maybe the only way forward here is to run the rescoring single-threaded after merging results from the first query.
Not sure we should continue with this, but this issue is related to https://github.com/apache/lucene/pull/13510 and that should be documented.