QueryRescorer: Use original order by default for same-score items rather than sorting by docId
Description
When using e.g. Solr's ReRankQueryParser, hits whose scores are left equal by the rescorer are sorted by their Lucene docId rather than keeping their original order. The docId order is effectively arbitrary, while the original order of same-score items may have been deliberate, driven by additional sort clauses.
This change records the original order in each hit's `shardIndex`, then uses that index ahead of the docId order to preserve the original relative ordering of same-score items.
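As a rough illustration of the approach described above (the `Hit` class and field names here are mine, not the actual Lucene API), the tie-break amounts to comparing on a recorded original position instead of docId when scores are equal:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only: record each hit's position in the original
// ranking, then break rescored-score ties on that position, not on docId.
public class TieBreakDemo {
    static final class Hit {
        final int docId;
        final int originalIndex; // position in the pre-rescore ranking
        final float score;       // score after rescoring

        Hit(int docId, int originalIndex, float score) {
            this.docId = docId;
            this.originalIndex = originalIndex;
            this.score = score;
        }
    }

    /** Orders by descending score; ties keep the original order, not docId order. */
    static int[] rescoredDocIds(Hit[] hits) {
        Hit[] sorted = hits.clone();
        Arrays.sort(sorted, Comparator
                .comparingDouble((Hit h) -> -h.score)      // best score first
                .thenComparingInt(h -> h.originalIndex));  // original-order tie-break
        int[] ids = new int[sorted.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = sorted[i].docId;
        }
        return ids;
    }

    static int[] demo() {
        // docId 9 preceded docId 3 in the original ranking; both rescore to 1.0.
        Hit[] hits = {
            new Hit(9, 0, 1.0f),
            new Hit(3, 1, 1.0f),
            new Hit(5, 2, 2.0f),
        };
        return rescoredDocIds(hits);
    }

    public static void main(String[] args) {
        // Original-order tie-break gives [5, 9, 3]; a docId tie-break would give [5, 3, 9].
        System.out.println(Arrays.toString(demo()));
    }
}
```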
I'm interested in feedback on this change, and have some open questions:
- This might be considered a breaking change - is the docId order ever likely to be desirable? Should/could this be configurable?
- Does using `shardIndex` conflict with other uses of that property? Should this use a new property, e.g. `originalIndex`, instead?
- The `testRandom` test fails - how can this be resolved?
- Does this introduce a performance issue with the hits being sorted into docId order and back again?
- Is the initial sort into docId order necessary? If the first sort can be skipped, I think Java would leave items in their original order if the secondary comparator returned zero for same-score items - is this true? (I'll dig into this.)
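On that last question: `java.util.Arrays.sort` for object arrays is documented to be stable (it uses an adapted TimSort), so a comparator that returns zero for same-score hits does leave them in their input order. A minimal check:

```java
import java.util.Arrays;

// Minimal check that Arrays.sort on objects is stable: the comparator only
// looks at score, and same-score hits keep their input order.
public class StableSortDemo {
    static final class Hit {
        final String label;
        final float score;

        Hit(String label, float score) {
            this.label = label;
            this.score = score;
        }
    }

    static String sortedLabels() {
        Hit[] hits = {
            new Hit("a", 1f), new Hit("b", 2f), new Hit("c", 2f), new Hit("d", 1f),
        };
        // Descending score; returns 0 for ties, relying on sort stability.
        Arrays.sort(hits, (x, y) -> Float.compare(y.score, x.score));
        StringBuilder sb = new StringBuilder();
        for (Hit h : hits) {
            sb.append(h.label);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // b and c tie at 2.0 and keep their order; likewise a and d at 1.0.
        System.out.println(sortedLabels()); // prints "bcad"
    }
}
```

Note this only helps if the hits are still in their original order when the score sort runs, which is exactly the caveat in the bullet above about skipping the initial docId sort.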
Thanks @jpountz! I've switched to a new `originalIndex` property - the test passes, and I'll check this behaves as expected locally too...
It's working as expected when deployed, but I saw that `org.apache.lucene.search.TestQueryRescorer.testRandom` became unreliable with this change - it failed in about 1 in 8 runs until I updated the test to omit the sort-by-docId and use the original order. I'm wondering how we can make the test exercise this same-score/use-original-order case on every run rather than only at random. I'll have a closer look, but can you see any way to achieve this?
I've been exploring constraining `maxValue` so we're more likely to see same-score items each time - e.g. using 50 in `int maxValue = TestUtil.nextInt(random(), 10, 50);` rather than a million - does this sound OK to you? I don't know if there are competing reasons to keep the large number there.
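For what it's worth, the birthday problem suggests the small range should produce ties on almost every run. This is my own back-of-envelope check, not code from the test - I'm assuming around 20 hits, and the 41 distinct values that `TestUtil.nextInt(random(), 10, 50)` can return:

```java
// Back-of-envelope birthday-problem check: probability that at least two of
// n uniformly drawn scores collide. The hit count (20) is illustrative.
public class CollisionOdds {
    static double collisionProbability(int n, long distinctValues) {
        double pAllUnique = 1.0;
        for (int i = 0; i < n; i++) {
            pAllUnique *= (distinctValues - i) / (double) distinctValues;
        }
        return 1.0 - pAllUnique;
    }

    public static void main(String[] args) {
        // 41 distinct values: nextInt(random(), 10, 50) is inclusive of both ends.
        System.out.printf("range [10, 50]: %.4f%n", collisionProbability(20, 41));
        System.out.printf("range of 1e6:   %.6f%n", collisionProbability(20, 1_000_000));
    }
}
```

With these assumptions the small range collides in well over 99% of runs while a million-value range almost never does, so the same-score path would be exercised nearly every time.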
I'd rather do this by relying on a stable sort than by tracking an additional variable. It will require doing everything by sorting instead of using a `select()` first to partition around the top-k-th hit before sorting the first k hits, but it shouldn't be too bad either?
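If I've understood the suggestion, it amounts to something like the following sketch (my own names, assuming a single stable full sort replaces the select-then-partial-sort step):

```java
import java.util.Arrays;

// Sketch of top-k via one stable sort: same-score hits keep their original
// relative order, at the cost of sorting all hits rather than partitioning
// around the k-th hit first. Class and method names are illustrative.
public class TopKDemo {
    static final class Hit {
        final String label;
        final float score;

        Hit(String label, float score) {
            this.label = label;
            this.score = score;
        }
    }

    static Hit[] topK(Hit[] hits, int k) {
        Hit[] sorted = hits.clone();
        Arrays.sort(sorted, (x, y) -> Float.compare(y.score, x.score)); // stable
        return Arrays.copyOf(sorted, Math.min(k, sorted.length));
    }

    static String topLabels() {
        Hit[] hits = {
            new Hit("a", 1f), new Hit("b", 3f), new Hit("c", 2f),
            new Hit("d", 2f), new Hit("e", 3f),
        };
        StringBuilder sb = new StringBuilder();
        for (Hit h : topK(hits, 3)) {
            sb.append(h.label);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // b before e (tie at 3.0, input order preserved), then c.
        System.out.println(topLabels()); // prints "bec"
    }
}
```

The trade-off is O(n log n) over all hits instead of partitioning first and sorting only the top k, which I take to be the "shouldn't be too bad" part.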
Morning @andywebb1975 and @jpountz - I'm a bit late to the discussion, but I'm interested to be involved!
I'd be happy to look into this with Andy if the rewrite is the way to go; is there any chance this fix could go as-is for a quick patch for now Adrien, and the rewrite follows in a later PR, or is that not the way things are done here? 😀
Just picking this up again (sorry - life, etc!) - I've rebased my branch to pick up recent updates.
@jpountz your comment above sounds to me like undoing the change in LUCENE-8956 - is that what you meant?
(Also FYI @Willdotwhite is a colleague of mine - I was discussing this offline with him.)