Could `PointRangeQuery`'s boundary values be used by `NumericComparator` to calculate `estimatedNumberOfMatches`?
Description
Since LUCENE-9280, when we do a top-k search we can rebuild the DocIdSetIterator to reduce the number of candidate docs.
One condition for rebuilding the DocIdSetIterator is that it must reduce the number of docs by at least 8x. But when we do a top-k search with a PointRangeQuery, its estimatedNumberOfMatches includes docs that are outside the query boundaries. Could we take advantage of the range query's boundary values to make this condition easier to satisfy?
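For context, a minimal sketch of that rebuild condition, assuming the 8x threshold described above; the class, method, and parameter names are stand-ins for the comparator's internal state, not the actual Lucene source:

```java
// Simplified sketch of the LUCENE-9280-style condition, not actual Lucene code.
class RebuildCondition {
  // The iterator is only rebuilt when the estimated number of matching docs
  // shrinks to less than 1/8 of the current iterator's cost.
  static boolean shouldRebuild(long estimatedNumberOfMatches, long iteratorCost) {
    // Docs outside the PointRangeQuery's boundaries inflate the estimate,
    // so this check fails more often than it needs to.
    return estimatedNumberOfMatches < iteratorCost / 8;
  }
}
```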
Since LUCENE-10620 we pass the Weight to the Collector, so it might be possible to implement this optimization there?
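As a rough illustration of that idea (not a working patch): Collector#setWeight, Weight#getQuery, and PointRangeQuery's getLowerPoint/getUpperPoint are real APIs, but BoundsAwareComparator and clampCompetitiveRange are invented for this sketch; no such hook exists today:

```java
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.PointRangeQuery;
import org.apache.lucene.search.Weight;

// Hypothetical hook letting the collector hand query bounds to the comparator.
interface BoundsAwareComparator {
  void clampCompetitiveRange(byte[] lowerPoint, byte[] upperPoint);
}

class BoundsAwareCollector extends FilterCollector {
  private final BoundsAwareComparator comparator;

  BoundsAwareCollector(Collector in, BoundsAwareComparator comparator) {
    super(in);
    this.comparator = comparator;
  }

  @Override
  public void setWeight(Weight weight) {
    in.setWeight(weight);
    if (weight.getQuery() instanceof PointRangeQuery) {
      PointRangeQuery prq = (PointRangeQuery) weight.getQuery();
      // Narrow the sort field's assumed value range to the query's
      // [lowerPoint, upperPoint] so that docs outside the boundaries stop
      // inflating estimatedNumberOfMatches (hypothetical hook).
      comparator.clampCompetitiveRange(prq.getLowerPoint(), prq.getUpperPoint());
    }
  }
}
```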
The estimatedNumberOfMatches should still be very close to the actual number, so I'm not expecting that a more precise value would change the point at which we rebuild the DocIdSet of top-k candidates, would it?
> The estimatedNumberOfMatches should still be very close to the actual number
Actually, estimatedNumberOfMatches may be far from the actual number.
I wrote a test showing that documents outside the query boundaries participate in the calculation of estimatedNumberOfMatches, which is not what we would expect.
In that test, 80003 indexed documents match the PointRangeQuery, and TopFieldCollector collects different numbers of docs depending on how many documents fall outside the query boundaries (a rough reconstruction of the setup follows the table):
| Docs outside the query boundaries | Hits collected by TopFieldCollector |
|---|---|
| 1 | 1001 |
| 1000 | 1001 |
| 10000 | 1001 |
| 20000 | 80003 |
| 100000 | 80003 |
| 100000+ | 80003 |
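For reference, a rough reconstruction of that setup, assuming a single long field indexed with both points and doc values (the field name, values, and counts are illustrative; this is not the original test, and it does not instrument the collector's hit count):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class OutOfRangeDocsTest {
  public static void main(String[] args) throws IOException {
    int inRange = 80003;
    int outOfRange = 10000; // vary per table row: 1, 1000, 10000, 20000, ...
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < inRange + outOfRange; i++) {
        Document doc = new Document();
        // In-range docs get values inside the query bounds; the rest far outside.
        long value = i < inRange ? i : 10_000_000L + i;
        doc.add(new LongPoint("field", value));             // indexed points for PointRangeQuery
        doc.add(new NumericDocValuesField("field", value)); // doc values for the sort
        w.addDocument(doc);
      }
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = LongPoint.newRangeQuery("field", 0, inRange - 1); // matches the 80003 in-range docs
      Sort sort = new Sort(new SortField("field", SortField.Type.LONG));
      // Counting the docs the collector actually visits requires instrumenting
      // TopFieldCollector; this snippet only reproduces the query/sort setup.
      searcher.search(query, 1000, sort);
    }
  }
}
```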
Thanks, I had not fully understood that you were after the case where both the filter and the sort are on the same field. You are right that the collector could do better by being aware of the query. I suspect that the main challenge with this optimization is going to be implementing it in a clean way. If you have ideas how we could do this, I'd be happy to take a look.
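One possible ingredient for such an implementation, purely as a sketch: if the comparator clamped its competitive range to the query's bounds before estimating, out-of-range docs would no longer count. PointValues#estimatePointCount, IntersectVisitor, and Relation are real APIs; the surrounding wiring is hypothetical:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.PointValues.IntersectVisitor;
import org.apache.lucene.index.PointValues.Relation;

class ClampedEstimate {
  // Estimate how many indexed points fall inside [low, high] (one dimension,
  // fixed-width packed values). If the comparator clamped [low, high] with the
  // query's bounds, out-of-range docs would no longer inflate the estimate.
  static long estimateInRange(PointValues points, byte[] low, byte[] high) throws IOException {
    final int len = low.length;
    IntersectVisitor visitor = new IntersectVisitor() {
      @Override public void visit(int docID) {}
      @Override public void visit(int docID, byte[] packedValue) {}
      @Override public Relation compare(byte[] minPacked, byte[] maxPacked) {
        if (Arrays.compareUnsigned(maxPacked, 0, len, low, 0, len) < 0
            || Arrays.compareUnsigned(minPacked, 0, len, high, 0, len) > 0) {
          return Relation.CELL_OUTSIDE_QUERY; // cell entirely outside the clamped range
        }
        if (Arrays.compareUnsigned(minPacked, 0, len, low, 0, len) >= 0
            && Arrays.compareUnsigned(maxPacked, 0, len, high, 0, len) <= 0) {
          return Relation.CELL_INSIDE_QUERY; // cell fully contained
        }
        return Relation.CELL_CROSSES_QUERY;
      }
    };
    return points.estimatePointCount(visitor);
  }
}
```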