Could `PointRangeQuery`'s boundary values be used by `NumericComparator` to calculate `estimatedNumberOfMatches`?
Description
Since LUCENE-9280, when we do a top-k search we can rebuild the DocIdSetIterator to reduce the number of candidate docs.
One condition for rebuilding the DocIdSetIterator is that it must reduce the number of docs by at least 8x. But when we do a top-k search with a PointRangeQuery, its estimatedNumberOfMatches includes docs that are outside the query boundaries. Could we take advantage of the range query's boundary values to make this condition easier to satisfy?
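For context, a minimal sketch of that rebuild condition, assuming the 8x threshold described above; the class, method, and parameter names are stand-ins for the comparator's internal state, not the actual Lucene source:

```java
// Simplified sketch of the LUCENE-9280-style condition, not actual Lucene code.
class RebuildCondition {
  // The iterator is only rebuilt when the estimated number of matching docs
  // shrinks to less than 1/8 of the current iterator's cost.
  static boolean shouldRebuild(long estimatedNumberOfMatches, long iteratorCost) {
    // Docs outside the PointRangeQuery's boundaries inflate the estimate,
    // so this check fails more often than it needs to.
    return estimatedNumberOfMatches < iteratorCost / 8;
  }
}
```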
Since LUCENE-10620 we pass the Weight to the Collector, so it might be possible to implement this optimization there?
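As a rough illustration of that idea (not a working patch): Collector#setWeight, Weight#getQuery, and PointRangeQuery's getLowerPoint/getUpperPoint are real APIs, but BoundsAwareComparator and clampCompetitiveRange are invented for this sketch; no such hook exists today:

```java
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.PointRangeQuery;
import org.apache.lucene.search.Weight;

// Hypothetical hook letting the collector hand query bounds to the comparator.
interface BoundsAwareComparator {
  void clampCompetitiveRange(byte[] lowerPoint, byte[] upperPoint);
}

class BoundsAwareCollector extends FilterCollector {
  private final BoundsAwareComparator comparator;

  BoundsAwareCollector(Collector in, BoundsAwareComparator comparator) {
    super(in);
    this.comparator = comparator;
  }

  @Override
  public void setWeight(Weight weight) {
    in.setWeight(weight);
    if (weight.getQuery() instanceof PointRangeQuery) {
      PointRangeQuery prq = (PointRangeQuery) weight.getQuery();
      // Narrow the sort field's assumed value range to the query's
      // [lowerPoint, upperPoint] so that docs outside the boundaries stop
      // inflating estimatedNumberOfMatches (hypothetical hook).
      comparator.clampCompetitiveRange(prq.getLowerPoint(), prq.getUpperPoint());
    }
  }
}
```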
The estimatedNumberOfMatches should still be very close to the actual number, so I'm not expecting that a more precise value would change the point at which we rebuild the DocIdSet of top-k candidates, would it?
> The estimatedNumberOfMatches should still be very close to the actual number
Actually, estimatedNumberOfMatches may be far from the actual number.
I wrote a test showing that documents outside the query boundaries participate in the calculation of estimatedNumberOfMatches, which is not what we would expect.
In that test, 80003 indexed documents match the PointRangeQuery, and TopFieldCollector collects different numbers of docs depending on how many documents fall outside the query boundaries (a rough reconstruction of the setup follows the table):
| Docs outside the query boundaries | Hits collected by TopFieldCollector |
|---|---|
| 1 | 1001 |
| 1000 | 1001 |
| 10000 | 1001 |
| 20000 | 80003 |
| 100000 | 80003 |
| 100000+ | 80003 |
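For reference, a rough reconstruction of that setup, assuming a single long field indexed with both points and doc values (the field name, values, and counts are illustrative; this is not the original test, and it does not instrument the collector's hit count):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class OutOfRangeDocsTest {
  public static void main(String[] args) throws IOException {
    int inRange = 80003;
    int outOfRange = 10000; // vary per table row: 1, 1000, 10000, 20000, ...
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < inRange + outOfRange; i++) {
        Document doc = new Document();
        // In-range docs get values inside the query bounds; the rest far outside.
        long value = i < inRange ? i : 10_000_000L + i;
        doc.add(new LongPoint("field", value));             // indexed points for PointRangeQuery
        doc.add(new NumericDocValuesField("field", value)); // doc values for the sort
        w.addDocument(doc);
      }
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = LongPoint.newRangeQuery("field", 0, inRange - 1); // matches the 80003 in-range docs
      Sort sort = new Sort(new SortField("field", SortField.Type.LONG));
      // Counting the docs the collector actually visits requires instrumenting
      // TopFieldCollector; this snippet only reproduces the query/sort setup.
      searcher.search(query, 1000, sort);
    }
  }
}
```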
Thanks, I had not fully understood that you were after the case where both the filter and the sort are on the same field. You are right that the collector could do better by being aware of the query. I suspect that the main challenge with this optimization is going to be implementing it in a clean way. If you have ideas how we could do this, I'd be happy to take a look.
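One possible ingredient for such an implementation, purely as a sketch: if the comparator clamped its competitive range to the query's bounds before estimating, out-of-range docs would no longer count. PointValues#estimatePointCount, IntersectVisitor, and Relation are real APIs; the surrounding wiring is hypothetical:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.PointValues.IntersectVisitor;
import org.apache.lucene.index.PointValues.Relation;

class ClampedEstimate {
  // Estimate how many indexed points fall inside [low, high] (one dimension,
  // fixed-width packed values). If the comparator clamped [low, high] with the
  // query's bounds, out-of-range docs would no longer inflate the estimate.
  static long estimateInRange(PointValues points, byte[] low, byte[] high) throws IOException {
    final int len = low.length;
    IntersectVisitor visitor = new IntersectVisitor() {
      @Override public void visit(int docID) {}
      @Override public void visit(int docID, byte[] packedValue) {}
      @Override public Relation compare(byte[] minPacked, byte[] maxPacked) {
        if (Arrays.compareUnsigned(maxPacked, 0, len, low, 0, len) < 0
            || Arrays.compareUnsigned(minPacked, 0, len, high, 0, len) > 0) {
          return Relation.CELL_OUTSIDE_QUERY; // cell entirely outside the clamped range
        }
        if (Arrays.compareUnsigned(minPacked, 0, len, low, 0, len) >= 0
            && Arrays.compareUnsigned(maxPacked, 0, len, high, 0, len) <= 0) {
          return Relation.CELL_INSIDE_QUERY; // cell fully contained
        }
        return Relation.CELL_CROSSES_QUERY;
      }
    };
    return points.estimatePointCount(visitor);
  }
}
```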