lucene
lucene copied to clipboard
Enabling point range collector always if doc values not indexed
Description
Enabling point range collector if doc values are not indexed
Resolves https://github.com/apache/lucene/issues/14536
I'd rather not do it. My mental model is that this collector works on doc values in the default case, and can opportunistically take advantage of index statistics or points indexes when it makes things faster. So doc values should be required, and index statistics and point indexes should be optional. Plus it would be confusing to me that the same collector would work or fail dependin on the specific query impl that is used, e.g. working with a PointRangeQuery on the same field (thanks to your other PR) but not on a PointRangeQuery on another field that has the same values for every doc.
My mental model is that this collector works on doc values in the default case, and can opportunistically take advantage of index statistics or points indexes when it makes things faster. So doc values should be required, and index statistics and point indexes should be optional.
+1, except that it works today even when doc_values are not indexed and that too in some cases depending on the bucket width (even more confusing as that is strictly implementation detail). Maybe to address both the inconsistencies we have discussed, let us just move the below check before using points index for collection? That ensures, we always throw exception is doc values are not indexed, and optionally use points index if available and conditions are met.
if (fi.getDocValuesType() != DocValuesType.NUMERIC
&& fi.getDocValuesType() != DocValuesType.SORTED_NUMERIC) {
throw new IllegalStateException(
"Expected numeric field, but got doc-value type: " + fi.getDocValuesType());
}
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
Reran the benchmark, just to ensure nothing is regressed, and collector works as expected:
Benchmark (bucketWidth) (docCount) (pointEnabled) Mode Cnt Score Error Units
HistogramCollectorBenchmark.pointRangeQueryHistogram 5000 500000 true thrpt 3 6457.918 ± 2480.922 ops/s
HistogramCollectorBenchmark.pointRangeQueryHistogram 5000 500000 false thrpt 3 119.723 ± 10.479 ops/s
HistogramCollectorBenchmark.pointRangeQueryHistogram 5000 5000000 true thrpt 3 794.071 ± 78.626 ops/s
HistogramCollectorBenchmark.pointRangeQueryHistogram 5000 5000000 false thrpt 3 12.416 ± 0.423 ops/s
HistogramCollectorBenchmark.pointRangeQueryHistogram 25000 500000 true thrpt 3 14587.655 ± 2973.992 ops/s
HistogramCollectorBenchmark.pointRangeQueryHistogram 25000 500000 false thrpt 3 120.221 ± 10.853 ops/s
HistogramCollectorBenchmark.pointRangeQueryHistogram 25000 5000000 true thrpt 3 3492.358 ± 92.812 ops/s
HistogramCollectorBenchmark.pointRangeQueryHistogram 25000 5000000 false thrpt 3 12.418 ± 0.406 ops/s
@jpountz - As per the suggestion, I have changed the PR to disable HistogramCollection instead if docValues is not indexed. Please let me know if there is any other feedback.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!