lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Enabling point range collector always if doc values not indexed

Open jainankitk opened this issue 7 months ago • 3 comments

Description

Enabling point range collector if doc values are not indexed

Resolves https://github.com/apache/lucene/issues/14536

jainankitk avatar Apr 26 '25 07:04 jainankitk

I'd rather not do it. My mental model is that this collector works on doc values in the default case, and can opportunistically take advantage of index statistics or points indexes when it makes things faster. So doc values should be required, and index statistics and point indexes should be optional. Plus it would be confusing to me that the same collector would work or fail dependin on the specific query impl that is used, e.g. working with a PointRangeQuery on the same field (thanks to your other PR) but not on a PointRangeQuery on another field that has the same values for every doc.

jpountz avatar Apr 28 '25 13:04 jpountz

My mental model is that this collector works on doc values in the default case, and can opportunistically take advantage of index statistics or points indexes when it makes things faster. So doc values should be required, and index statistics and point indexes should be optional.

+1, except that it works today even when doc_values are not indexed and that too in some cases depending on the bucket width (even more confusing as that is strictly implementation detail). Maybe to address both the inconsistencies we have discussed, let us just move the below check before using points index for collection? That ensures, we always throw exception is doc values are not indexed, and optionally use points index if available and conditions are met.

if (fi.getDocValuesType() != DocValuesType.NUMERIC
    && fi.getDocValuesType() != DocValuesType.SORTED_NUMERIC) {
  throw new IllegalStateException(
      "Expected numeric field, but got doc-value type: " + fi.getDocValuesType());
}

jainankitk avatar Apr 28 '25 22:04 jainankitk

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar May 13 '25 00:05 github-actions[bot]

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

github-actions[bot] avatar Aug 27 '25 23:08 github-actions[bot]

Reran the benchmark, just to ensure nothing is regressed, and collector works as expected:

  Benchmark                                             (bucketWidth)  (docCount)  (pointEnabled)   Mode  Cnt      Score      Error  Units                               
  HistogramCollectorBenchmark.pointRangeQueryHistogram           5000      500000            true  thrpt    3   6457.918 ± 2480.922  ops/s                               
  HistogramCollectorBenchmark.pointRangeQueryHistogram           5000      500000           false  thrpt    3    119.723 ±   10.479  ops/s                               
  HistogramCollectorBenchmark.pointRangeQueryHistogram           5000     5000000            true  thrpt    3    794.071 ±   78.626  ops/s                               
  HistogramCollectorBenchmark.pointRangeQueryHistogram           5000     5000000           false  thrpt    3     12.416 ±    0.423  ops/s                               
  HistogramCollectorBenchmark.pointRangeQueryHistogram          25000      500000            true  thrpt    3  14587.655 ± 2973.992  ops/s                               
  HistogramCollectorBenchmark.pointRangeQueryHistogram          25000      500000           false  thrpt    3    120.221 ±   10.853  ops/s                               
  HistogramCollectorBenchmark.pointRangeQueryHistogram          25000     5000000            true  thrpt    3   3492.358 ±   92.812  ops/s                               
  HistogramCollectorBenchmark.pointRangeQueryHistogram          25000     5000000           false  thrpt    3     12.418 ±    0.406  ops/s

jainankitk avatar Aug 28 '25 00:08 jainankitk

@jpountz - As per the suggestion, I have changed the PR to disable HistogramCollection instead if docValues is not indexed. Please let me know if there is any other feedback.

jainankitk avatar Aug 28 '25 00:08 jainankitk

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Sep 12 '25 00:09 github-actions[bot]