Made DocIdsWriter use DISI when reading documents with an IntersectVisitor
Instead of calling `IntersectVisitor.visit` for each doc in the `readDelta16` and `readInts32` methods, create a `DocIdSetIterator` and call `IntersectVisitor.visit(DocIdSetIterator)` instead.
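In rough code, the shape of the change is the following (a hedged sketch, not the PR verbatim - `docIds` and `count` stand in for the decoded doc ID block in `DocIdsWriter`):

```java
import java.io.IOException;
import org.apache.lucene.index.PointValues.IntersectVisitor;
import org.apache.lucene.search.DocIdSetIterator;

class VisitSketch {
  // Hedged sketch: before, one virtual call per document:
  //   for (int i = 0; i < count; i++) visitor.visit(docIds[i]);
  // After, a single virtual call per BKD leaf:
  static void visitBlock(int[] docIds, int count, IntersectVisitor visitor) throws IOException {
    visitor.visit(new DocIdSetIterator() {
      private int idx = -1;

      @Override
      public int docID() {
        return idx < 0 ? -1 : (idx < count ? docIds[idx] : NO_MORE_DOCS);
      }

      @Override
      public int nextDoc() {
        return ++idx < count ? docIds[idx] : NO_MORE_DOCS;
      }

      @Override
      public int advance(int target) throws IOException {
        return slowAdvance(target); // fine for a one-shot forward scan
      }

      @Override
      public long cost() {
        return count;
      }
    });
  }
}
```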
This seems to make Lucene faster at some sorting and range querying tasks - I saw a 35-45% reduction in execution time. I learnt this through running this benchmark setup by Elastic: https://github.com/elastic/elasticsearch-opensearch-benchmark.
The hypothesis is that this is due to fewer virtual calls being made - once per BKD leaf, instead of once per document. Note that this is only measurable if the readInts methods have been called with at least 3 implementations of the `IntersectVisitor` interface - otherwise JIT inlining takes away the virtual call. In real-life Lucene deployments, I would judge it very likely that at least 3 implementations are used. For more details on the methodology, see this blog post: https://blunders.io/posts/es-benchmark-4-inlining
I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutil benchmarks should I be using for these changes?
Thanks for the approval!
I do however think that it would be good to prove this performance improvement in luceneutil before merging, to make the benchmark more easily reproducible than the benchmarking I have done. I have started looking into it, but will have more time for it next week. Does that sound reasonable? The only downside I see is a later merge of this PR.
@antonha could you add a `CHANGES.txt` entry explaining the optimization? Thanks!
Yes, will do once I have run some luceneutil benchmarks (if we want to wait for that)
> I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutil benchmarks should I be using for these changes?
Confusingly named, the `IntNRQ` task is the only task using points, I think. It runs simple 1D range queries.
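For context, the `IntNRQ` task exercises queries of roughly this shape (a hedged illustration; the field name and bounds here are made up):

```java
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.search.Query;

// A simple 1D numeric range query over an int point field - the kind
// of query the IntNRQ task benchmarks.
Query query = IntPoint.newRangeQuery("myIntField", 100, 200);
```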
The geospatial benchmarks (a different benchmark than the normal/nightly luceneutil) use multi-dimensional points.
Still, given that you saw gains in the "OpenSearch vs Elasticsearch" benchmarks, even if the results are flat with the existing luceneutil benchmarks, I think we should just merge the change. The night after we can watch Lucene's nightly benchmarks and see if the nightly box measured anything.
(Hmm, curiously/separately, it looks like something caused a jump in some geo tasks' performance e.g. distance filter ... I'll try to find the cause and add an annotation!).
I spent some time "proving" this in luceneutil - luceneutil/pull/257 adds a reproduction - if run with `wikimediumall`, `optimize = True` for indexing and `commitPoint = 'single'`. The reason that the larger segment is needed is that Lucene otherwise chooses to store document ids in the BKD leaves as int24, meaning that the optimization in this PR does nothing. With the luceneutil changes, I get the following output when comparing this PR to master:
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
OrHighNotLow 527.76 (7.9%) 516.38 (8.0%) -2.2% ( -16% - 14%) 0.392
OrHighNotMed 523.37 (7.3%) 512.63 (7.5%) -2.1% ( -15% - 13%) 0.384
OrHighNotHigh 418.72 (6.4%) 411.39 (6.3%) -1.8% ( -13% - 11%) 0.384
BrowseMonthTaxoFacets 11.50 (22.5%) 11.32 (26.2%) -1.6% ( -41% - 60%) 0.839
OrNotHighHigh 419.16 (5.4%) 412.88 (5.4%) -1.5% ( -11% - 9%) 0.381
AndHighHigh 43.95 (5.4%) 43.42 (3.8%) -1.2% ( -9% - 8%) 0.413
HighIntervalsOrdered 4.60 (5.6%) 4.55 (6.0%) -1.1% ( -12% - 11%) 0.561
HighTermMonthSort 13015.73 (3.3%) 12889.39 (4.0%) -1.0% ( -7% - 6%) 0.401
Fuzzy1 285.37 (3.3%) 282.87 (2.2%) -0.9% ( -6% - 4%) 0.320
AndHighHighDayTaxoFacets 9.00 (4.6%) 8.92 (4.8%) -0.8% ( -9% - 8%) 0.577
HighSpanNear 4.59 (2.8%) 4.55 (2.8%) -0.8% ( -6% - 4%) 0.356
OrNotHighLow 1091.32 (2.8%) 1082.56 (3.3%) -0.8% ( -6% - 5%) 0.402
LowSpanNear 12.72 (2.4%) 12.64 (2.5%) -0.7% ( -5% - 4%) 0.377
HighTermTitleBDVSort 5.69 (2.4%) 5.65 (2.5%) -0.7% ( -5% - 4%) 0.387
Fuzzy2 165.14 (2.4%) 164.10 (1.8%) -0.6% ( -4% - 3%) 0.343
MedSpanNear 12.61 (1.9%) 12.53 (2.2%) -0.6% ( -4% - 3%) 0.327
HighTerm 578.67 (6.5%) 575.09 (5.1%) -0.6% ( -11% - 11%) 0.739
MedTermDayTaxoFacets 19.91 (5.1%) 19.79 (4.3%) -0.6% ( -9% - 9%) 0.683
OrHighMedDayTaxoFacets 3.43 (12.0%) 3.41 (11.4%) -0.5% ( -21% - 25%) 0.885
MedTerm 748.03 (5.6%) 744.26 (4.5%) -0.5% ( -10% - 10%) 0.755
AndHighMedDayTaxoFacets 32.72 (1.5%) 32.57 (1.6%) -0.5% ( -3% - 2%) 0.335
LowTerm 526.39 (2.7%) 524.40 (2.7%) -0.4% ( -5% - 5%) 0.653
HighSloppyPhrase 9.56 (2.3%) 9.53 (2.9%) -0.3% ( -5% - 5%) 0.689
HighPhrase 33.59 (3.8%) 33.49 (3.8%) -0.3% ( -7% - 7%) 0.803
MedIntervalsOrdered 14.95 (4.2%) 14.92 (4.4%) -0.2% ( -8% - 8%) 0.876
BrowseDayOfYearSSDVFacets 6.37 (1.4%) 6.36 (1.3%) -0.2% ( -2% - 2%) 0.665
MedSloppyPhrase 15.68 (1.6%) 15.65 (2.8%) -0.2% ( -4% - 4%) 0.799
AndHighMed 109.07 (3.9%) 108.94 (3.2%) -0.1% ( -6% - 7%) 0.915
LowSloppyPhrase 12.64 (1.5%) 12.63 (2.2%) -0.1% ( -3% - 3%) 0.855
OrNotHighMed 377.59 (3.0%) 377.64 (3.3%) 0.0% ( -6% - 6%) 0.989
BrowseMonthSSDVFacets 6.65 (0.4%) 6.65 (0.4%) 0.0% ( 0% - 0%) 0.829
OrHighLow 541.18 (3.3%) 541.47 (3.3%) 0.1% ( -6% - 6%) 0.960
LowIntervalsOrdered 12.42 (4.0%) 12.43 (3.9%) 0.1% ( -7% - 8%) 0.948
BrowseRandomLabelSSDVFacets 5.35 (1.3%) 5.36 (1.1%) 0.1% ( -2% - 2%) 0.809
MedPhrase 47.26 (2.3%) 47.31 (2.4%) 0.1% ( -4% - 4%) 0.889
Prefix3 246.21 (4.9%) 246.48 (7.3%) 0.1% ( -11% - 12%) 0.956
LowPhrase 34.24 (2.5%) 34.30 (2.4%) 0.2% ( -4% - 5%) 0.824
HighTermTitleSort 98.67 (2.9%) 98.84 (4.6%) 0.2% ( -7% - 7%) 0.884
OrHighHigh 42.99 (8.8%) 43.10 (8.6%) 0.3% ( -15% - 19%) 0.924
BrowseDateSSDVFacets 2.24 (13.7%) 2.25 (13.2%) 0.6% ( -23% - 31%) 0.895
Respell 255.75 (1.7%) 257.22 (1.7%) 0.6% ( -2% - 4%) 0.281
BrowseDayOfYearTaxoFacets 7.39 (1.9%) 7.44 (4.1%) 0.7% ( -5% - 6%) 0.489
AndHighLow 961.38 (3.8%) 970.95 (4.1%) 1.0% ( -6% - 9%) 0.427
BrowseDateTaxoFacets 7.08 (1.3%) 7.16 (4.6%) 1.1% ( -4% - 7%) 0.291
TermDTSort 77.89 (7.4%) 78.84 (4.8%) 1.2% ( -10% - 14%) 0.535
PKLookup 250.25 (3.3%) 253.54 (2.7%) 1.3% ( -4% - 7%) 0.169
HighTermDayOfYearSort 107.64 (2.6%) 109.12 (2.1%) 1.4% ( -3% - 6%) 0.064
OrHighMed 109.62 (6.2%) 111.24 (6.1%) 1.5% ( -10% - 14%) 0.447
BrowseRandomLabelTaxoFacets 6.17 (5.0%) 6.27 (3.9%) 1.6% ( -6% - 11%) 0.253
Wildcard 183.93 (3.7%) 187.03 (4.4%) 1.7% ( -6% - 10%) 0.193
LongNRQ 42.83 (6.1%) 83.47 (34.5%) 94.9% ( 51% - 144%) 0.000
IntNRQ 20.29 (3.4%) 53.24 (34.2%) 162.4% ( 120% - 207%) 0.000
```
The interesting parts here are the bottom two lines - IntNRQ and LongNRQ become much faster. It might be that I messed up the "minimal" part of the reproduction - maybe all that was needed is the single segment and the `taskCountPerCat` increase.
Regardless, it looks promising - a 94% to 162% increase in QPS for range queries with this PR in the slightly modified benchmark.
Would we be subject to the same issue if/when 3+ different implementations of `DocIdSetIterator` get used in `IntersectVisitor#visit`?
> Would we be subject to the same issue if/when 3+ different implementations of `DocIdSetIterator` get used in `IntersectVisitor#visit`?
Yes.
Your question makes me think that maybe we should not add a new `DocIdSetIterator` for this PR - maybe that was your thought as well?
Relatedly indeed, I was wondering if the API should expose an `IntsRef` or something like that, so that there is a single virtual call per block of doc IDs anyway (`IntsRef` cannot be extended, it's a single impl).
@jpountz I had a quick look at the code, and it seems to me like there are, with this PR, only two implementations used for the DISI passed to the `IntersectVisitor#visit` method - which is good in the sense that it would probably be fine to merge it now, but bad in the sense that adding a third would hurt performance.
Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?
++ on progress over perfection
That said, I wonder if this change is legal: `DocIdSetIterator` must return doc IDs in order, but it looks like it wouldn't always be the case with your change?
> That said, I wonder if this change is legal: `DocIdSetIterator` must return doc IDs in order, but it looks like it wouldn't always be the case with your change?
You are correct - I had (falsely) assumed that the document ids in the DocIdsWriter were written in order - thus the PR as of now (dacb2a5d3ad707a555ed1fa2925a082d81aa4443) is not a legal change.
The good news is that this should not have affected the benchmarks - neither the `PointRangeQuery` nor the `NumericLeafComparator` seems to rely on the DISI being ordered. They just iterate until `DocIdSetIterator.NO_MORE_DOCS`.
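For illustration, both consumers drain the iterator with the standard exhaust-until-done idiom, which is order-insensitive (a sketch; `disi` and `bitSet` are illustrative names):

```java
// Order-insensitive consumption: each doc ID is handled exactly once,
// no matter what order the iterator produces them in.
int doc;
while ((doc = disi.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  bitSet.set(doc);
}
```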
I will try adding a `visit()` method taking an `IntsRef` (I believe that is what you meant @jpountz?).
I've changed the readInts16/32 methods to now use the `IntsRef` instead - I kind of like how it turned out, but the downside is that we will have to add many implementations to see the benefits. I've added the 2 in the `PointRangeQuery` - if we want to continue down this path, I can add more.
One thing that I would love input on is whether or not to manually inline the `visit(docid)` version. E.g. I now do:
```java
for (int i = ref.offset; i < ref.offset + ref.length; i++) {
  result.clear(ref.ints[i]);
  cost[0]--;
}
```
While we could do:
```java
for (int i = ref.offset; i < ref.offset + ref.length; i++) {
  visit(ref.ints[i]);
}
```
The latter would, most likely, be inlined by the JVM. The former is manually inlined, so we don't need to trust the JVM on it. Thoughts?
It is important to note that just having the default method in `IntersectVisitor` would not be good, since that would do virtual calls to `visit(int)`. Copy-pasting that method body into each implementation would however, most likely, inline better.
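Concretely, the default method in question looks roughly like this (a sketch of the shape, not necessarily the exact PR code):

```java
// As a default method on IntersectVisitor, the inner visit(int) call
// is still virtual - and megamorphic once 3+ visitor implementations
// are hot - so each implementation should override this with its own
// copy of the loop to let the JIT inline the per-doc work.
default void visit(IntsRef ref) throws IOException {
  for (int i = ref.offset; i < ref.offset + ref.length; i++) {
    visit(ref.ints[i]);
  }
}
```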
I benchmarked the current commit (e826696fd235cb1ef215745944cac6e34faa0d20) using the same modified lucenebench as above, and it seems to like this change even better:
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
LongNRQ 41.11 (6.6%) 107.13 (13.4%) 160.6% ( 131% - 193%) 0.000
IntNRQ 23.23 (7.0%) 85.58 (19.0%) 268.4% ( 226% - 316%) 0.000
```
(the baseline being master)
> I will try adding a `visit()` method taking an `IntsRef` (I believe that is what you meant @jpountz?).
This is what I meant indeed.
Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2?
> Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2?
Valid questions.
I ran the same luceneutil benchmark with `-XX:+LogCompilation` and `-XX:+PrintInlining` and I can see c2/level4 compilations being done, including `BKDPointTree::addAll` and `DocIdsWriter::readInts32`. The Elasticsearch benchmarks mentioned in my post (where we saw similar results) were also run long enough that I would be very surprised if c2 had not kicked in.
I also tried reproducing this behavior in a shorter java program for demonstration purposes - with virtual calls similar to the `IntersectVisitor`. The program has 3 implementations of an interface processing `int` values, with both single value and batch implementations:
```java
import java.util.Random;

public class C2Inlining {
  static int ITERATIONS = 50_000;
  static int NUM_VALUES = 100_000;

  public static void main(String[] args) {
    // Generate ints
    Random r = new Random();
    int[] arr = new int[NUM_VALUES];
    for (int i = 0; i < NUM_VALUES; i++) {
      arr[i] = r.nextInt();
    }
    IntProcessor[] intProcessors = {
      new IntProcessor1(),
      new IntProcessor2(),
      // Comment this last one out to trigger bimorphic behaviour
      new IntProcessor3()
    };
    processOneByOne(intProcessors, arr);
    processBatch(intProcessors, arr);
  }

  private static void processOneByOne(IntProcessor[] intProcessors, int[] arr) {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      for (IntProcessor intProcessor : intProcessors) {
        for (int value : arr) {
          intProcessor.process(value);
        }
      }
    }
    long end = System.nanoTime();
    long took = end - start;
    System.out.printf("One-by-one: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
  }

  private static void processBatch(IntProcessor[] intProcessors, int[] arr) {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      for (IntProcessor intProcessor : intProcessors) {
        intProcessor.process(arr);
      }
    }
    long end = System.nanoTime();
    long took = end - start;
    System.out.printf("Batch: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
  }

  interface IntProcessor {
    void process(int i);

    void process(int[] arr);
  }

  static class IntProcessor1 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }

  static class IntProcessor2 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }

  static class IntProcessor3 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }
}
```
I ran this in this form, and with one implementation commented out (to trigger the bimorphic inlining behavior). The timing results are quite extreme (running with jdk21 and `-server`):
| Variant | One-by-One | Batch |
|---|---|---|
| Three implementations | 1.101 ms/iter | 0.018 ms/iter |
| Two implementations | 0.198 ms/iter | 0.012 ms/iter |
I.e. with three implementations, batching is ~60 times faster than one-by-one calls. With two, it is ~16 times faster. Both versions saw c2 compilations.
This is the same pattern that we see with the Lucene code - although the difference here is much more extreme (I'm guessing due to the trivial implementations). I do think that we can draw the conclusion that helping the JVM with batch versions of virtual calls can help performance significantly.
One might argue that the JVM should be able to figure this out on its own. But until it does, let's maybe help it a bit?
Ideally we'd add a query that adds another implementation of an `IntersectVisitor` to the nightly benchmarks before merging that PR, so that we can see the performance bump?
Yes - this would be ideal imo. Some caveats that I found while testing:

- `PointRangeQuery`s are executed with different `IntersectVisitor`s depending on the expected number of matches, which means that several query intervals need to be used in each JVM. In luceneutil, I believe that this corresponds to setting `taskCountPerCat` high.
- To see significant change, the number of documents in the segments needs to fit into either the 16 bit or 32 bit implementation of `bkd.DocIdsWriter` - this was not the case with the simple `localrun.py` script out of the box (see the sketch after this list).
- It is possible that we see a performance decrease when we add more `IntersectVisitor` variants, since the benchmark becomes more realistically slow with the additional implementations.
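To make the second caveat concrete, a rough sketch of the encoding choice (heavily simplified - the real conditions live in Lucene's `bkd.DocIdsWriter` and differ in detail; `sorted`, `rangeFitsIn16Bits` and `maxBitsRequired` are hypothetical helpers):

```java
// Per BKD leaf, doc IDs are written with the narrowest encoding that
// fits, and only some of the corresponding read paths are optimized
// by this PR - hence segment size decides whether the change shows up.
void writeDocIds(int[] docIds, int start, int count) throws IOException {
  if (sorted(docIds, start, count) && rangeFitsIn16Bits(docIds, start, count)) {
    writeDelta16(docIds, start, count); // optimized read path (readDelta16)
  } else if (maxBitsRequired(docIds, start, count) <= 24) {
    writeInts24(docIds, start, count);  // untouched by this PR
  } else {
    writeInts32(docIds, start, count);  // optimized read path (readInts32)
  }
}
```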
> Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?
+1 -- PNP!
> In luceneutil, I believe that this corresponds to setting `taskCountPerCat` high.
The nightly benchy uses `taskCountPerCat=5` -- is that enough to reveal the impact of this optimization?
But, the nightly benchy does not `forceMerge` down to one segment, so we may not have the right number of docs to use 16 or 32 bit int encoding in the BKD file.
Actually, I don't think we should block this change on fixing nightly benchy to reveal its gains? Your above luceneutil repro is sufficient? I'm wary of allowing benchmarking to unduly hold up otherwise good Lucene changes in general... it's a helpful tool, not a crutch ;)
Thank you all for helping to get this PR into a better state. The only thing that still irked me is that I couldn't get the tests to execute all newly added methods. I made a few changes in the last commit to:

- Add a new test case which calls `doTestRandomLongs` with 20 000 docs - without this I couldn't get the IntsRef path to trigger often enough. This should maybe be a `@Nightly`? (See the sketch after this list.)
- Add a higher chance of having 0 missing values in `doTestRandomLongs` - otherwise the inverse intersect visitor would only have a chance of happening every 100 runs, which I thought a bit low.
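A sketch of what the first change could look like (hedged - the actual test name and location in the PR may differ):

```java
// Hypothetical test method: reuse the existing randomized long-range
// test with enough documents that BKD leaves hit the IntsRef read path.
public void testRandomLongsWithManyDocs() throws Exception {
  doTestRandomLongs(20_000);
}
```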
I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes? After this, what's the protocol? Do I merge the PR myself or should a committer push the button?
> Add a new test case which calls `doTestRandomLongs` with 20 000 docs - without this I couldn't get the IntsRef path to trigger often enough. This should maybe be a `@Nightly`?
How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...
> I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes?
I'll have a look!
> Do I merge the PR myself or should a committer push the button?
One of us committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge.
> Add a new test case which calls `doTestRandomLongs` with 20 000 docs - without this I couldn't get the IntsRef path to trigger often enough. This should maybe be a `@Nightly`?

> How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...
I timed this a bit ... looks like it's ~.25 - .5 seconds. I think that's OK to run always.
> One of us committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge.
Sounds great - the javadoc fix is done. Thanks a lot for having a look and timing the test!
Thanks @antonha -- sorry, maybe also add a `CHANGES.txt` entry? This is an exciting opto!
> Thanks @antonha -- sorry, maybe also add a `CHANGES.txt` entry? This is an exciting opto!
My bad - I should have added one the first time you asked :see_no_evil:. I've added one now, let me know if you feel like it lacks something!
OK, I merged and backported to 9.11.0 -- I think that's safe: we added a new default method to `IntersectVisitor`.