Made DocIdsWriter use DISI when reading documents with an IntersectVisitor
Instead of calling `IntersectVisitor.visit` for each doc in the `readDelta16` and `readInts32` methods, create a `DocIdSetIterator` and call `IntersectVisitor.visit(DocIdSetIterator)` instead.
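In rough code, the shape of the change is the following (a hedged sketch, not the PR verbatim - `docIds` and `count` stand in for the decoded doc ID block in `DocIdsWriter`):

```java
import java.io.IOException;
import org.apache.lucene.index.PointValues.IntersectVisitor;
import org.apache.lucene.search.DocIdSetIterator;

class VisitSketch {
  // Hedged sketch: before, one virtual call per document:
  //   for (int i = 0; i < count; i++) visitor.visit(docIds[i]);
  // After, a single virtual call per BKD leaf:
  static void visitBlock(int[] docIds, int count, IntersectVisitor visitor) throws IOException {
    visitor.visit(new DocIdSetIterator() {
      private int idx = -1;

      @Override
      public int docID() {
        return idx < 0 ? -1 : (idx < count ? docIds[idx] : NO_MORE_DOCS);
      }

      @Override
      public int nextDoc() {
        return ++idx < count ? docIds[idx] : NO_MORE_DOCS;
      }

      @Override
      public int advance(int target) throws IOException {
        return slowAdvance(target); // fine for a one-shot forward scan
      }

      @Override
      public long cost() {
        return count;
      }
    });
  }
}
```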
This seems to make Lucene faster at some sorting and range querying tasks - I saw a 35-45% reduction in execution time. I learnt this through running this benchmark setup by Elastic: https://github.com/elastic/elasticsearch-opensearch-benchmark.
The hypothesis is that this is due to fewer virtual calls being made - once per BKD leaf, instead of once per document. Note that this is only measurable if the readInts methods have been called with at least 3 implementations of the `IntersectVisitor` interface - otherwise JIT inlining takes away the virtual call. In real-life Lucene deployments, I would judge it very likely that at least 3 implementations are used. For more details on the methodology, see this blog post: https://blunders.io/posts/es-benchmark-4-inlining
I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutil benchmarks should I be using for these changes?
Thanks for the approval!
I do however think that it would be good to prove this performance improvement in luceneutil before merging, to make the benchmark more easily reproducible than the benchmarking I have done. I have started looking into it, but will have more time for it next week. Does that sound reasonable? The only downside I see is a later merge of this PR.
@antonha could you add a `CHANGES.txt` entry explaining the optimization? Thanks!
Yes, will do once I have run some luceneutil benchmarks (if we want to wait for that)
> I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutil benchmarks should I be using for these changes?
Confusingly named, the `IntNRQ` task is the only task using points, I think. It runs simple 1D range queries.
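For context, the `IntNRQ` task exercises queries of roughly this shape (a hedged illustration; the field name and bounds here are made up):

```java
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.search.Query;

// A simple 1D numeric range query over an int point field - the kind
// of query the IntNRQ task benchmarks.
Query query = IntPoint.newRangeQuery("myIntField", 100, 200);
```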
The geospatial benchmarks (a different benchmark than the normal/nightly luceneutil) use multi-dimensional points.
Still, given that you saw gains in the "OpenSearch vs Elasticsearch" benchmarks, even if the results are flat with the existing luceneutil benchmarks, I think we should just merge the change. The night after we can watch Lucene's nightly benchmarks and see if the nightly box measured anything.
(Hmm, curiously/separately, it looks like something caused a jump in some geo tasks' performance e.g. distance filter ... I'll try to find the cause and add an annotation!).
I spent some time "proving" this in luceneutil - luceneutil/pull/257 adds a reproduction - if run with `wikimediumall`, `optimize = True` for indexing and `commitPoint = 'single'`. The reason that the larger segment is needed is that Lucene otherwise chooses to store document ids in the BKD leaves as int24, meaning that the optimization in this PR does nothing. With the luceneutil changes, I get the following output when comparing this PR to master:
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
OrHighNotLow 527.76 (7.9%) 516.38 (8.0%) -2.2% ( -16% - 14%) 0.392
OrHighNotMed 523.37 (7.3%) 512.63 (7.5%) -2.1% ( -15% - 13%) 0.384
OrHighNotHigh 418.72 (6.4%) 411.39 (6.3%) -1.8% ( -13% - 11%) 0.384
BrowseMonthTaxoFacets 11.50 (22.5%) 11.32 (26.2%) -1.6% ( -41% - 60%) 0.839
OrNotHighHigh 419.16 (5.4%) 412.88 (5.4%) -1.5% ( -11% - 9%) 0.381
AndHighHigh 43.95 (5.4%) 43.42 (3.8%) -1.2% ( -9% - 8%) 0.413
HighIntervalsOrdered 4.60 (5.6%) 4.55 (6.0%) -1.1% ( -12% - 11%) 0.561
HighTermMonthSort 13015.73 (3.3%) 12889.39 (4.0%) -1.0% ( -7% - 6%) 0.401
Fuzzy1 285.37 (3.3%) 282.87 (2.2%) -0.9% ( -6% - 4%) 0.320
AndHighHighDayTaxoFacets 9.00 (4.6%) 8.92 (4.8%) -0.8% ( -9% - 8%) 0.577
HighSpanNear 4.59 (2.8%) 4.55 (2.8%) -0.8% ( -6% - 4%) 0.356
OrNotHighLow 1091.32 (2.8%) 1082.56 (3.3%) -0.8% ( -6% - 5%) 0.402
LowSpanNear 12.72 (2.4%) 12.64 (2.5%) -0.7% ( -5% - 4%) 0.377
HighTermTitleBDVSort 5.69 (2.4%) 5.65 (2.5%) -0.7% ( -5% - 4%) 0.387
Fuzzy2 165.14 (2.4%) 164.10 (1.8%) -0.6% ( -4% - 3%) 0.343
MedSpanNear 12.61 (1.9%) 12.53 (2.2%) -0.6% ( -4% - 3%) 0.327
HighTerm 578.67 (6.5%) 575.09 (5.1%) -0.6% ( -11% - 11%) 0.739
MedTermDayTaxoFacets 19.91 (5.1%) 19.79 (4.3%) -0.6% ( -9% - 9%) 0.683
OrHighMedDayTaxoFacets 3.43 (12.0%) 3.41 (11.4%) -0.5% ( -21% - 25%) 0.885
MedTerm 748.03 (5.6%) 744.26 (4.5%) -0.5% ( -10% - 10%) 0.755
AndHighMedDayTaxoFacets 32.72 (1.5%) 32.57 (1.6%) -0.5% ( -3% - 2%) 0.335
LowTerm 526.39 (2.7%) 524.40 (2.7%) -0.4% ( -5% - 5%) 0.653
HighSloppyPhrase 9.56 (2.3%) 9.53 (2.9%) -0.3% ( -5% - 5%) 0.689
HighPhrase 33.59 (3.8%) 33.49 (3.8%) -0.3% ( -7% - 7%) 0.803
MedIntervalsOrdered 14.95 (4.2%) 14.92 (4.4%) -0.2% ( -8% - 8%) 0.876
BrowseDayOfYearSSDVFacets 6.37 (1.4%) 6.36 (1.3%) -0.2% ( -2% - 2%) 0.665
MedSloppyPhrase 15.68 (1.6%) 15.65 (2.8%) -0.2% ( -4% - 4%) 0.799
AndHighMed 109.07 (3.9%) 108.94 (3.2%) -0.1% ( -6% - 7%) 0.915
LowSloppyPhrase 12.64 (1.5%) 12.63 (2.2%) -0.1% ( -3% - 3%) 0.855
OrNotHighMed 377.59 (3.0%) 377.64 (3.3%) 0.0% ( -6% - 6%) 0.989
BrowseMonthSSDVFacets 6.65 (0.4%) 6.65 (0.4%) 0.0% ( 0% - 0%) 0.829
OrHighLow 541.18 (3.3%) 541.47 (3.3%) 0.1% ( -6% - 6%) 0.960
LowIntervalsOrdered 12.42 (4.0%) 12.43 (3.9%) 0.1% ( -7% - 8%) 0.948
BrowseRandomLabelSSDVFacets 5.35 (1.3%) 5.36 (1.1%) 0.1% ( -2% - 2%) 0.809
MedPhrase 47.26 (2.3%) 47.31 (2.4%) 0.1% ( -4% - 4%) 0.889
Prefix3 246.21 (4.9%) 246.48 (7.3%) 0.1% ( -11% - 12%) 0.956
LowPhrase 34.24 (2.5%) 34.30 (2.4%) 0.2% ( -4% - 5%) 0.824
HighTermTitleSort 98.67 (2.9%) 98.84 (4.6%) 0.2% ( -7% - 7%) 0.884
OrHighHigh 42.99 (8.8%) 43.10 (8.6%) 0.3% ( -15% - 19%) 0.924
BrowseDateSSDVFacets 2.24 (13.7%) 2.25 (13.2%) 0.6% ( -23% - 31%) 0.895
Respell 255.75 (1.7%) 257.22 (1.7%) 0.6% ( -2% - 4%) 0.281
BrowseDayOfYearTaxoFacets 7.39 (1.9%) 7.44 (4.1%) 0.7% ( -5% - 6%) 0.489
AndHighLow 961.38 (3.8%) 970.95 (4.1%) 1.0% ( -6% - 9%) 0.427
BrowseDateTaxoFacets 7.08 (1.3%) 7.16 (4.6%) 1.1% ( -4% - 7%) 0.291
TermDTSort 77.89 (7.4%) 78.84 (4.8%) 1.2% ( -10% - 14%) 0.535
PKLookup 250.25 (3.3%) 253.54 (2.7%) 1.3% ( -4% - 7%) 0.169
HighTermDayOfYearSort 107.64 (2.6%) 109.12 (2.1%) 1.4% ( -3% - 6%) 0.064
OrHighMed 109.62 (6.2%) 111.24 (6.1%) 1.5% ( -10% - 14%) 0.447
BrowseRandomLabelTaxoFacets 6.17 (5.0%) 6.27 (3.9%) 1.6% ( -6% - 11%) 0.253
Wildcard 183.93 (3.7%) 187.03 (4.4%) 1.7% ( -6% - 10%) 0.193
LongNRQ 42.83 (6.1%) 83.47 (34.5%) 94.9% ( 51% - 144%) 0.000
IntNRQ 20.29 (3.4%) 53.24 (34.2%) 162.4% ( 120% - 207%) 0.000
```
The interesting parts here are the bottom two lines - IntNRQ and LongNRQ become much faster. It might be that I messed up the "minimal" part of the reproduction - maybe all that was needed is the single segment and the `taskCountPerCat` increase.
Regardless, it looks promising - a 94% to 162% increase in QPS for range queries with this PR in the slightly modified benchmark.
Would we be subject to the same issue if/when 3+ different implementations of `DocIdSetIterator` get used in `IntersectVisitor#visit`?
> Would we be subject to the same issue if/when 3+ different implementations of `DocIdSetIterator` get used in `IntersectVisitor#visit`?
Yes.
Your question makes me think that maybe we should not add a new `DocIdSetIterator` for this PR - maybe that was your thought as well?
Relatedly indeed, I was wondering if the API should expose an `IntsRef` or something like that, so that there is a single virtual call per block of doc IDs anyway (`IntsRef` cannot be extended, it's a single impl).
@jpountz I had a quick look at the code, and it seems to me like there are, with this PR, only two implementations used for the DISI passed to the `IntersectVisitor#visit` method - which is good in the sense that it would probably be fine to merge it now, but bad in the sense that adding a third would hurt performance.
Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?
++ on progress over perfection
That said, I wonder if this change is legal: `DocIdSetIterator` must return doc IDs in order, but it looks like it wouldn't always be the case with your change?
> That said, I wonder if this change is legal: `DocIdSetIterator` must return doc IDs in order, but it looks like it wouldn't always be the case with your change?
You are correct - I had (falsely) assumed that the document ids in the DocIdsWriter were written in order - thus the PR as of now (dacb2a5d3ad707a555ed1fa2925a082d81aa4443) is not a legal change.
The good news is that this should not have affected the benchmarks - neither the `PointRangeQuery` nor the `NumericLeafComparator` seems to rely on the DISI being ordered. They just iterate until `DocIdSetIterator.NO_MORE_DOCS`.
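For illustration, both consumers drain the iterator with the standard exhaust-until-done idiom, which is order-insensitive (a sketch; `disi` and `bitSet` are illustrative names):

```java
// Order-insensitive consumption: each doc ID is handled exactly once,
// no matter what order the iterator produces them in.
int doc;
while ((doc = disi.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  bitSet.set(doc);
}
```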
I will try adding a `visit()` method taking an `IntsRef` (I believe that is what you meant @jpountz?).
I've changed the readInts16/32 methods to now use the `IntsRef` instead - I kind of like how it turned out, but the downside is that we will have to add many implementations to see the benefits. I've added the 2 in the `PointRangeQuery` - if we want to continue down this path, I can add more.
One thing that I would love input on is whether or not to manually inline the `visit(docid)` version. E.g. I now do:
```java
for (int i = ref.offset; i < ref.offset + ref.length; i++) {
  result.clear(ref.ints[i]);
  cost[0]--;
}
```
While we could do:
```java
for (int i = ref.offset; i < ref.offset + ref.length; i++) {
  visit(ref.ints[i]);
}
```
The latter would, most likely, be inlined by the JVM. The former is manually inlined, so we don't need to trust the JVM on it. Thoughts?
It is important to note that just having the default method in `IntersectVisitor` would not be good, since that would do virtual calls to `visit(int)`. Copy-pasting that method body into each implementation would however, most likely, inline better.
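Concretely, the default method in question looks roughly like this (a sketch of the shape, not necessarily the exact PR code):

```java
// As a default method on IntersectVisitor, the inner visit(int) call
// is still virtual - and megamorphic once 3+ visitor implementations
// are hot - so each implementation should override this with its own
// copy of the loop to let the JIT inline the per-doc work.
default void visit(IntsRef ref) throws IOException {
  for (int i = ref.offset; i < ref.offset + ref.length; i++) {
    visit(ref.ints[i]);
  }
}
```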
I benchmarked the current commit (e826696fd235cb1ef215745944cac6e34faa0d20) using the same modified lucenebench as above, and it seems to like this change even better:
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
LongNRQ 41.11 (6.6%) 107.13 (13.4%) 160.6% ( 131% - 193%) 0.000
IntNRQ 23.23 (7.0%) 85.58 (19.0%) 268.4% ( 226% - 316%) 0.000
```
(the baseline being master)
> I will try adding a `visit()` method taking an `IntsRef` (I believe that is what you meant @jpountz?).
This is what I meant indeed.
Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2?
> Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2?
Valid questions.
I ran the same luceneutil benchmark with `-XX:+LogCompilation` and `-XX:+PrintInlining` and I can see c2/level4 compilations being done, including `BKDPointTree::addAll` and `DocIdsWriter::readInts32`. The Elasticsearch benchmarks mentioned in my post (where we saw similar results) were also run long enough that I would be very surprised if c2 had not kicked in.
I also tried reproducing this behavior in a shorter java program for demonstration purposes - with virtual calls similar to the `IntersectVisitor`. The program has 3 implementations of an interface processing `int` values, with both single value and batch implementations:
```java
import java.util.Random;

public class C2Inlining {
  static int ITERATIONS = 50_000;
  static int NUM_VALUES = 100_000;

  public static void main(String[] args) {
    // Generate ints
    Random r = new Random();
    int[] arr = new int[NUM_VALUES];
    for (int i = 0; i < NUM_VALUES; i++) {
      arr[i] = r.nextInt();
    }
    IntProcessor[] intProcessors = {
      new IntProcessor1(),
      new IntProcessor2(),
      // Comment this last one out to trigger bimorphic behaviour
      new IntProcessor3()
    };
    processOneByOne(intProcessors, arr);
    processBatch(intProcessors, arr);
  }

  private static void processOneByOne(IntProcessor[] intProcessors, int[] arr) {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      for (IntProcessor intProcessor : intProcessors) {
        for (int value : arr) {
          intProcessor.process(value);
        }
      }
    }
    long end = System.nanoTime();
    long took = end - start;
    System.out.printf("One-by-one: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
  }

  private static void processBatch(IntProcessor[] intProcessors, int[] arr) {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      for (IntProcessor intProcessor : intProcessors) {
        intProcessor.process(arr);
      }
    }
    long end = System.nanoTime();
    long took = end - start;
    System.out.printf("Batch: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
  }

  interface IntProcessor {
    void process(int i);

    void process(int[] arr);
  }

  static class IntProcessor1 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }

  static class IntProcessor2 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }

  static class IntProcessor3 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }
}
```
I ran this in this form, and with one implementation commented out (to trigger the bimorphic inlining behavior). The timing results are quite extreme (running with jdk21 and `-server`):
| Variant | One-by-One | Batch |
|---|---|---|
| Three implementations | 1.101 ms/iter | 0.018 ms/iter |
| Two implementations | 0.198 ms/iter | 0.012 ms/iter |
I.e. with three implementations, batching is ~60 times faster than one-by-one calls. With two, it is ~16 times faster. Both versions saw c2 compilations.
This is the same pattern that we see with the Lucene code - although the difference here is much more extreme (I'm guessing due to the trivial implementations). I do think that we can draw the conclusion that helping the JVM with batch versions of virtual calls can help performance significantly.
One might argue that the JVM should be able to figure this out on its own. But until it does, let's maybe help it a bit?
Ideally we'd add a query that adds another implementation of an `IntersectVisitor` to the nightly benchmarks before merging that PR, so that we can see the performance bump?
Yes - this would be ideal imo. Some caveats that I found while testing:

- `PointRangeQuery`s are executed with different `IntersectVisitor`s depending on the expected number of matches, which means that several query intervals need to be used in each JVM. In luceneutil, I believe that this corresponds to setting `taskCountPerCat` high.
- To see significant change, the number of documents in the segments needs to fit into either the 16 bit or 32 bit implementation of `bkd.DocIdsWriter` - this was not the case with the simple `localrun.py` script out of the box (see the sketch after this list).
- It is possible that we see a performance decrease when we add more `IntersectVisitor` variants, since the benchmark becomes more realistically slow with the additional implementations.
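To make the second caveat concrete, a rough sketch of the encoding choice (heavily simplified - the real conditions live in Lucene's `bkd.DocIdsWriter` and differ in detail; `sorted`, `rangeFitsIn16Bits` and `maxBitsRequired` are hypothetical helpers):

```java
// Per BKD leaf, doc IDs are written with the narrowest encoding that
// fits, and only some of the corresponding read paths are optimized
// by this PR - hence segment size decides whether the change shows up.
void writeDocIds(int[] docIds, int start, int count) throws IOException {
  if (sorted(docIds, start, count) && rangeFitsIn16Bits(docIds, start, count)) {
    writeDelta16(docIds, start, count); // optimized read path (readDelta16)
  } else if (maxBitsRequired(docIds, start, count) <= 24) {
    writeInts24(docIds, start, count);  // untouched by this PR
  } else {
    writeInts32(docIds, start, count);  // optimized read path (readInts32)
  }
}
```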
> Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?
+1 -- PNP!
> In luceneutil, I believe that this corresponds to setting `taskCountPerCat` high.
The nightly benchy uses `taskCountPerCat=5` -- is that enough to reveal the impact of this optimization?
But, the nightly benchy does not `forceMerge` down to one segment, so we may not have the right number of docs to use 16 or 32 bit int encoding in the BKD file.
Actually, I don't think we should block this change on fixing nightly benchy to reveal its gains? Your above luceneutil repro is sufficient? I'm wary of allowing benchmarking to unduly hold up otherwise good Lucene changes in general... it's a helpful tool, not a crutch ;)
Thank you all for helping to get this PR into a better state. The only thing that still irked me is that I couldn't get the tests to execute all newly added methods. I made a few changes in the last commit to:

- Add a new test case which calls `doTestRandomLongs` with 20 000 docs - without this I couldn't get the IntsRef path to trigger often enough. This should maybe be a `@Nightly`? (See the sketch after this list.)
- Add a higher chance of having 0 missing values in `doTestRandomLongs` - otherwise the inverse intersect visitor would only have a chance of happening every 100 runs, which I thought a bit low.
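A sketch of what the first change could look like (hedged - the actual test name and location in the PR may differ):

```java
// Hypothetical test method: reuse the existing randomized long-range
// test with enough documents that BKD leaves hit the IntsRef read path.
public void testRandomLongsWithManyDocs() throws Exception {
  doTestRandomLongs(20_000);
}
```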
I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes? After this, what's the protocol? Do I merge the PR myself or should a committer push the button?
> Add a new test case which calls `doTestRandomLongs` with 20 000 docs - without this I couldn't get the IntsRef path to trigger often enough. This should maybe be a `@Nightly`?
How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...
> I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes?
I'll have a look!
> Do I merge the PR myself or should a committer push the button?
One of us committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge.
> Add a new test case which calls `doTestRandomLongs` with 20 000 docs - without this I couldn't get the IntsRef path to trigger often enough. This should maybe be a `@Nightly`?

> How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...
I timed this a bit ... looks like it's ~.25 - .5 seconds. I think that's OK to run always.
> One of us committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge.
Sounds great - the javadoc fix is done. Thanks a lot for having a look and timing the test!
Thanks @antonha -- sorry, maybe also add a `CHANGES.txt` entry? This is an exciting opto!
> Thanks @antonha -- sorry, maybe also add a `CHANGES.txt` entry? This is an exciting opto!
My bad - I should have added one the first time you asked :see_no_evil:. I've added one now, let me know if you feel like it lacks something!
OK, I merged and backported to 9.11.0 -- I think that's safe: we added a new default method to `IntersectVisitor`.