Initial rewrite of MMapDirectory for JDK-17 preview (incubating) Panama APIs (>= JDK-17-ea-b25)
INFO: This is a followup of #173: It's the same code base, but with API changes from JDK 17 applied
This is just a draft PR for a first insight on memory mapping improvements in JDK 17+.
Some background information: Starting with JDK-14, there is a new incubating module "jdk.incubator.foreign" that has a new, not yet stable API for accessing off-heap memory (and later it will also support calling functions using classical MethodHandles that are located in libraries like .so or .dll files). This incubator module has several versions:
- first version: https://openjdk.java.net/jeps/370 (slow, very buggy and thread confinement, so making it unuseable with Lucene)
- second version: https://openjdk.java.net/jeps/383 (still thread confinement, but now allows transfer of "ownership" to other threads; this is still impossible to use with Lucene.
- third version in JDK 16: https://openjdk.java.net/jeps/393 (this version has included "Support for shared segments"). This now allows us to safely use the same external mmaped memory from different threads and also unmap it! This was implemented in the previous pull request #173
- fourth version in JDK 17, included in build 25: https://openjdk.java.net/jeps/412 (actual version). This mainly changes the API around the scopes. Instead of having segments explicitely made "shared", we can assign them to some resource scope which control their behaviour. The resourceScope is produced one time for each IndexInput instance (not clones) and owns all segments. When the resourceScope is closed, all segments get invalid - and we throw
AlreadyClosedException.
This module more or less overcomes several problems:
- ByteBuffer API is limited to 32bit (in fact MMapDirectory has to chunk in 1 GiB portions)
- There is no official way to unmap ByteBuffers when the file is no longer used. There is a way to use
sun.misc.Unsafeand forcefully unmap segments, but any IndexInput accessing the file from another thread will crush the JVM with SIGSEGV or SIGBUS. We learned to live with that and we happily apply the unsafe unmapping, but that's the main issue.
@uschindler had many discussions with the team at OpenJDK and finally with the third incubator, we have an API that works with Lucene. It was very fruitful discussions (thanks to @mcimadamore !)
With the third incubator we are now finally able to do some tests (especially performance). As this is an incubating module, this PR first changes a bit the build system:
- disable
-Werrorfor:lucene:core - add the incubating module to compiler of
:lucene:coreand enable it for all test builds. This is important, as you have to pass--add-modules jdk.incubator.foreignalso at runtime!
The code basically just modifies MMapDirectory to use LONG instead of INT for the chunk size parameter. In addition it adds MemorySegmentIndexInput that is a copy of our ByteBufferIndexInput (still there, but unused), but using MemorySegment instead of ByteBuffer behind the scenes. It works in exactly the same way, just the try/catch blocks for supporting EOFException or moving to another segment were rewritten.
It passes all tests and it looks like you can use it to read indexes. The default chunk size is now 16 GiB (but you can raise or lower it as you like; tests are doing this). Of course you can set it to Long.MAX_VALUE, in that case every index file is always mapped to one big memory mapping. My testing with Windows 10 have shown, that this is not a good idea!!!. Huge mappings fragment address space over time and as we can only use like 43 or 46 bits (depending on OS), the fragmentation will at some point kill you. So 16 GiB looks like a good compromise: Most files will be smaller than 6 GiB anyways (unless you optimize your index to one huge segment). So for most Lucene installations, the number of segments will equal the number of open files, so Elasticsearch huge user consumers will be very happy. The sysctl max_map_count may not need to be touched anymore.
In addition, this implements readLongs in a better way than @jpountz did (no caching or arbitrary objects). Nevertheless, as the new MemorySegment API relies on final, unmodifiable classes and coping memory from a MemorySegment to a on-heap Java array, it requires us to wrap all those arrays using a MemorySegment each time (e.g. in readBytes() or readLELongs), there may be some overhead du to short living object allocations (those are NOT reuseable!!!). In short: In future we should throw away on coping/loading our stuff to heap and maybe throw away IndexInput completely and base our code fully on random access. The new foreign-vector APIs will in future also be written with MemorySegment in its focus. So you can allocate a vector view on a MemorySegment and let the vectorizer fully work outside java heap inside our mmapped files! :-)
It would be good if you could checkout this branch and try it in production.
But be aware:
- You need JDK 17 to compile and run with Gradle (set
JAVA_HOMEto it) - The lucene-core.jar will be JDK17 class files and requires JDK-17 to execute.
- Also you need to add
--add-modules jdk.incubator.foreignto the command line of your Java program/Solr server/Elasticsearch server
It would be good to get some benchmarks, especially by @rmuir or @mikemccand.
My plan is the following:
- report any bugs or slowness, especially with Hotspot optimizations. The last time I talked to Maurizio, he taked about Hotspot not being able to fully optimize for-loops with long instead of int, so it may take some time until the full performance is there.
- wait until the final version of project PANAMA-foreign goes into Java's Core Library (no module needed anymore)
- add a MR-JAR for lucene-core.jar and compile the MemorySegmentIndexInput and maybe some helper classes with JDK 18/19 (hopefully?).
Linux Jenkins tests: https://jenkins.thetaphi.de/job/Lucene-jdk17panama-Linux/ Windows Jenkins tests: https://jenkins.thetaphi.de/job/Lucene-jdk17panama-Windows/
Hi, I executed luceneutil bench using jdk-17-ea+b25:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
BrowseMonthTaxoFacets 1.00 (4.8%) 0.71 (6.7%) -29.1% ( -38% - -18%) 0.000
BrowseDayOfYearTaxoFacets 0.98 (5.9%) 0.70 (7.3%) -28.1% ( -38% - -15%) 0.000
BrowseDateTaxoFacets 0.98 (5.9%) 0.70 (7.2%) -28.1% ( -38% - -15%) 0.000
AndHighLow 326.58 (6.1%) 288.18 (3.6%) -11.8% ( -20% - -2%) 0.000
OrHighLow 203.59 (6.2%) 182.35 (3.3%) -10.4% ( -18% - 0%) 0.000
AndHighMed 51.56 (6.6%) 46.25 (4.8%) -10.3% ( -20% - 1%) 0.000
Respell 35.89 (1.5%) 32.70 (1.8%) -8.9% ( -12% - -5%) 0.000
LowSloppyPhrase 13.87 (3.4%) 12.77 (3.1%) -7.9% ( -13% - -1%) 0.000
PKLookup 189.60 (2.6%) 174.64 (2.7%) -7.9% ( -12% - -2%) 0.000
LowSpanNear 20.18 (3.1%) 18.76 (2.2%) -7.0% ( -11% - -1%) 0.000
Fuzzy1 59.60 (6.5%) 55.88 (6.2%) -6.2% ( -17% - 6%) 0.002
Fuzzy2 49.20 (9.0%) 46.35 (7.5%) -5.8% ( -20% - 11%) 0.026
AndHighHigh 28.28 (4.6%) 26.69 (3.6%) -5.6% ( -13% - 2%) 0.000
MedSpanNear 14.89 (3.4%) 14.16 (3.4%) -4.9% ( -11% - 1%) 0.000
MedSloppyPhrase 52.15 (4.9%) 49.89 (4.2%) -4.3% ( -12% - 5%) 0.003
OrHighMed 40.28 (2.9%) 38.64 (2.7%) -4.1% ( -9% - 1%) 0.000
MedPhrase 247.62 (3.3%) 238.13 (3.0%) -3.8% ( -9% - 2%) 0.000
HighTermDayOfYearSort 21.04 (16.5%) 20.33 (13.5%) -3.4% ( -28% - 31%) 0.475
OrNotHighLow 457.51 (4.1%) 442.74 (2.9%) -3.2% ( -9% - 3%) 0.004
OrHighHigh 7.00 (3.0%) 6.78 (3.0%) -3.0% ( -8% - 2%) 0.001
HighSpanNear 1.77 (3.4%) 1.72 (3.2%) -2.8% ( -9% - 3%) 0.006
Wildcard 55.88 (3.4%) 54.34 (3.2%) -2.8% ( -9% - 4%) 0.009
BrowseMonthSSDVFacets 4.18 (5.4%) 4.08 (5.2%) -2.5% ( -12% - 8%) 0.140
HighSloppyPhrase 6.16 (7.9%) 6.01 (6.0%) -2.4% ( -15% - 12%) 0.278
TermDTSort 56.84 (19.7%) 55.64 (15.8%) -2.1% ( -31% - 41%) 0.707
BrowseDayOfYearSSDVFacets 4.06 (3.8%) 3.97 (4.0%) -2.1% ( -9% - 5%) 0.096
LowPhrase 38.74 (3.7%) 37.95 (2.7%) -2.1% ( -8% - 4%) 0.047
Prefix3 59.38 (5.9%) 58.16 (6.0%) -2.0% ( -13% - 10%) 0.273
HighIntervalsOrdered 5.89 (6.2%) 5.78 (5.1%) -2.0% ( -12% - 9%) 0.274
HighTermTitleBDVSort 34.99 (13.7%) 34.60 (13.1%) -1.1% ( -24% - 29%) 0.795
HighTermMonthSort 56.96 (16.7%) 57.28 (14.7%) 0.6% ( -26% - 38%) 0.910
OrHighNotLow 550.54 (5.1%) 558.86 (5.0%) 1.5% ( -8% - 12%) 0.347
IntNRQ 33.75 (30.5%) 34.28 (29.5%) 1.6% ( -44% - 88%) 0.870
HighTerm 1141.71 (6.2%) 1163.18 (4.6%) 1.9% ( -8% - 13%) 0.273
HighPhrase 12.90 (3.9%) 13.15 (4.3%) 1.9% ( -6% - 10%) 0.136
OrHighNotHigh 640.96 (6.4%) 665.45 (3.8%) 3.8% ( -5% - 14%) 0.022
OrNotHighMed 692.81 (4.4%) 726.70 (4.6%) 4.9% ( -3% - 14%) 0.001
OrNotHighHigh 698.56 (4.5%) 739.13 (3.7%) 5.8% ( -2% - 14%) 0.000
MedTerm 1594.10 (5.5%) 1700.62 (4.6%) 6.7% ( -3% - 17%) 0.000
LowTerm 1531.60 (5.2%) 1647.26 (4.6%) 7.6% ( -2% - 18%) 0.000
OrHighNotMed 911.74 (5.3%) 990.26 (4.4%) 8.6% ( -1% - 19%) 0.000
For those interested:
JAVA:
WARNING: Using incubator modules: jdk.incubator.foreign
openjdk version "17-ea" 2021-09-14
OpenJDK Runtime Environment (build 17-ea+25-2252)
OpenJDK 64-Bit Server VM (build 17-ea+25-2252, mixed mode, sharing)
OS:
Linux serv1.sd-datasolutions.de 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
LOGS:
/home/thetaphi/benchmark/logs
My localconstants.py, so that's how I added the module command line params. @mikemccand It would be better to have the tool respect RUNTIME_JAVA_HOME like gradlew does.
BASE_DIR = '/home/thetaphi/benchmark'
BENCH_BASE_DIR = '/home/thetaphi/benchmark/util'
JAVA_HOME = '/home/jenkins/tools/java/64bit/latest-jdk17'
JAVA_EXE = '%s/bin/java' %JAVA_HOME
JAVAC_EXE = '%s/bin/javac' %JAVA_HOME
JAVA_COMMAND = '%s --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError
Executed this with all defaults in localrun.py:
$ python3 src/python/localrun.py -source wikimediumall
I have no idea why the facetting stuff at the beginning of the bench output is so badly behaving with MMapDirectory#v2 on top of project panama, I'll ignore this for now. Maybe @mikemccand has an idea! The rest looks perfectly fine to me from the performance (19 runs).
After analyzing the heap dumps provided by JFR, I was able to figure out what the problem is.
Basically, all native, VarHandle backed methods are fast and optimize nice. But all "bulk" read methods are slow and produce a lot of garbage on heap:
The problem is that we can't copy from a memory segment to a heap byte[] or heap float[] natively! So the whole wrapping with MemorySegment.ofArray(), applying offsets and length, copy memory makes the whole thing produce too much garbage, because it looks like Hotspot isn't able to remove the object allocation.
I did a quick test:
- I removed bulk
readLongs()and bulkreadFloats()from source code and let it fall through to the simple readLong/readFloat loop of DataInput and the slowdown suddely was going to zero! - I replaced the readBytes() code also by a simple loop in the case the byte[] length to read is < 4096 bytes
See this hack commit (not in this branch): https://github.com/uschindler/lucene/commit/9b328a74746a04351a99f33f82515122b06d5baa
The speed is much better, but some tasks are slower now, but this is related because now stuff like copying or uncompressing stored fields may be slower partially.
With my patch to bulk methods results look like this:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Respell 55.14 (0.9%) 40.97 (0.9%) -25.7% ( -27% - -24%) 0.000
Fuzzy1 66.39 (6.3%) 51.40 (2.7%) -22.6% ( -29% - -14%) 0.000
Fuzzy2 63.82 (6.7%) 50.99 (2.3%) -20.1% ( -27% - -11%) 0.000
PKLookup 190.65 (1.6%) 172.48 (2.1%) -9.5% ( -13% - -5%) 0.000
Wildcard 12.84 (6.3%) 11.62 (4.0%) -9.5% ( -18% - 0%) 0.004
HighTermMonthSort 61.20 (19.6%) 56.61 (15.7%) -7.5% ( -35% - 34%) 0.505
HighSloppyPhrase 1.93 (5.5%) 1.80 (10.8%) -6.8% ( -21% - 10%) 0.213
MedSloppyPhrase 8.01 (3.2%) 7.47 (8.4%) -6.6% ( -17% - 5%) 0.099
LowSloppyPhrase 11.52 (2.7%) 10.79 (5.4%) -6.3% ( -14% - 1%) 0.020
HighTermTitleBDVSort 76.19 (8.5%) 72.67 (9.6%) -4.6% ( -20% - 14%) 0.421
HighTermDayOfYearSort 59.52 (7.0%) 57.12 (7.7%) -4.0% ( -17% - 11%) 0.386
Prefix3 67.21 (13.5%) 64.93 (14.0%) -3.4% ( -27% - 27%) 0.696
TermDTSort 58.93 (12.0%) 56.98 (15.9%) -3.3% ( -27% - 27%) 0.711
MedSpanNear 5.41 (1.5%) 5.24 (1.3%) -3.1% ( -5% - 0%) 0.001
LowSpanNear 21.34 (0.6%) 20.72 (1.2%) -2.9% ( -4% - -1%) 0.000
OrHighHigh 15.81 (1.1%) 15.35 (2.5%) -2.9% ( -6% - 0%) 0.019
BrowseMonthSSDVFacets 4.29 (2.8%) 4.16 (4.1%) -2.8% ( -9% - 4%) 0.201
OrHighMed 18.71 (1.6%) 18.19 (2.7%) -2.8% ( -6% - 1%) 0.046
LowPhrase 139.63 (1.9%) 136.16 (1.3%) -2.5% ( -5% - 0%) 0.018
MedPhrase 114.24 (2.2%) 111.48 (1.0%) -2.4% ( -5% - 0%) 0.025
HighPhrase 180.83 (2.5%) 176.61 (1.4%) -2.3% ( -6% - 1%) 0.065
AndHighLow 496.29 (4.3%) 484.90 (2.4%) -2.3% ( -8% - 4%) 0.300
HighSpanNear 9.11 (2.4%) 8.94 (2.5%) -1.9% ( -6% - 3%) 0.233
BrowseDayOfYearSSDVFacets 4.14 (2.5%) 4.08 (0.4%) -1.5% ( -4% - 1%) 0.172
IntNRQ 38.03 (0.6%) 37.51 (1.4%) -1.4% ( -3% - 0%) 0.047
OrNotHighLow 626.89 (4.7%) 618.44 (5.3%) -1.3% ( -10% - 9%) 0.669
OrHighLow 280.08 (2.5%) 276.64 (3.2%) -1.2% ( -6% - 4%) 0.494
AndHighHigh 30.63 (2.5%) 30.30 (2.3%) -1.1% ( -5% - 3%) 0.468
AndHighMed 35.60 (2.1%) 35.23 (1.7%) -1.0% ( -4% - 2%) 0.385
MedTerm 1212.11 (3.4%) 1204.02 (6.3%) -0.7% ( -10% - 9%) 0.836
HighIntervalsOrdered 10.38 (3.7%) 10.42 (5.5%) 0.4% ( -8% - 9%) 0.880
OrHighNotLow 738.69 (3.0%) 744.71 (7.0%) 0.8% ( -8% - 11%) 0.811
LowTerm 1563.73 (2.3%) 1577.33 (4.4%) 0.9% ( -5% - 7%) 0.696
OrHighNotHigh 725.65 (3.8%) 732.67 (6.1%) 1.0% ( -8% - 11%) 0.763
OrNotHighMed 681.22 (1.2%) 688.45 (4.5%) 1.1% ( -4% - 6%) 0.609
OrHighNotMed 672.06 (3.8%) 679.93 (6.9%) 1.2% ( -9% - 12%) 0.741
HighTerm 1615.37 (4.1%) 1636.17 (4.2%) 1.3% ( -6% - 9%) 0.623
OrNotHighHigh 634.61 (2.5%) 645.91 (7.2%) 1.8% ( -7% - 11%) 0.600
BrowseDayOfYearTaxoFacets 0.99 (5.4%) 1.21 (7.8%) 21.8% ( 8% - 37%) 0.000
BrowseDateTaxoFacets 0.99 (5.2%) 1.21 (7.9%) 22.0% ( 8% - 37%) 0.000
BrowseMonthTaxoFacets 1.02 (4.1%) 1.25 (6.3%) 22.8% ( 11% - 34%) 0.000
No idea why this happens.
I did more testig and was able to make MMapDirectory work the same speed. This patch fixing the perf issue shows the problems: https://github.com/uschindler/lucene/commit/b057213cc6548ffa29b8e2810f39eb84c50a3bc0
The main issue can be seen in the copying, still used by Lucene between mmaped segment and heap. Here is the modified code by the patch using sun.misc.Unsafe to copy memory instead of the code commented out:
private static void copySegmentToHeap(MemorySegment src, long srcOffset, byte[] target, int targetOffset, int len) {
Objects.checkFromIndexSize(srcOffset, len, src.byteSize());
theUnsafe.copyMemory(null, src.address().toRawLongValue() + srcOffset,
target, Unsafe.ARRAY_BYTE_BASE_OFFSET + targetOffset, len);
//MemorySegment.ofArray(target).asSlice(targetOffset, len).copyFrom(src.asSlice(srcOffset, len));
}
If you do the memory copy using Unsafe as this static method shows, the performance is identical:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 186.78 (2.2%) 180.11 (1.4%) -3.6% ( -7% - 0%) 0.000
BrowseMonthSSDVFacets 4.14 (6.0%) 4.05 (6.3%) -2.3% ( -13% - 10%) 0.238
BrowseDayOfYearSSDVFacets 4.03 (4.4%) 3.94 (4.7%) -2.2% ( -10% - 7%) 0.123
Prefix3 5.43 (10.4%) 5.36 (11.7%) -1.3% ( -21% - 23%) 0.708
IntNRQ 41.13 (2.0%) 40.61 (1.8%) -1.2% ( -4% - 2%) 0.040
LowSloppyPhrase 5.85 (4.2%) 5.79 (3.8%) -1.0% ( -8% - 7%) 0.411
HighSloppyPhrase 9.74 (4.1%) 9.64 (3.9%) -1.0% ( -8% - 7%) 0.416
HighPhrase 226.76 (2.7%) 225.13 (2.9%) -0.7% ( -6% - 5%) 0.417
MedPhrase 204.52 (2.7%) 203.77 (2.5%) -0.4% ( -5% - 4%) 0.656
Fuzzy1 84.24 (4.5%) 84.20 (5.9%) -0.0% ( -9% - 10%) 0.979
LowPhrase 23.33 (2.6%) 23.33 (2.7%) -0.0% ( -5% - 5%) 0.986
Respell 34.95 (1.5%) 34.97 (1.6%) 0.1% ( -3% - 3%) 0.898
HighTerm 1128.45 (4.8%) 1129.85 (4.7%) 0.1% ( -8% - 10%) 0.934
HighTermMonthSort 86.97 (17.0%) 87.24 (14.5%) 0.3% ( -26% - 38%) 0.951
OrHighNotLow 913.63 (4.0%) 919.28 (3.5%) 0.6% ( -6% - 8%) 0.603
OrNotHighLow 630.14 (2.8%) 635.01 (3.0%) 0.8% ( -4% - 6%) 0.401
AndHighHigh 42.87 (4.2%) 43.22 (4.4%) 0.8% ( -7% - 9%) 0.550
MedTerm 1504.76 (4.3%) 1519.72 (3.8%) 1.0% ( -6% - 9%) 0.440
MedSloppyPhrase 38.70 (3.2%) 39.11 (2.6%) 1.1% ( -4% - 7%) 0.251
AndHighLow 464.18 (4.4%) 469.92 (4.6%) 1.2% ( -7% - 10%) 0.388
OrNotHighMed 762.97 (2.8%) 773.13 (3.2%) 1.3% ( -4% - 7%) 0.162
OrHighLow 290.51 (4.7%) 294.49 (4.3%) 1.4% ( -7% - 10%) 0.339
Fuzzy2 51.19 (8.4%) 51.90 (7.2%) 1.4% ( -13% - 18%) 0.576
LowTerm 1892.07 (2.9%) 1919.02 (4.0%) 1.4% ( -5% - 8%) 0.199
AndHighMed 43.09 (5.0%) 43.76 (4.7%) 1.6% ( -7% - 11%) 0.311
OrHighNotMed 708.72 (3.0%) 719.93 (2.7%) 1.6% ( -4% - 7%) 0.084
OrHighNotHigh 791.29 (4.2%) 803.91 (3.7%) 1.6% ( -6% - 9%) 0.204
OrHighMed 37.87 (4.8%) 38.52 (3.2%) 1.7% ( -5% - 10%) 0.179
OrNotHighHigh 755.35 (3.4%) 769.72 (3.6%) 1.9% ( -4% - 9%) 0.084
HighTermDayOfYearSort 66.49 (17.2%) 67.81 (15.3%) 2.0% ( -25% - 41%) 0.700
HighIntervalsOrdered 5.10 (4.0%) 5.20 (3.9%) 2.0% ( -5% - 10%) 0.108
HighTermTitleBDVSort 71.78 (17.6%) 73.23 (16.8%) 2.0% ( -27% - 44%) 0.712
OrHighHigh 9.62 (6.7%) 9.83 (3.2%) 2.2% ( -7% - 13%) 0.181
Wildcard 13.65 (9.0%) 14.11 (10.0%) 3.4% ( -14% - 24%) 0.264
TermDTSort 61.68 (15.7%) 64.16 (9.1%) 4.0% ( -18% - 34%) 0.324
HighSpanNear 8.32 (2.6%) 8.72 (3.1%) 4.7% ( 0% - 10%) 0.000
LowSpanNear 10.98 (2.1%) 11.54 (2.2%) 5.1% ( 0% - 9%) 0.000
MedSpanNear 8.34 (2.0%) 8.77 (2.2%) 5.2% ( 0% - 9%) 0.000
BrowseMonthTaxoFacets 1.02 (3.5%) 1.13 (13.6%) 10.1% ( -6% - 28%) 0.001
BrowseDayOfYearTaxoFacets 1.00 (4.6%) 1.11 (14.4%) 10.5% ( -8% - 30%) 0.002
BrowseDateTaxoFacets 1.00 (4.6%) 1.11 (14.4%) 10.7% ( -8% - 31%) 0.002
CPU merged search profile for my_modified_version:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError -Xbatch -cp /home/thetaphi/benchmark/lucene_candidate/buildSrc/build/classes/java/main -Dtests.profile.mode=cpu -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-6.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-3.jfr
Took 4.04 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 1544656 events (total: 1M)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
15.33% 236729 org.apache.lucene.util.packed.DirectMonotonicReader#get()
9.95% 153617 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
7.58% 117056 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$15#binaryValue()
5.42% 83667 org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
4.47% 68989 jdk.internal.misc.ScopedMemoryAccess#getShortUnalignedInternal()
4.43% 68353 jdk.internal.foreign.Utils#filterSegment()
2.49% 38393 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
1.98% 30594 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBoundsSmall()
1.91% 29460 org.apache.lucene.index.SingletonSortedSetDocValues#nextDoc()
1.82% 28104 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
1.73% 26738 jdk.internal.foreign.AbstractMemorySegmentImpl#isSet()
1.72% 26516 sun.misc.Unsafe#copyMemory()
1.63% 25197 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#nextDoc()
1.55% 23880 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
1.42% 21908 jdk.internal.util.Preconditions#checkFromIndexSize()
1.31% 20268 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
1.10% 16954 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advance()
1.00% 15398 org.apache.lucene.search.ConjunctionDISI#doNext()
0.94% 14473 java.lang.invoke.VarHandleGuards#guard_LJ_I()
0.90% 13916 jdk.internal.foreign.SharedScope#checkValidState()
0.83% 12820 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
0.78% 12064 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
0.73% 11224 org.apache.lucene.index.SingletonSortedSetDocValues#nextOrd()
0.72% 11055 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
0.71% 10950 org.apache.lucene.store.DataInput#readVInt()
0.69% 10643 org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#seek()
0.69% 10612 org.apache.lucene.queries.intervals.OrderedIntervalsSource$OrderedIntervalIterator#nextInterval()
0.61% 9380 jdk.internal.foreign.AbstractMemorySegmentImpl#scope()
0.57% 8825 org.apache.lucene.util.BitSet#or()
0.57% 8747 java.util.Objects#checkIndex()
CPU merged search profile for baseline:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError -Xbatch -cp /home/thetaphi/benchmark/lucene_baseline/buildSrc/build/classes/java/main -Dtests.profile.mode=cpu -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-3.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-6.jfr
Took 4.24 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 1712367 events (total: 1M)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
12.99% 222468 org.apache.lucene.util.packed.DirectMonotonicReader#get()
9.56% 163732 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
5.78% 99057 java.nio.ByteBuffer#getArray()
5.05% 86545 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$15#binaryValue()
4.12% 70481 org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
3.96% 67813 java.nio.Buffer#scope()
3.87% 66312 jdk.internal.misc.Unsafe#convEndian()
3.81% 65189 org.apache.lucene.store.ByteBufferGuard#ensureValid()
3.48% 59539 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
2.30% 39343 org.apache.lucene.store.ByteBufferGuard#getBytes()
2.14% 36573 java.nio.ByteBuffer#get()
1.88% 32278 org.apache.lucene.index.SingletonSortedSetDocValues#nextDoc()
1.67% 28624 java.nio.Buffer#checkIndex()
1.53% 26222 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#nextDoc()
1.22% 20948 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
1.19% 20403 org.apache.lucene.store.ByteBufferGuard#getShort()
1.03% 17553 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
1.01% 17218 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advance()
0.94% 16167 java.nio.Buffer#position()
0.89% 15218 org.apache.lucene.store.ByteBufferIndexInput#readBytes()
0.88% 15123 org.apache.lucene.search.ConjunctionDISI#doNext()
0.76% 12938 org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl#seek()
0.75% 12878 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$19#ordValue()
0.75% 12813 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
0.73% 12537 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
0.69% 11839 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
0.69% 11738 org.apache.lucene.store.DataInput#readVInt()
0.65% 11070 org.apache.lucene.queries.intervals.OrderedIntervalsSource$OrderedIntervalIterator#nextInterval()
0.62% 10575 jdk.internal.util.Preconditions#checkFromIndexSize()
0.59% 10022 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
HEAP merged search profile for my_modified_version:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError -Xbatch -cp /home/thetaphi/benchmark/lucene_candidate/buildSrc/build/classes/java/main -Dtests.profile.mode=heap -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-6.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-3.jfr
Took 2.67 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 68380 events (total: 25828M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
18.44% 4761M org.apache.lucene.util.FixedBitSet#<init>()
7.55% 1948M java.util.AbstractList#iterator()
5.46% 1409M org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
5.21% 1345M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
4.55% 1174M org.apache.lucene.util.BytesRef#<init>()
4.14% 1068M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
3.95% 1020M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
3.75% 967M org.apache.lucene.util.ArrayUtil#growExact()
3.13% 808M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
2.90% 749M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
2.07% 534M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
1.90% 492M java.util.ArrayList#grow()
1.78% 460M jdk.internal.misc.Unsafe#allocateUninitializedArray()
1.43% 369M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
1.32% 340M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
1.31% 337M java.util.AbstractList#listIterator()
1.27% 328M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
1.23% 317M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
1.13% 290M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
1.11% 287M java.util.ArrayList#iterator()
1.09% 282M org.apache.lucene.codecs.lucene90.ForUtil#<init>()
1.07% 276M org.apache.lucene.queryparser.charstream.FastCharStream#GetImage()
1.06% 274M org.apache.lucene.store.MemorySegmentIndexInput#buildSlice()
1.01% 259M java.util.Arrays#asList()
0.97% 250M java.util.Arrays#copyOf()
0.96% 248M org.apache.lucene.util.PriorityQueue#<init>()
0.91% 233M org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
0.87% 223M org.apache.lucene.queryparser.classic.Token#newToken()
0.86% 222M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#<init>()
0.84% 216M jdk.internal.foreign.MappedMemorySegmentImpl#dup()
HEAP merged search profile for baseline:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError -Xbatch -cp /home/thetaphi/benchmark/lucene_baseline/buildSrc/build/classes/java/main -Dtests.profile.mode=heap -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-3.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-6.jfr
Took 2.58 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 69795 events (total: 26355M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
17.98% 4739M org.apache.lucene.util.FixedBitSet#<init>()
7.47% 1968M java.util.AbstractList#iterator()
5.16% 1359M org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
4.97% 1309M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
4.44% 1169M org.apache.lucene.util.BytesRef#<init>()
4.08% 1074M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
3.91% 1030M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
3.38% 892M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
3.36% 885M org.apache.lucene.util.ArrayUtil#growExact()
2.81% 741M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
2.08% 548M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
2.00% 528M java.util.ArrayList#grow()
1.82% 479M jdk.internal.misc.Unsafe#allocateUninitializedArray()
1.58% 415M java.nio.DirectByteBufferR#duplicate()
1.43% 376M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
1.41% 370M java.util.AbstractList#listIterator()
1.33% 350M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
1.26% 330M org.apache.lucene.queryparser.charstream.FastCharStream#GetImage()
1.17% 307M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
1.16% 306M org.apache.lucene.codecs.lucene90.ForUtil#<init>()
1.14% 299M java.util.ArrayList#iterator()
1.12% 295M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
1.00% 263M java.nio.DirectByteBufferR#slice()
0.98% 257M java.util.Arrays#asList()
0.97% 254M org.apache.lucene.util.PriorityQueue#<init>()
0.91% 239M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
0.88% 231M org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
0.73% 192M java.nio.DirectByteBufferR#asLongBuffer()
0.70% 184M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#<init>()
0.68% 179M org.apache.lucene.store.ByteBufferIndexInput#newCloneInstance()
If you comment out the unsafe code and use the "official MemorySegment API", the whole thing gets crazy:
- runtime of one bench run of modified code on my machine goes up from 57 seconds to 75 seconds, while otherwise (Unsafe) staying identical to baseline
- in addition it produces a lot of garbage, heap dump contains many
HeapMemorySegmentImpl$OfByteclasses (that are the wrappers aroundbyte[]when viewed as MemorySegment. Every wrapping produces a new instance which is not catched by escape analysis. This slows down! The heap dump has 50% of heap filled with those objects according to JFR.
I will report this problem to project Panama! @mcimadamore
Here is the results of the original pull request without the unsafe memory copy, using the fromArray() and slicing code. During runtime, the benchmark runs 72 seconds vs 60 seconds. The 12 seconds are mostly caused by garbage collection and allocations.
At the end of the report you see an overview of the heap allocations seen by JFR:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
BrowseMonthTaxoFacets 1.01 (5.3%) 0.71 (5.9%) -29.7% ( -38% - -19%) 0.000
BrowseDateTaxoFacets 0.98 (7.0%) 0.70 (5.9%) -28.7% ( -38% - -17%) 0.000
BrowseDayOfYearTaxoFacets 0.98 (7.0%) 0.70 (5.9%) -28.7% ( -38% - -16%) 0.000
Wildcard 19.36 (9.9%) 16.30 (6.8%) -15.8% ( -29% - 0%) 0.000
AndHighLow 273.88 (3.5%) 240.26 (2.4%) -12.3% ( -17% - -6%) 0.000
LowSloppyPhrase 9.80 (4.1%) 8.85 (4.3%) -9.7% ( -17% - -1%) 0.000
MedSloppyPhrase 6.82 (5.2%) 6.24 (5.3%) -8.4% ( -18% - 2%) 0.000
OrHighLow 220.56 (3.2%) 202.29 (3.1%) -8.3% ( -14% - -2%) 0.000
Respell 54.65 (1.4%) 50.44 (1.5%) -7.7% ( -10% - -4%) 0.000
HighSloppyPhrase 3.57 (4.0%) 3.31 (4.8%) -7.3% ( -15% - 1%) 0.000
AndHighMed 116.21 (4.8%) 107.99 (3.6%) -7.1% ( -14% - 1%) 0.000
PKLookup 188.31 (2.6%) 175.71 (2.3%) -6.7% ( -11% - -1%) 0.000
AndHighHigh 26.44 (4.9%) 24.69 (4.2%) -6.6% ( -15% - 2%) 0.000
LowSpanNear 26.13 (1.9%) 24.55 (2.5%) -6.1% ( -10% - -1%) 0.000
OrHighMed 58.52 (4.3%) 55.04 (4.0%) -5.9% ( -13% - 2%) 0.000
MedSpanNear 16.21 (2.1%) 15.27 (3.6%) -5.8% ( -11% - 0%) 0.000
Fuzzy2 62.90 (10.2%) 59.78 (13.1%) -5.0% ( -25% - 20%) 0.182
Fuzzy1 47.53 (5.7%) 45.81 (4.7%) -3.6% ( -13% - 7%) 0.027
BrowseMonthSSDVFacets 4.13 (5.4%) 3.99 (6.5%) -3.3% ( -14% - 9%) 0.079
TermDTSort 62.99 (15.3%) 60.93 (13.8%) -3.3% ( -28% - 30%) 0.476
BrowseDayOfYearSSDVFacets 4.02 (4.2%) 3.89 (5.5%) -3.2% ( -12% - 6%) 0.036
OrHighHigh 17.99 (5.2%) 17.41 (5.4%) -3.2% ( -13% - 7%) 0.055
HighPhrase 15.57 (3.6%) 15.08 (3.0%) -3.1% ( -9% - 3%) 0.003
HighTermTitleBDVSort 58.97 (19.8%) 57.13 (21.8%) -3.1% ( -37% - 48%) 0.636
HighSpanNear 4.84 (2.1%) 4.69 (1.5%) -3.1% ( -6% - 0%) 0.000
LowPhrase 6.95 (4.7%) 6.76 (3.1%) -2.7% ( -10% - 5%) 0.031
HighTermDayOfYearSort 79.36 (13.6%) 77.50 (16.1%) -2.3% ( -28% - 31%) 0.618
HighIntervalsOrdered 6.10 (3.7%) 5.96 (4.5%) -2.3% ( -10% - 6%) 0.079
MedPhrase 20.71 (3.9%) 20.50 (2.6%) -1.0% ( -7% - 5%) 0.326
HighTermMonthSort 73.31 (12.2%) 72.61 (15.6%) -1.0% ( -25% - 30%) 0.828
OrNotHighLow 659.29 (3.6%) 661.62 (4.6%) 0.4% ( -7% - 8%) 0.785
Prefix3 56.18 (6.0%) 56.64 (7.3%) 0.8% ( -11% - 15%) 0.699
OrNotHighHigh 517.12 (4.0%) 525.07 (3.9%) 1.5% ( -6% - 9%) 0.217
OrHighNotHigh 588.54 (4.1%) 612.06 (4.6%) 4.0% ( -4% - 13%) 0.004
LowTerm 1337.81 (3.5%) 1404.26 (3.4%) 5.0% ( -1% - 12%) 0.000
OrHighNotMed 738.42 (3.2%) 787.11 (4.6%) 6.6% ( -1% - 14%) 0.000
OrNotHighMed 741.56 (5.8%) 790.89 (4.0%) 6.7% ( -2% - 17%) 0.000
MedTerm 1707.00 (3.8%) 1831.49 (4.5%) 7.3% ( 0% - 16%) 0.000
OrHighNotLow 843.88 (4.5%) 906.51 (4.6%) 7.4% ( -1% - 17%) 0.000
HighTerm 1929.94 (6.5%) 2137.55 (5.9%) 10.8% ( -1% - 24%) 0.000
IntNRQ 268.71 (6.0%) 317.78 (6.2%) 18.3% ( 5% - 32%) 0.000
CPU merged search profile for my_modified_version:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError -Xbatch -cp /home/thetaphi/benchmark/lucene_candidate/buildSrc/build/classes/java/main -Dtests.profile.mode=cpu -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-6.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-3.jfr
Took 6.45 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 2155605 events (total: 2M)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
10.30% 221932 org.apache.lucene.util.packed.DirectMonotonicReader#get()
9.54% 205542 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
5.85% 126158 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$15#binaryValue()
3.90% 84032 jdk.internal.foreign.AbstractMemorySegmentImpl#<init>()
3.89% 83875 java.lang.Class#getComponentType()
3.59% 77332 org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
3.43% 74043 jdk.internal.misc.Unsafe#checkPrimitiveArray()
3.04% 65576 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()
2.84% 61133 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBoundsSmall()
2.81% 60625 jdk.internal.misc.ScopedMemoryAccess#getShortUnalignedInternal()
2.59% 55777 jdk.internal.foreign.MappedMemorySegmentImpl#dup()
2.50% 53863 org.apache.lucene.store.MemorySegmentIndexInput#readBytes()
2.36% 50894 jdk.internal.foreign.Utils#filterSegment()
2.32% 49940 jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#dup()
2.15% 46330 jdk.internal.foreign.AbstractMemorySegmentImpl#copyFrom()
1.92% 41393 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
1.65% 35535 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advance()
1.27% 27288 jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#<init>()
1.20% 25922 org.apache.lucene.index.SingletonSortedSetDocValues#nextDoc()
1.04% 22431 jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#fromArray()
0.98% 21203 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
0.94% 20165 jdk.internal.foreign.SharedScope#checkValidState()
0.92% 19927 jdk.internal.misc.ScopedMemoryAccess#copyMemoryInternal()
0.90% 19325 org.apache.lucene.queries.spans.SpanScorer#setFreqCurrentDoc()
0.85% 18220 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
0.79% 16981 org.apache.lucene.search.ConjunctionDISI#doNext()
0.69% 14907 org.apache.lucene.search.PhraseScorer$1#matches()
0.68% 14599 org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
0.67% 14537 org.apache.lucene.index.SingletonSortedSetDocValues#nextOrd()
0.67% 14451 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
CPU merged search profile for baseline:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError -Xbatch -cp /home/thetaphi/benchmark/lucene_baseline/buildSrc/build/classes/java/main -Dtests.profile.mode=cpu -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-3.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-6.jfr
Took 4.24 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 1752341 events (total: 1M)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
13.25% 232242 org.apache.lucene.util.packed.DirectMonotonicReader#get()
8.95% 156849 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
7.49% 131320 java.nio.ByteBuffer#getArray()
4.91% 85990 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$15#binaryValue()
3.95% 69304 jdk.internal.misc.Unsafe#convEndian()
3.79% 66477 org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
3.31% 57980 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
3.20% 56142 org.apache.lucene.store.ByteBufferGuard#ensureValid()
2.91% 51021 java.nio.Buffer#scope()
2.16% 37770 java.nio.ByteBuffer#get()
2.01% 35217 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#advance()
1.71% 29978 org.apache.lucene.store.ByteBufferGuard#getBytes()
1.70% 29781 java.nio.Buffer#checkIndex()
1.55% 27212 org.apache.lucene.index.SingletonSortedSetDocValues#nextDoc()
1.21% 21264 org.apache.lucene.store.ByteBufferIndexInput#readBytes()
1.21% 21134 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.11% 19411 org.apache.lucene.store.ByteBufferGuard#getShort()
1.09% 19159 org.apache.lucene.queries.spans.SpanScorer#setFreqCurrentDoc()
1.02% 17794 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
1.01% 17766 java.nio.Buffer#position()
0.98% 17108 org.apache.lucene.index.SingletonSortedSetDocValues#nextOrd()
0.95% 16667 org.apache.lucene.search.PhraseScorer$1#matches()
0.87% 15251 org.apache.lucene.search.ConjunctionDISI#doNext()
0.86% 15055 org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
0.77% 13498 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
0.75% 13162 org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl#seek()
0.70% 12317 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
0.68% 11907 org.apache.lucene.codecs.lucene90.ForUtil#expand8()
0.63% 11117 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$19#ordValue()
0.62% 10813 org.apache.lucene.util.PriorityQueue#upHeap()
HEAP merged search profile for my_modified_version:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError -Xbatch -cp /home/thetaphi/benchmark/lucene_candidate/buildSrc/build/classes/java/main -Dtests.profile.mode=heap -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-6.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-3.jfr
Took 18.34 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 14415681 events (total: 5004408M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
55.11% 2757776M jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#fromArray()
22.41% 1121632M jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#dup()
20.85% 1043452M jdk.internal.foreign.MappedMemorySegmentImpl#dup()
0.71% 35428M jdk.internal.foreign.HeapMemorySegmentImpl$OfLong#dup()
0.18% 9206M jdk.internal.foreign.HeapMemorySegmentImpl$OfLong#fromArray()
0.12% 6150M org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
0.09% 4729M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
0.09% 4620M org.apache.lucene.util.FixedBitSet#<init>()
0.06% 3099M java.util.AbstractList#iterator()
0.04% 1773M java.util.ArrayList#grow()
0.03% 1360M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
0.02% 1090M java.util.ArrayList#iterator()
0.02% 1085M org.apache.lucene.util.PriorityQueue#<init>()
0.02% 908M org.apache.lucene.util.ArrayUtil#growExact()
0.02% 796M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
0.02% 787M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
0.01% 577M org.apache.lucene.util.BytesRef#<init>()
0.01% 537M jdk.internal.misc.Unsafe#allocateUninitializedArray()
0.01% 517M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
0.01% 394M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
0.01% 370M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
0.01% 366M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
0.01% 352M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
0.01% 349M org.apache.lucene.queryparser.charstream.FastCharStream#GetImage()
0.01% 322M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
0.01% 317M org.apache.lucene.codecs.lucene90.ForUtil#<init>()
0.01% 295M org.apache.lucene.store.MemorySegmentIndexInput#buildSlice()
0.01% 261M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
0.00% 240M org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
0.00% 237M java.util.Arrays#copyOfRange()
HEAP merged search profile for baseline:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:-TieredCompilation -XX:+HeapDumpOnOutOfMemoryError -Xbatch -cp /home/thetaphi/benchmark/lucene_baseline/buildSrc/build/classes/java/main -Dtests.profile.mode=heap -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-3.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-6.jfr
Took 2.55 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 103186 events (total: 37912M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
15.31% 5806M org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
12.65% 4796M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
12.47% 4726M org.apache.lucene.util.FixedBitSet#<init>()
8.56% 3246M java.util.AbstractList#iterator()
5.25% 1990M java.util.ArrayList#grow()
3.74% 1419M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
2.92% 1108M org.apache.lucene.util.PriorityQueue#<init>()
2.59% 981M java.util.ArrayList#iterator()
2.35% 890M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
2.32% 879M org.apache.lucene.util.ArrayUtil#growExact()
2.08% 789M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
1.64% 621M org.apache.lucene.util.BytesRef#<init>()
1.40% 530M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
1.22% 461M jdk.internal.misc.Unsafe#allocateUninitializedArray()
1.21% 458M java.nio.DirectByteBufferR#duplicate()
1.14% 433M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
1.03% 388M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
1.00% 377M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
0.92% 349M org.apache.lucene.codecs.lucene90.ForUtil#<init>()
0.87% 330M org.apache.lucene.queryparser.charstream.FastCharStream#GetImage()
0.87% 328M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
0.83% 316M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
0.70% 265M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
0.63% 240M org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
0.60% 229M java.nio.DirectByteBufferR#slice()
0.52% 198M org.apache.lucene.store.ByteBufferIndexInput#newCloneInstance()
0.51% 193M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#<init>()
0.48% 180M java.util.AbstractList#listIterator()
0.47% 179M org.apache.lucene.codecs.lucene90.Lucene90ScoreSkipReader#readImpacts()
0.47% 178M java.nio.DirectByteBufferR#asLongBuffer()
I also ran the same with tiered compilation turned on and no -Xbatch (java defaults). The results are much better, but still the heap allocations are done.
For long-running benchmarks with modern VMs, switching off tiered and enabling batch mode is a bad idea, as it is different from real world. Sure, results are more predictable, but they are noisy anyways (for different reasons, mainly because we don't use JMH). In addition, as soon as you use MethodHandles or VarHandles (as we do here), in addition to Hotspot's low level optimizations, the Java runtime also does lambda transformations of the MethodHandles and VarHarndles at runtime (it rewrites byte code), which cannot be controlled by the Java command line. All code using java Lambdas, dynamic invokes, MethodHandles, VarHandles (like this one) only work well after long runtime. So the benchmarking will stay noisy unless we use JMH. This got worse recently as Lucene code uses more and more Lambdas.
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
BrowseMonthTaxoFacets 1.06 (9.0%) 0.92 (7.1%) -13.3% ( -26% - 3%) 0.000
BrowseDateTaxoFacets 1.03 (9.5%) 0.91 (8.0%) -11.8% ( -26% - 6%) 0.000
BrowseDayOfYearTaxoFacets 1.03 (9.7%) 0.91 (8.0%) -11.8% ( -26% - 6%) 0.000
Respell 54.63 (1.8%) 51.37 (1.2%) -6.0% ( -8% - -3%) 0.000
PKLookup 206.57 (2.7%) 196.05 (2.3%) -5.1% ( -9% - 0%) 0.000
Fuzzy1 83.20 (1.4%) 78.97 (1.5%) -5.1% ( -7% - -2%) 0.000
HighTermDayOfYearSort 10.39 (17.7%) 9.95 (14.5%) -4.3% ( -30% - 33%) 0.399
Fuzzy2 50.01 (2.7%) 48.48 (2.3%) -3.1% ( -7% - 1%) 0.000
OrHighMed 54.71 (3.6%) 53.16 (4.0%) -2.8% ( -10% - 4%) 0.019
AndHighLow 927.79 (2.8%) 904.39 (5.4%) -2.5% ( -10% - 5%) 0.063
OrHighLow 457.95 (2.2%) 446.92 (4.6%) -2.4% ( -9% - 4%) 0.036
LowSloppyPhrase 35.92 (4.1%) 35.13 (3.4%) -2.2% ( -9% - 5%) 0.066
Prefix3 22.95 (12.6%) 22.46 (11.5%) -2.2% ( -23% - 25%) 0.571
BrowseMonthSSDVFacets 4.00 (1.4%) 3.91 (3.1%) -2.1% ( -6% - 2%) 0.006
HighTermMonthSort 30.76 (21.3%) 30.25 (16.2%) -1.7% ( -32% - 45%) 0.781
AndHighHigh 40.44 (5.4%) 39.84 (3.0%) -1.5% ( -9% - 7%) 0.280
HighSloppyPhrase 14.55 (4.8%) 14.35 (3.8%) -1.3% ( -9% - 7%) 0.334
HighTermTitleBDVSort 12.27 (22.5%) 12.13 (26.8%) -1.1% ( -41% - 62%) 0.884
MedSloppyPhrase 22.93 (4.1%) 22.70 (3.3%) -1.0% ( -8% - 6%) 0.396
MedSpanNear 14.88 (2.8%) 14.73 (2.6%) -1.0% ( -6% - 4%) 0.250
OrHighHigh 12.71 (4.6%) 12.61 (5.5%) -0.8% ( -10% - 9%) 0.615
AndHighMed 57.17 (5.9%) 56.80 (3.1%) -0.6% ( -9% - 8%) 0.669
TermDTSort 67.79 (9.2%) 67.44 (10.9%) -0.5% ( -18% - 21%) 0.872
BrowseDayOfYearSSDVFacets 3.88 (1.7%) 3.86 (3.0%) -0.5% ( -5% - 4%) 0.544
HighSpanNear 3.60 (3.7%) 3.60 (3.4%) -0.1% ( -6% - 7%) 0.936
MedPhrase 245.29 (2.5%) 249.61 (2.4%) 1.8% ( -3% - 6%) 0.023
LowSpanNear 246.46 (4.2%) 252.99 (3.1%) 2.7% ( -4% - 10%) 0.023
HighIntervalsOrdered 10.07 (6.1%) 10.43 (5.9%) 3.6% ( -7% - 16%) 0.059
OrNotHighLow 1014.03 (3.2%) 1054.08 (3.3%) 3.9% ( -2% - 10%) 0.000
OrNotHighMed 832.53 (2.7%) 867.52 (2.9%) 4.2% ( -1% - 10%) 0.000
Wildcard 59.59 (23.1%) 62.30 (18.1%) 4.5% ( -29% - 59%) 0.489
HighTerm 1310.56 (6.1%) 1370.47 (5.0%) 4.6% ( -6% - 16%) 0.010
LowPhrase 177.09 (2.1%) 185.99 (2.8%) 5.0% ( 0% - 10%) 0.000
MedTerm 1657.72 (6.6%) 1756.16 (4.5%) 5.9% ( -4% - 18%) 0.001
LowTerm 1665.17 (3.6%) 1764.34 (3.6%) 6.0% ( -1% - 13%) 0.000
OrHighNotHigh 928.06 (5.2%) 983.69 (4.7%) 6.0% ( -3% - 16%) 0.000
HighPhrase 278.47 (2.5%) 295.20 (3.7%) 6.0% ( 0% - 12%) 0.000
OrNotHighHigh 945.96 (5.0%) 1008.94 (4.1%) 6.7% ( -2% - 16%) 0.000
OrHighNotLow 963.35 (4.9%) 1035.97 (4.7%) 7.5% ( -1% - 18%) 0.000
OrHighNotMed 1203.88 (3.9%) 1316.54 (4.3%) 9.4% ( 1% - 18%) 0.000
IntNRQ 165.56 (7.0%) 181.66 (14.1%) 9.7% ( -10% - 33%) 0.006
CPU merged search profile for my_modified_version:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -cp /home/thetaphi/benchmark/lucene_candidate/buildSrc/build/classes/java/main -Dtests.profile.mode=cpu -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-6.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-3.jfr
Took 3.09 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 1507032 events (total: 1M)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
12.74% 191932 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$15#binaryValue()
12.50% 188370 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
10.97% 165315 org.apache.lucene.util.packed.DirectMonotonicReader#get()
6.06% 91320 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
4.75% 71593 org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
2.97% 44706 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()
2.67% 40194 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBoundsSmall()
2.30% 34656 jdk.internal.misc.ScopedMemoryAccess#getShortUnalignedInternal()
1.83% 27580 jdk.internal.foreign.AbstractMemorySegmentImpl#isSet()
1.78% 26834 jdk.internal.foreign.AbstractMemorySegmentImpl#<init>()
1.67% 25130 org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#seek()
1.50% 22648 org.apache.lucene.store.MemorySegmentIndexInput#readBytes()
1.28% 19257 jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#fromArray()
1.17% 17642 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
1.15% 17288 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
1.10% 16524 java.lang.invoke.VarHandleGuards#guard_LJ_I()
1.06% 16019 jdk.internal.foreign.HeapMemorySegmentImpl#<init>()
1.03% 15456 org.apache.lucene.store.MemorySegmentIndexInput#ensureOpen()
0.84% 12628 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
0.81% 12232 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
0.77% 11643 jdk.internal.foreign.AbstractMemorySegmentImpl#asSlice()
0.72% 10884 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#skipPositions()
0.70% 10522 jdk.internal.foreign.SharedScope#checkValidState()
0.60% 8991 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
0.59% 8844 org.apache.lucene.store.MemorySegmentIndexInput#readByte()
0.57% 8581 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
0.56% 8393 org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
0.56% 8393 jdk.internal.foreign.AbstractMemorySegmentImpl#isSmall()
0.55% 8343 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$DenseBinaryDocValues#nextDoc()
0.53% 7941 java.util.Objects#checkIndex()
CPU merged search profile for baseline:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -cp /home/thetaphi/benchmark/lucene_baseline/buildSrc/build/classes/java/main -Dtests.profile.mode=cpu -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-3.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-6.jfr
Took 2.81 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 1429487 events (total: 1M)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
11.27% 161081 org.apache.lucene.util.packed.DirectMonotonicReader#get()
9.67% 138204 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
8.35% 119320 java.nio.ByteBuffer#getArray()
7.91% 113143 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$15#binaryValue()
6.74% 96408 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
5.02% 71811 org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
3.79% 54165 jdk.internal.misc.Unsafe#convEndian()
2.35% 33565 java.nio.Buffer#position()
2.19% 31296 org.apache.lucene.store.ByteBufferIndexInput#readBytes()
2.00% 28556 jdk.internal.util.Preconditions#checkFromIndexSize()
1.34% 19210 org.apache.lucene.store.ByteBufferGuard#getShort()
1.27% 18218 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
1.17% 16750 java.nio.Buffer#checkIndex()
1.14% 16264 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
1.14% 16239 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$DenseBinaryDocValues#nextDoc()
0.94% 13392 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
0.84% 11986 java.nio.DirectByteBuffer#ix()
0.83% 11804 java.util.ArrayList$Itr#hasNext()
0.81% 11549 org.apache.lucene.store.ByteBufferGuard#ensureValid()
0.78% 11147 java.nio.Buffer#scope()
0.74% 10638 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
0.73% 10480 org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl#seek()
0.65% 9283 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#skipPositions()
0.60% 8511 jdk.internal.misc.ScopedMemoryAccess#copyMemory()
0.59% 8453 org.apache.lucene.search.ConjunctionDISI#doNext()
0.57% 8125 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
0.51% 7320 org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
0.50% 7206 org.apache.lucene.store.ByteBufferGuard#getByte()
0.48% 6832 org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
0.47% 6765 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$14#binaryValue()
HEAP merged search profile for my_modified_version:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -cp /home/thetaphi/benchmark/lucene_candidate/buildSrc/build/classes/java/main -Dtests.profile.mode=heap -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-6.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-my_modified_version-3.jfr
Took 5.80 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 4729188 events (total: 1646537M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
98.51% 1621948M jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#fromArray()
0.28% 4680M org.apache.lucene.util.FixedBitSet#<init>()
0.09% 1533M java.util.AbstractList#iterator()
0.08% 1371M jdk.internal.foreign.MappedMemorySegmentImpl#dup()
0.08% 1308M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
0.06% 932M org.apache.lucene.util.ArrayUtil#growExact()
0.04% 667M org.apache.lucene.util.BytesRef#<init>()
0.04% 659M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
0.04% 622M jdk.internal.foreign.HeapMemorySegmentImpl$OfLong#dup()
0.04% 597M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
0.03% 568M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
0.03% 524M org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
0.03% 470M jdk.internal.misc.Unsafe#allocateUninitializedArray()
0.03% 467M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
0.03% 438M jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#dup()
0.02% 411M org.apache.lucene.queryparser.charstream.FastCharStream#GetImage()
0.02% 397M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
0.02% 388M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
0.02% 381M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
0.02% 367M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
0.02% 307M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
0.02% 282M org.apache.lucene.codecs.lucene90.ForUtil#<init>()
0.02% 282M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
0.02% 274M org.apache.lucene.store.MemorySegmentIndexInput#buildSlice()
0.02% 252M java.util.AbstractList#listIterator()
0.02% 250M java.util.ArrayList#iterator()
0.01% 240M java.util.ArrayList#grow()
0.01% 216M java.util.Arrays#copyOfRange()
0.01% 206M org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
0.01% 199M java.util.Arrays#asList()
HEAP merged search profile for baseline:
JFR aggregation command: /home/jenkins/tools/java/64bit/latest-jdk17/bin/java --add-modules jdk.incubator.foreign -server -Xms2g -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -cp /home/thetaphi/benchmark/lucene_baseline/buildSrc/build/classes/java/main -Dtests.profile.mode=heap -Dtests.profile.stacksize=1 -Dtests.profile.count=30 org.apache.lucene.gradle.ProfileResults /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-3.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-17.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-16.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-13.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-12.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-5.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-18.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-9.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-1.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-19.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-4.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-2.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-8.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-11.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-10.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-14.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-15.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-0.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-7.jfr /home/thetaphi/benchmark/util/bench-search-baseline_vs_patch-baseline-6.jfr
Took 1.32 seconds
WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 62873 events (total: 23297M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
20.54% 4785M org.apache.lucene.util.FixedBitSet#<init>()
6.85% 1596M java.util.AbstractList#iterator()
5.78% 1346M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
4.29% 999M org.apache.lucene.util.ArrayUtil#growExact()
3.07% 714M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
2.84% 660M org.apache.lucene.util.BytesRef#<init>()
2.78% 648M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
2.51% 585M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
2.30% 536M org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
2.03% 473M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
1.98% 462M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
1.92% 448M jdk.internal.misc.Unsafe#allocateUninitializedArray()
1.91% 444M org.apache.lucene.queryparser.charstream.FastCharStream#GetImage()
1.89% 441M java.nio.DirectByteBufferR#duplicate()
1.72% 399M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
1.60% 372M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
1.53% 355M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
1.22% 284M org.apache.lucene.codecs.lucene90.ForUtil#<init>()
1.15% 268M java.util.ArrayList#iterator()
1.10% 257M java.util.ArrayList#grow()
1.08% 250M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
1.07% 249M java.util.AbstractList#listIterator()
1.03% 239M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
1.00% 234M java.nio.DirectByteBufferR#slice()
0.92% 214M org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
0.88% 205M java.util.Arrays#asList()
0.86% 201M java.util.Arrays#copyOf()
0.81% 189M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#<init>()
0.81% 187M org.apache.lucene.store.ByteBufferIndexInput#newCloneInstance()
0.77% 179M java.nio.DirectByteBufferR#asLongBuffer()
I opened https://bugs.openjdk.java.net/browse/JDK-8268743 about the object allocations.
The issue is confirmed and for the readBytes() code there's already a workaround. Long term we will improve to have this for all array types. See discussion here: https://github.com/openjdk/panama-foreign/pull/560
For the float and long options: copyMemory() has some overhead for small array sizes, so the recommendation by Panama and Hotspot engineers is to remove the specialization from Lucene at all. readLongs() will only read a maximum of 64 longs, a loop with removed bounds checks will do much better here than an explicit memory copy.
For vectors I have the same feeling, but there we should go in direction of removing readFloats() alltogether and replace by a method returning a FloatVector() view instead.
In the next Panama iteration, there will also be a ready-to use copy method, which has same shape as methods added for bulk copy: https://github.com/openjdk/panama-foreign/pull/555
There is no fix for the JDK 17 issues with garbage collection and disabled tiered compilation. From Java 18 on we can use the new MemoryCopy class to do bulk copies. It also supports byte swap to adjust endianness: https://github.com/openjdk/panama-foreign/pull/555
The results look promising. I have a branch using that available already, will create a new pull request to record the iterations.
Still we need to investiage if for the readLongs() and readFloats() the loop is better, as the array sizes are really small and the bulk copy overhead is large. I can see t already in the benchmarks.
@mikemccand It would be better to have the tool respect
RUNTIME_JAVA_HOMElike gradlew does.
+1, that sounds like a good idea! Maybe open an issue in luceneutil directly?
Hi @mikemccand, I will open a few more issues in luceneutil. The way how it invokes the JVM is not like it is done in production:
-Xbatchas default doesn't help for reproducibility and makes the results problematic for production use, as unrealistic.-XX:-TieredCompilationas default is only useful to run tests (so we don't spend too much time for recompile), but as it is default in modern JDKS since Java 8, it should really be swritched on for benchmarks. Benchmarks run long time and they get real speed boosts by this. While testing this pull request, disabling tiered compilation made results 20% worse because it prevented escape analysis from kicking in early enough.
With modern JVMs, doing batch compilation brings nothing, as a lot of stuff is optimized now also on the Java side, not just in Hotspot. E.g., when a lambda or functional method reference is used with stream() API, they are also optimized at runtime outside hotspot by rewriting it to temporary classes and optimized byte code (lambda trasformations, invokedynamic).
So in general the benchmark should mimic real-live. You are repeating the runs several times anyways, so changing the optimizations to be non-default is not a good idea, because your settings do not have less fluctuation don't help, they slow down!
I have no idea why the facetting stuff at the beginning of the bench output is so badly behaving with MMapDirectory#v2 on top of project panama, I'll ignore this for now. Maybe @mikemccand has an idea! The rest looks perfectly fine to me from the performance (19 runs).
Yeah that is strange. The impact is sizable. These tasks effectively run a MatchAllDocsQuery, and for each hit (doc) decodes the packed delta-coded int ordinals from a BinaryDocValues field, incrementing counters in a big int[]. To the MMapDirectory it should look like it's doing a big sequential scan of that BinaryDocValues field for each segment. Surely this should be easy for mmap ;) Maybe we should run profiler ... maybe something silly is happening, like we are randomly seeking on every hit unnecessarily ...
So in general the benchmark should mimic real-live.
+1 -- nightly benchmarks already overrides this (poor) default, so let's fix constants.py to not pass -Xbatch nor -XX:-TieredCompilation anymore!
So the benchmarking will stay noisy unless we use JMH.
Maybe we should switch luceneutil to JMH? Or, poach ideas from it? How is it reducing JVM/hotspot noise without passing unrealistic JVM flags :)
Hi @mikemccand,
I have no idea why the facetting stuff at the beginning of the bench output is so badly behaving with MMapDirectory#v2 on top of project panama, I'll ignore this for now. Maybe @mikemccand has an idea! The rest looks perfectly fine to me from the performance (19 runs).
Yeah that is strange. The impact is sizable. These tasks effectively run a
MatchAllDocsQuery, and for each hit (doc) decodes the packed delta-coded int ordinals from aBinaryDocValuesfield, incrementing counters in a bigint[]. To theMMapDirectoryit should look like it's doing a big sequential scan of thatBinaryDocValuesfield for each segment. Surely this should be easy formmap;) Maybe we should run profiler ... maybe something silly is happening, like we are randomly seeking on every hit unnecessarily ...
This is already understood. The problem is the combination of bad JVM defaults and a problem in bulk copy from off-heap to heap! What you see is garbage collector driving crazy (see my flight control heap dumps). The disabled tiered compilation made hotspot optimizations and escape analysis only kick in after 6 seconds benchmark runtime (of total 60s on my machine). In this time it produced 3 java objects per copy operation (readBytes(), readLongs(), readFloats()). With tiered compilation this went down, but still 1 second producing tons of garbage.
With JDK 18 this will be solved in https://github.com/openjdk/panama-foreign/pull/555, where we get a new class MemoryCopy that allows to copy data off-heap to on-heap without allocating useless wrapper classes. In this pull request this line is producing 3 temporary objects: https://github.com/apache/lucene/blob/bacfa11940ecf505299ec8a2d43a94d64a106182/lucene/core/src/java/org/apache/lucene/store/MemorySegmentIndexInput.java#L167
MemorySegment.ofArray(b).asSlice(offset, len).copyFrom(curSegment.asSlice(curPosition, len));
From JDK 18 on, it will look like this:
MemoryCopy.copyToArray(curSegment, curPosition, b, offset, len);
This was negotiated today 👍
In general, we should avoid to copy bytes/longs/floats from mmap to on-heap. With the new vector API coming soon, there's no need for that. Just tell vector API to calculate something (e.g. PFOR or cosine) and give it a MemorySegment slice :-) So we should really really avoid all those stupid copy operations in bulks!!!
To read the whole story check here: https://github.com/openjdk/panama-foreign/pull/555#issuecomment-865672909 - there are also uptodate benchmarks.
So the benchmarking will stay noisy unless we use JMH.
Maybe we should switch
luceneutilto JMH? Or, poach ideas from it? How is it reducing JVM/hotspot noise without passing unrealistic JVM flags :)
I discussed with Robert about this. Yes, we should do this! But still we should do "real-world" benchmarks.
This was negotiated today 👍
To read the whole story check here: openjdk/panama-foreign#555 (comment)
Wow, awesome! I love that your efforts to get Lucene working well on top of these new Java-friendly mmap APIs is uncovering things to fix / iterations to iterate, thanks to unexpectedly slower Lucene faceting!
luceneutil throws away first few iterations of each task to allow for "warmup", but I guess it was not enough here.
Maybe we could fix luceneutil to do this more dynamically, e.g. introspect on when hotspot has "mostly" finished compiling the hotspots, and things have reached steady state, instead of the fixed "discard first N results per task"?
This was negotiated today 👍 To read the whole story check here: openjdk/panama-foreign#555 (comment)
Wow, awesome! I love that your efforts to get Lucene working well on top of these new Java-friendly
mmapAPIs is uncovering things to fix / iterations to iterate, thanks to unexpectedly slower Lucene faceting!
luceneutilthrows away first few iterations of each task to allow for "warmup", but I guess it was not enough here.Maybe we could fix
luceneutilto do this more dynamically, e.g. introspect on when hotspot has "mostly" finished compiling the hotspots, and things have reached steady state, instead of the fixed "discard first N results per task"?
I think this is what JMH should help us to do. The problem with luceneutil is also that it respawns a JVM multiple times. Maybe it should just execute more queries (the same ones in a loop) in the same JVM (like running 20 times as long instead of 20 separate JVMs) and then drop like half of the measurements. The problem is that Hotspot is optimizing all the time and it's hard to figure out when it is done.
The problem with luceneutil is also that it respawns a JVM multiple times.
Hmm, we added multiple JVMs long ago precisely because HotSpot was so unpredictable. I.e. we had clear examples where HotSpot would paint itself into a corner, compiling e.g. readVInt poorly and never re-compiling it, or something, such that no matter how long the benchmark ran, it would never reach as good performance as if you simply restarted the whole JVM and rolled the dice again. But maybe this situation has been improved and these were somehow early HotSpot bugs/issues and we could really remove multiple JVMs without harming how accurately we can extract the mean/variance performance of all our benchmark tasks?
The problem with luceneutil is also that it respawns a JVM multiple times.
Hmm, we added multiple JVMs long ago precisely because HotSpot was so unpredictable. I.e. we had clear examples where HotSpot would paint itself into a corner, compiling e.g.
readVIntpoorly and never re-compiling it, or something, such that no matter how long the benchmark ran, it would never reach as good performance as if you simply restarted the whole JVM and rolled the dice again. But maybe this situation has been improved and these were somehow early HotSpot bugs/issues and we could really remove multiple JVMs without harming how accurately we can extract the mean/variance performance of all our benchmark tasks?
This is also not reality: Would you restart your Elasticsearch server from time to time because you think there might be a broken readVInt() optimization?
Here are the berlinbuzzwords slides about this: https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf
Hmm, we added multiple JVMs long ago precisely because HotSpot was so unpredictable. I.e. we had clear examples where HotSpot would paint itself into a corner, compiling e.g. readVInt poorly and never re-compiling it, or something, such that no matter how long the benchmark ran, it would never reach as good performance as if you simply restarted the whole JVM and rolled the dice again. But maybe this situation has been improved and these were somehow early HotSpot bugs/issues and we could really remove multiple JVMs without harming how accurately we can extract the mean/variance performance of all our benchmark tasks?
the JMH will do this too. I forget the defaults, but uses multiple jvm iterations and iterations within each jvm and warmup iterations. But it has smarts around the JIT compiler and can dump profiled assembly for its microbenchmarks. I never have noise issues with it.
I think the issue is like "unit test" versus "integration test".
The current big "integration test" (lucene util) is useful for some things: e.g. something has to tell us there is pollution from too many java abstractions going megamorphic and so on :) But I think it would be improved by providing some more diagnostics (LogCompilation or whatever, maybe JIT stats in the JFR output). Let it be a "canary" to find little ways to improve.
But we have nothing setup to do simple noise-free microbenchmarks over some specific code, e.g. like "unit tests" running different query types. And for those you don't want crazy JFR and logging and stuff as it is so targeted, you can just dump the hot assembly code instead. For now if you want to do this, you are writing one-off stuff yourself.
The problem with luceneutil is also that it respawns a JVM multiple times.
Hmm, we added multiple JVMs long ago precisely because HotSpot was so unpredictable. I.e. we had clear examples where HotSpot would paint itself into a corner, compiling e.g.
readVIntpoorly and never re-compiling it, or something, such that no matter how long the benchmark ran, it would never reach as good performance as if you simply restarted the whole JVM and rolled the dice again. But maybe this situation has been improved and these were somehow early HotSpot bugs/issues and we could really remove multiple JVMs without harming how accurately we can extract the mean/variance performance of all our benchmark tasks?This is also not reality: Would you restart your Elasticsearch server from time to time because you think there might be a broken
readVInt()optimization?
Yeah, that is true! But perhaps it shouldn't be the case :) Maybe Elasticsearch/OpenSearch/Solr should spawn JVM a few times until they get a "good" readVInt compilation! The noisy mis-compilation was such a sizable impact (back then, hopefully not anymore?).
If we only ran benchmarks in nightly runs so that we could see that noise/variance with time, maybe we could do just one JVM.
But when a developer is trying to test an exciting optimization, in the privacy of their git clone, it really sucks to have hotspot noise completely drown out any small gains your optimization might show!
Benchmarking is hard :)
Here are the berlinbuzzwords slides about this: https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf
Oooh, thanks for sharing! The talk looks AWESOME! I will watch recording when it's out :) You should share these slides on Twitter too?
the JMH will do this too. I forget the defaults, but uses multiple jvm iterations and iterations within each jvm and warmup iterations. But it has smarts around the JIT compiler and can dump profiled assembly for its microbenchmarks. I never have noise issues with it.
Excellent!
The current big "integration test" (lucene util) is useful for some things: e.g. something has to tell us there is pollution from too many java abstractions going megamorphic and so on :)
+1
It really is more of an integration test, yeah. It runs many different kinds of queries/tasks, concurrently across multiple threads, trying to exercise Lucene roughly in a way that OpenSearch/Elasticsearch/Solr might. Though, it does not do concurrent indexing with searching in a single JVM, at least not with the default benchmarks. Really, distributed search engines should not do that -- they should rather use Lucene's near-real-time segment replication, which is more efficient if you have deep replicas, and also enables strong physical isolation of indexing and searching JVMs which have very different resources requirements! OK </soapbox>!
But I think it would be improved by providing some more diagnostics (LogCompilation or whatever, maybe JIT stats in the JFR output). Let it be a "canary" to find little ways to improve.
+1. I wonder if we could tap into those in real-time and get a sense of when the JVM really is roughly "warmed up", instead of the static "discard first N samples for each task" that we do now. Or maybe to detect mis-compilation of readVInt!
But we have nothing setup to do simple noise-free microbenchmarks over some specific code, e.g. like "unit tests" running different query types. And for those you don't want crazy JFR and logging and stuff as it is so targeted, you can just dump the hot assembly code instead. For now if you want to do this, you are writing one-off stuff yourself.
Yeah maybe consing up a quick JMH for such cases is perfectly fine solution for we developers?
Here are the berlinbuzzwords slides about this: https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf
Oooh, thanks for sharing! The talk looks AWESOME! I will watch recording when it's out :) You should share these slides on Twitter too?
I got he link yesterday by Nina, will post it later on twitter. I just had no time.
Yeah, that is true! But perhaps it shouldn't be the case :) Maybe Elasticsearch/OpenSearch/Solr should spawn JVM a few times until they get a "good" readVInt compilation! The noisy mis-compilation was such a sizable impact (back then, hopefully not anymore?).
If that is still there, show dumps of assembly and I will for sure open a bug report. This should not happen. At least not with tiered compilation. If you use batch compilation, of course it could be problematic, because it can't "re-optimize" easily. It has to wait for a trap caused by a wrong assumption and switch to interpreter first before trying agin.
But I think it would be improved by providing some more diagnostics (LogCompilation or whatever, maybe JIT stats in the JFR output). Let it be a "canary" to find little ways to improve.
+1. I wonder if we could tap into those in real-time and get a sense of when the JVM really is roughly "warmed up", instead of the static "discard first N samples for each task" that we do now. Or maybe to detect mis-compilation of
readVInt!
Unfortunately, you can't get the compilation events from inside the JVM, but with the help of the outer python process it might be possible:
The inner java process just benchmarks every round/query and does not throw away anything. After each round it prints the information in "machine readable form" to stdout. In addition we turn on -XX:+PrintCompilation on the JVM command line.
The outer python process just reads process output and reacts to events:
- if a benchmark query was finished it records the machine readable number
- if it gets a compilation event on stdout (some regex can catch it), it greps for some "hot method" like "readVInt" and once it sees a compilation event (with tiered you jave to look for compilation stage 4, also known as C2), it switches the flag and from now on it can use the numbers recorded
That's just an idea.
That's just an idea.
I like this idea! We could maybe identify "known bad hotspot compilation patterns" and retry the benchmark.
But the problem is (as you pointed out earlier), such mis-compilation also happens in "real" production usage, and including those (annoyingly bad) datapoints in nightly benchmarks would help us gauge how often this happens.
Hi @uschindler, I get the following compilation errors when I build your patch (Gradle build 6.3.8)
Task :altJvmWarning NOTE: Alternative java toolchain will be used for compilation and tests: Project will use Java 17 from: /mnt/c/Users/jatin/home-ubuntu/softwares/jdk-17 Gradle runs with Java 14 from: /usr/lib/jvm/java-14-openjdk-amd64 Task :errorProneSkipped WARNING: errorprone disabled (skipped on non-nightly runs) Task :lucene:core:compileJava warning: using incubating module(s): jdk.incubator.foreign /mnt/c/Users/jatin/home-ubuntu/sandboxes/vectorAPI-effort/lucene-jdk-foreign/lucene/lucene/core/src/java/org/apache/lucene/store/MemorySegmentIndexInput.java:27: error: cannot find symbol import jdk.incubator.foreign.ResourceScope; ^ symbol: class ResourceScope location: package jdk.incubator.foreign /mnt/c/Users/jatin/home-ubuntu/sandboxes/vectorAPI-effort/lucene-jdk-foreign/lucene/lucene/core/src/java/org/apache/lucene/store/MemorySegmentIndexInput.java:51: error: cannot find symbol final ResourceScope scope; ^ symbol: class ResourceScope
Your help will be greatly appriciated. Best Regards
Hi,
the current branch works with the release candidate builds of OpenJDK 17. If you get errors like this, you may possibly use a OpenJDK 17 version before Early Access Build 25.
Ca you post java -versionand tell us if you maybe compiledyour own version?