Initial rewrite of MMapDirectory for JDK-18 preview (incubating) Panama APIs (>= JDK-18-ea-b26)
INFO: This is a follow-up of #177: it's the same code base, but with the API changes from JDK 18 applied.
This is just a draft PR for a first insight on memory mapping improvements in JDK 18+.
Some background information: starting with JDK 14, there is a new incubating module "jdk.incubator.foreign" that provides a new, not yet stable API for accessing off-heap memory (and later it will also support calling functions located in native libraries like .so or .dll files through classical MethodHandles). This incubator module has gone through several versions:
- first version (JDK 14): https://openjdk.java.net/jeps/370 (slow, very buggy, and thread-confined, making it unusable with Lucene)
- second version (JDK 15): https://openjdk.java.net/jeps/383 (still thread-confined, but now allows transfer of "ownership" to other threads; this is still impossible to use with Lucene)
- third version in JDK 16: https://openjdk.java.net/jeps/393 (this version added "support for shared segments"). This finally allows us to safely use the same mmapped memory from different threads and also to unmap it! This was implemented in the previous pull request #173.
- fourth version in JDK 17: https://openjdk.java.net/jeps/412. This mainly changes the API around scopes. Instead of explicitly making segments "shared", we assign them to a resource scope that controls their behaviour. The resource scope is created once per IndexInput instance (not per clone) and owns all segments. When the resource scope is closed, all segments become invalid and we throw AlreadyClosedException. The big problem is slowness due to heavy allocation of new instances just to copy memory between segments and the Java heap, which drives the garbage collector crazy. This was implemented in the previous PR #177.
- fifth version in JDK 18, included in build 26: https://openjdk.java.net/jeps/419 (the version used here). This mainly cleans up the API. From Lucene's perspective, the MemorySegment API now has System.arraycopy()-like APIs to copy memory between heap and memory segments, which improves speed, and it handles byte-swapping automatically. This version of the PR also uses ValueLayout instead of var handles, as that makes the code more readable and type-safe.
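To make the copy/dereference discussion above concrete, here is a rough, self-contained sketch of reading from a memory-mapped segment. Note it is written against the finalized java.lang.foreign API (JDK 21+), whose names are close to, but not identical with, the jdk.incubator.foreign names used in this PR; the file path argument is only a placeholder.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MemorySegmentReadSketch {
  // little-endian long layout without alignment requirement (Lucene files are byte-packed)
  static final ValueLayout.OfLong LE_LONG =
      ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

  public static void main(String[] args) throws Exception {
    Path file = Path.of(args[0]); // placeholder: any existing, non-empty file
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
        Arena arena = Arena.ofShared()) {
      // map the whole file as one shared segment, usable from multiple threads
      MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);

      // single-value reads via ValueLayout (instead of raw var handles)
      byte firstByte = seg.get(ValueLayout.JAVA_BYTE, 0);
      long firstLong = seg.get(LE_LONG, 0);

      // System.arraycopy()-like bulk copy from the segment onto the Java heap
      byte[] onHeap = new byte[(int) Math.min(seg.byteSize(), 1024)];
      MemorySegment.copy(seg, ValueLayout.JAVA_BYTE, 0, onHeap, 0, onHeap.length);

      System.out.println(firstByte + " " + firstLong + " copied=" + onHeap.length);
    } // closing the arena unmaps the segment deterministically
  }
}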
This module more or less overcomes several problems:
- The ByteBuffer API is limited to 32-bit addressing (in fact MMapDirectory has to chunk files into 1 GiB portions)
- There is no official way to unmap ByteBuffers when the file is no longer used. There is a way to use sun.misc.Unsafe and forcefully unmap segments, but any IndexInput accessing the file from another thread will crash the JVM with SIGSEGV or SIGBUS. We learned to live with that and we happily apply the unsafe unmapping, but that's the main issue.
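For reference, the forceful-unmapping hack mentioned above boils down to something like the following sketch (the real MMapDirectory code does this more defensively, but the effect is the same):

import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import sun.misc.Unsafe;

public class ForcefulUnmapSketch {
  // Forcefully releases a mapping right away instead of waiting for GC.
  // If any other thread still reads from the buffer afterwards, the JVM
  // crashes hard with SIGSEGV/SIGBUS - exactly the risk described above.
  static void unmap(MappedByteBuffer buffer) throws ReflectiveOperationException {
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe unsafe = (Unsafe) f.get(null);
    unsafe.invokeCleaner(buffer); // available since JDK 9, lives in jdk.unsupported
  }
}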
@uschindler had many discussions with the team at OpenJDK, and finally, with the third incubator, we have an API that works for Lucene. These were very fruitful discussions (thanks to @mcimadamore!)
With the third incubator we are now finally able to do some tests (especially performance). As this is an incubating module, this PR first changes the build system a bit:
- disable -Werror for :lucene:core
- add the incubating module to the compiler of :lucene:core and enable it for all test builds. This is important, as you have to pass --add-modules jdk.incubator.foreign also at runtime!
The code basically just modifies MMapDirectory to use long instead of int for the chunk size parameter. In addition it adds MemorySegmentIndexInput, which is a copy of our ByteBufferIndexInput (still there, but unused) that uses MemorySegment instead of ByteBuffer behind the scenes. It works in exactly the same way; only the try/catch blocks for supporting EOFException or moving to another segment were rewritten.
It passes all tests and it looks like you can use it to read indexes. The default chunk size is now 16 GiB (but you can raise or lower it as you like; tests are doing this). Of course you can set it to Long.MAX_VALUE, in which case every index file is always mapped into one big memory mapping. My testing on Windows 10 has shown that this is not a good idea! Huge mappings fragment the address space over time, and as we can only use around 43 to 46 bits of it (depending on the OS), the fragmentation will at some point kill you. So 16 GiB looks like a good compromise: most files will be smaller than 6 GiB anyway (unless you optimize your index down to one huge segment), so for most Lucene installations the number of memory-mapped segments will equal the number of open files, and heavy consumers like Elasticsearch users will be very happy. The sysctl max_map_count may not need to be touched anymore.
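As a hypothetical usage sketch (assuming the constructor overload with a long chunk size that this PR describes), choosing a custom chunk size would look roughly like this:

import java.nio.file.Path;
import org.apache.lucene.store.MMapDirectory;

public class ChunkSizeSketch {
  public static void main(String[] args) throws Exception {
    // default in this PR is 16 GiB; here we ask for 32 GiB chunks instead
    long maxChunkSize = 32L * 1024 * 1024 * 1024;
    try (MMapDirectory dir = new MMapDirectory(Path.of(args[0]), maxChunkSize)) {
      // open IndexReader / IndexSearcher on "dir" as usual
      System.out.println("opened " + dir);
    }
  }
}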
In addition, this implements readLongs in a better way than @jpountz did (no caching or arbitrary objects). The new foreign-vector APIs will in the future also be written with MemorySegment in focus, so you can allocate a vector view on a MemorySegment and let the vectorizer work fully outside the Java heap, directly inside our mmapped files! :-)
It would be good if you could check out this branch and try it in production.
According to speed tests it should be as fast as the current MMapDirectory, partially even faster, because less switching between byte buffers is needed. With recent optimizations, long-based absolute access in loops should also be faster.
But be aware:
- You need JDK 11 or JDK 17 to run Gradle (set JAVA_HOME to it)
- You need JDK 18-ea-b26 (set RUNTIME_JAVA_HOME to it)
- The lucene-core.jar will consist of JDK 18 class files and requires JDK 18 to execute
- You also need to add --add-modules jdk.incubator.foreign to the command line of your Java program/Solr server/Elasticsearch server
It would be good to get some benchmarks, especially by @rmuir or @mikemccand. Take your time and enjoy the complexity of setting this up! ;-)
My plan is the following:
- report any bugs or slowness, especially with Hotspot optimizations. The last time I talked to Maurizio, he talked about Hotspot not being able to fully optimize for-loops with long instead of int, so it may take some time until the full performance is there.
- wait until the final version of project PANAMA-foreign goes into Java's core library (java.base, no module needed anymore)
- add an MR-JAR for lucene-core.jar and compile MemorySegmentIndexInput and maybe some helper classes with JDK 18/19 (hopefully?)
- add a self-standing, JDK-18-compiled module as an external JAR. This can be added to the classpath or module-path and be used by Elasticsearch or Solr. I will work on a Lucene-external project to do this.
In contrast to previous drafts, the branch was squashed into one commit. This makes review easier.
Take your time and enjoy the complexity of setting this up! ;-)
LOL! OK I will try to test this @uschindler :)
OK, thank you @uschindler and @rmuir for helping me debug the tricky setup! I ran this perf.py using luceneutil:
import sys
sys.path.insert(0, '/l/util/src/python')
import competition

if __name__ == '__main__':
    sourceData = competition.sourceData()
    comp = competition.Competition()
    checkout = 'trunk'
    checkoutNewMMap = 'trunk-new-mmap'
    index = comp.newIndex(checkout, sourceData, numThreads=12, addDVFields=True, verbose=True,
                          grouping=False, useCMS=True,
                          javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp',
                          analyzer = 'StandardAnalyzerNoStopWords',
                          facets = (('taxonomy:Date', 'Date'),
                                    ('taxonomy:Month', 'Month'),
                                    ('taxonomy:DayOfYear', 'DayOfYear'),
                                    ('taxonomy:RandomLabel.taxonomy', 'RandomLabel'),
                                    ('sortedset:Month', 'Month'),
                                    ('sortedset:DayOfYear', 'DayOfYear'),
                                    ('sortedset:RandomLabel.sortedset', 'RandomLabel')))
    comp.competitor('base', checkout, index=index,
                    javacCommand='/opt/jdk-18-ea-28/bin/javac',
                    javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')
    comp.competitor('new-mmap', checkoutNewMMap, index=index,
                    javacCommand='/opt/jdk-18-ea-28/bin/javac',
                    javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')
    comp.benchmark('new-mmap')
I set my JAVA_HOME to JDK 17 (17.0.1+12-LTS-39) and RUNTIME_JAVA_HOME to JDK 18-ea-b28 (18-ea+28-1975). I used git commit 119c7c29ae697a52c91116f2414f973509830267 from Lucene main, and then @uschindler's branch behind this PR.
Here's the results after 20 JVM iterations:
Task QPS base StdDevQPS new-mmap StdDev Pct diff p-value
BrowseMonthSSDVFacets 8.07 (12.6%) 7.18 (13.4%) -11.0% ( -32% - 17%) 0.008
BrowseMonthTaxoFacets 4.67 (5.7%) 4.33 (2.6%) -7.2% ( -14% - 1%) 0.000
BrowseRandomLabelSSDVFacets 5.34 (6.6%) 5.08 (6.4%) -4.9% ( -16% - 8%) 0.017
IntNRQ 49.91 (7.0%) 48.07 (2.3%) -3.7% ( -12% - 6%) 0.026
PKLookup 126.62 (4.6%) 122.06 (3.4%) -3.6% ( -11% - 4%) 0.005
BrowseDayOfYearSSDVFacets 7.46 (12.8%) 7.28 (16.8%) -2.5% ( -28% - 31%) 0.598
Respell 25.49 (1.1%) 24.97 (1.2%) -2.1% ( -4% - 0%) 0.000
Fuzzy1 40.18 (1.5%) 39.52 (1.4%) -1.7% ( -4% - 1%) 0.000
Fuzzy2 31.18 (1.8%) 30.67 (1.5%) -1.6% ( -4% - 1%) 0.002
HighSloppyPhrase 19.11 (5.7%) 18.99 (5.2%) -0.6% ( -10% - 10%) 0.710
Wildcard 59.01 (6.8%) 58.89 (6.9%) -0.2% ( -13% - 14%) 0.926
LowSloppyPhrase 14.92 (3.7%) 14.92 (3.4%) 0.0% ( -6% - 7%) 0.978
MedSloppyPhrase 117.00 (3.7%) 117.28 (3.2%) 0.2% ( -6% - 7%) 0.829
MedTermDayTaxoFacets 22.39 (3.3%) 22.51 (4.2%) 0.5% ( -6% - 8%) 0.649
Prefix3 62.59 (5.3%) 62.99 (5.8%) 0.6% ( -9% - 12%) 0.713
BrowseRandomLabelTaxoFacets 3.93 (3.9%) 3.95 (6.3%) 0.7% ( -9% - 11%) 0.669
LowTerm 678.95 (3.2%) 684.44 (4.4%) 0.8% ( -6% - 8%) 0.505
OrHighMed 61.65 (2.9%) 62.22 (2.1%) 0.9% ( -3% - 6%) 0.252
AndHighHighDayTaxoFacets 5.64 (4.5%) 5.70 (4.1%) 1.0% ( -7% - 10%) 0.450
OrHighHigh 16.45 (3.1%) 16.63 (2.3%) 1.1% ( -4% - 6%) 0.220
MedPhrase 157.72 (2.1%) 159.52 (2.5%) 1.1% ( -3% - 5%) 0.117
HighPhrase 110.71 (3.9%) 112.10 (2.7%) 1.3% ( -5% - 8%) 0.237
OrHighLow 270.14 (3.2%) 274.07 (3.0%) 1.5% ( -4% - 7%) 0.135
HighTermTitleBDVSort 7.37 (3.7%) 7.49 (3.2%) 1.5% ( -5% - 8%) 0.170
AndHighHigh 44.95 (5.4%) 45.63 (4.6%) 1.5% ( -7% - 12%) 0.336
HighSpanNear 7.27 (6.4%) 7.39 (5.2%) 1.6% ( -9% - 14%) 0.390
BrowseDayOfYearTaxoFacets 4.37 (7.5%) 4.45 (9.8%) 1.8% ( -14% - 20%) 0.512
AndHighMedDayTaxoFacets 63.88 (2.6%) 65.05 (1.3%) 1.8% ( -2% - 5%) 0.005
BrowseDateTaxoFacets 4.37 (7.6%) 4.45 (10.0%) 1.8% ( -14% - 20%) 0.513
TermDTSort 379.61 (2.6%) 386.94 (2.2%) 1.9% ( -2% - 6%) 0.011
OrHighMedDayTaxoFacets 5.48 (3.4%) 5.59 (4.5%) 2.0% ( -5% - 10%) 0.113
MedSpanNear 3.79 (2.3%) 3.86 (3.7%) 2.0% ( -3% - 8%) 0.042
HighTermDayOfYearSort 1151.05 (4.4%) 1174.57 (6.2%) 2.0% ( -8% - 13%) 0.227
AndHighMed 56.38 (5.3%) 57.64 (5.9%) 2.2% ( -8% - 14%) 0.208
HighTerm 976.99 (6.7%) 1002.21 (6.8%) 2.6% ( -10% - 17%) 0.225
LowIntervalsOrdered 12.43 (4.8%) 12.77 (5.2%) 2.8% ( -6% - 13%) 0.079
LowSpanNear 9.60 (2.4%) 9.87 (1.4%) 2.8% ( 0% - 6%) 0.000
OrHighNotMed 598.12 (4.1%) 614.79 (4.2%) 2.8% ( -5% - 11%) 0.034
HighTermMonthSort 42.77 (14.2%) 44.03 (19.5%) 3.0% ( -26% - 42%) 0.584
MedIntervalsOrdered 29.73 (4.0%) 30.68 (4.5%) 3.2% ( -5% - 12%) 0.017
OrNotHighHigh 555.82 (3.9%) 573.67 (4.3%) 3.2% ( -4% - 11%) 0.013
HighIntervalsOrdered 4.36 (6.5%) 4.50 (5.9%) 3.3% ( -8% - 16%) 0.094
OrHighNotLow 699.58 (5.0%) 723.40 (5.0%) 3.4% ( -6% - 14%) 0.031
OrNotHighMed 511.29 (3.9%) 529.02 (3.6%) 3.5% ( -3% - 11%) 0.004
OrNotHighLow 419.51 (3.9%) 434.62 (2.6%) 3.6% ( -2% - 10%) 0.000
LowPhrase 241.42 (3.2%) 250.97 (2.1%) 4.0% ( -1% - 9%) 0.000
OrHighNotHigh 562.96 (3.9%) 585.87 (3.9%) 4.1% ( -3% - 12%) 0.001
AndHighLow 293.83 (5.5%) 306.09 (1.8%) 4.2% ( -2% - 12%) 0.001
MedTerm 1022.47 (6.6%) 1066.29 (4.4%) 4.3% ( -6% - 16%) 0.015
SSDV and Taxo facets maybe got a bit slower, and lots of queries got a bit faster.
This was the merged CPU profile results for this new mmap impl:
PROFILE SUMMARY from 894683 events (total: 894683)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
4.27% 38211 org.apache.lucene.index.SingletonSortedNumericDocValues#nextDoc()
4.15% 37164 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
3.56% 31835 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$20#ordValue()
2.93% 26214 org.apache.lucene.util.packed.DirectReader$DirectPackedReader20#get()
2.87% 25641 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$4#longValue()
2.47% 22090 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
2.43% 21784 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
2.17% 19392 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$3#longValue()
2.10% 18801 org.apache.lucene.search.ConjunctionDISI#doNext()
2.10% 18781 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
1.97% 17597 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#skipPositions()
1.93% 17238 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBoundsSmall()
1.85% 16576 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
1.81% 16231 org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
1.74% 15561 org.apache.lucene.queries.intervals.OrderedIntervalsSource$OrderedIntervalIterator#nextInterval()
1.73% 15498 org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#readByte()
1.53% 13721 jdk.internal.misc.ScopedMemoryAccess#getIntUnalignedInternal()
1.49% 13317 jdk.internal.foreign.AbstractMemorySegmentImpl#isSet()
1.38% 12362 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
1.34% 12016 org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
1.16% 10395 org.apache.lucene.search.TermScorer#score()
1.16% 10338 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()
1.10% 9856 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
1.01% 9014 org.apache.lucene.queries.intervals.IntervalFilter#nextInterval()
0.96% 8580 jdk.internal.foreign.SharedScope#checkValidState()
0.93% 8349 org.apache.lucene.index.SingletonSortedSetDocValues#getValueCount()
0.90% 8020 org.apache.lucene.search.ScoreCachingWrappingScorer#score()
0.86% 7654 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$DenseNumericDocValues#advance()
0.82% 7361 org.apache.lucene.queries.spans.SpanScorer#setFreqCurrentDoc()
0.82% 7328 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
versus baseline CPU JFR profiler results:
PROFILE SUMMARY from 894453 events (total: 894453)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
5.93% 53070 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$20#ordValue()
4.26% 38078 org.apache.lucene.index.SingletonSortedNumericDocValues#nextDoc()
3.84% 34318 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
3.65% 32685 jdk.internal.misc.Unsafe#convEndian()
2.86% 25554 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$4#longValue()
2.74% 24483 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
2.64% 23617 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
2.18% 19515 org.apache.lucene.search.ConjunctionDISI#doNext()
2.17% 19373 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
2.12% 18958 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$3#longValue()
1.93% 17298 org.apache.lucene.util.packed.DirectReader$DirectPackedReader20#get()
1.93% 17258 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
1.82% 16284 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#skipPositions()
1.75% 15647 org.apache.lucene.search.TermScorer#score()
1.71% 15292 org.apache.lucene.codecs.lucene90.ForUtil#expand8()
1.67% 14979 org.apache.lucene.queries.intervals.OrderedIntervalsSource$OrderedIntervalIterator#nextInterval()
1.65% 14744 org.apache.lucene.store.ByteBufferGuard#ensureValid()
1.57% 14061 org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
1.15% 10247 org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
1.14% 10222 java.util.Objects#checkIndex()
1.12% 9990 java.nio.Buffer#scope()
1.06% 9459 org.apache.lucene.store.ByteBufferGuard#getByte()
0.98% 8724 org.apache.lucene.queries.intervals.IntervalFilter#nextInterval()
0.91% 8179 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
0.88% 7906 org.apache.lucene.search.ScoreCachingWrappingScorer#score()
0.88% 7867 org.apache.lucene.store.ByteBufferIndexInput#buildSlice()
0.87% 7823 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$DenseNumericDocValues#advance()
0.87% 7789 org.apache.lucene.store.ByteBufferGuard#getInt()
0.84% 7518 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
0.74% 6639 org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
It's curious how costly SingletonSortedNumericDocValues#nextDoc is. I think these facet fields are dense.
Thanks Mike, I saw similar results a month ago.
It is very important to do the following:
- don't disable tiered compilation and don't enable batch
- use large index and long lifetime of a single JVM
Reason: Panama uses many modern language features like var handles that involve costly runtime library bootstraps (such as dynamically generated bytecode). If you disable tiered compilation, the JIT can't adapt to this, because the optimization work is split between the runtime library and Hotspot. This only works well with tiered compilation enabled.
Also, here are the heap JFR results for base:
PROFILE SUMMARY from 2423 events (total: 94219M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
9.76% 9191M org.apache.lucene.util.FixedBitSet#<init>()
6.93% 6527M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
6.07% 5721M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
5.77% 5437M org.apache.lucene.facet.FacetsConfig#stringToPath()
5.20% 4896M java.util.concurrent.locks.AbstractQueuedSynchronizer#acquire()
3.65% 3436M perf.StatisticsHelper#startStatistics()
3.65% 3436M java.util.concurrent.CopyOnWriteArrayList#iterator()
3.46% 3255M java.lang.StringUTF16#compress()
3.39% 3195M java.util.ArrayList#grow()
2.97% 2800M org.apache.lucene.util.BytesRef#utf8ToString()
2.91% 2743M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#<init>()
2.69% 2534M jdk.internal.misc.Unsafe#allocateUninitializedArray()
2.54% 2396M java.util.AbstractList#iterator()
2.33% 2199M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
2.20% 2070M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
2.01% 1889M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
1.82% 1718M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
1.81% 1709M org.apache.lucene.util.BytesRef#<init>()
1.48% 1392M org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()
1.29% 1219M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
1.25% 1176M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
1.23% 1159M org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$TermsDict#decompressBlock()
1.20% 1133M jdk.internal.foreign.MappedMemorySegmentImpl#dup()
1.14% 1073M org.apache.lucene.search.BooleanScorer#<init>()
1.03% 970M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
1.03% 970M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
0.97% 910M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getArc()
0.88% 833M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
0.88% 833M org.apache.lucene.codecs.lucene90.ForUtil#<init>()
0.87% 816M
and for new-mmap:
PROFILE SUMMARY from 2424 events (total: 96906M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
10.17% 9854M org.apache.lucene.util.FixedBitSet#<init>()
6.96% 6743M org.apache.lucene.facet.FacetsConfig#stringToPath()
6.84% 6631M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
5.16% 4999M java.util.concurrent.locks.AbstractQueuedSynchronizer#acquire()
4.18% 4054M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
3.58% 3470M org.apache.lucene.util.BytesRef#utf8ToString()
3.55% 3436M perf.StatisticsHelper#startStatistics()
3.55% 3436M java.util.concurrent.CopyOnWriteArrayList#iterator()
3.49% 3379M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#<init>()
2.77% 2688M java.util.ArrayList#grow()
2.71% 2628M java.lang.StringUTF16#compress()
2.54% 2465M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
2.15% 2087M java.util.AbstractList#iterator()
2.05% 1984M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
1.94% 1881M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
1.84% 1786M jdk.internal.misc.Unsafe#allocateUninitializedArray()
1.74% 1683M java.nio.DirectByteBufferR#duplicate()
1.65% 1598M org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()
1.47% 1426M org.apache.lucene.util.BytesRef#<init>()
1.40% 1357M java.nio.DirectByteBufferR#asLongBuffer()
1.27% 1228M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
1.24% 1202M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
1.24% 1202M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
1.15% 1116M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
1.13% 1090M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
1.10% 1065M org.apache.lucene.store.ByteBufferIndexInput#newCloneInstance()
1.05% 1013M java.nio.DirectByteBufferR#slice()
0.91% 884M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
0.88% 850M org.apache.lucene.search.BooleanScorer#<init>()
0.86% 833M
It is very important to do the following:
- don't disable tiered compilation and don't enable batch
- use large index and long lifetime of a single JVM
Thanks @uschindler -- I think I am not disabling tiered compilation.
I run 20 iterations of each task, but it is relatively quick (~13-14 seconds per JVM instance). I can retry with more per-task iterations to see if the tiered compilation can improve things later on.
Here's the full JVM command of base and competitor:
iter 19
new-mmap:
log: /l/logs.nightly/prefix.new-mmap.19 + stdout
run: /opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/\
python/profiling.jfc,filename=/l/util/bench-search-prefix-new-mmap-19.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk-new-mmap/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/\
trunk-new-mmap/lucene/core/build/classes/java/test:/l/trunk-new-mmap/lucene/sandbox/build/classes/java/main:/l/trunk-new-mmap/lucene/misc/build/classes/java/main:/l/trunk-new-mmap/lucene/facet/build/classes/java/main\
:/l/trunk-new-mmap/lucene/analysis/common/build/classes/java/main:/l/trunk-new-mmap/lucene/analysis/icu/build/classes/java/main:/l/trunk-new-mmap/lucene/queryparser/build/classes/java/main:/l/trunk-new-mmap/lucene/gr\
ouping/build/classes/java/main:/l/trunk-new-mmap/lucene/suggest/build/classes/java/main:/l/trunk-new-mmap/lucene/highlighter/build/classes/java/main:/l/trunk-new-mmap/lucene/codecs/build/classes/java/main:/l/trunk-ne\
w-mmap/lucene/queries/build/classes/java/main:/home/mike/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.0/fcc952fb6d378266b943bef9f15e67a4d45cfa88/hppc-0.9.0.jar:/l/util/lib/HdrHistogram.jar:/l/util/bui\
ld perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/wikimediumall.trunk.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.taxonomy:RandomLabel.taxonomy.sortedset:Month.sortedset:DayOfYear.sorteds\
et:RandomLabel.sortedset.Lucene90.Lucene90.dvfields.nd27.625M -facets taxonomy:Date;Date -facets taxonomy:Month;Month -facets taxonomy:DayOfYear;DayOfYear -facets taxonomy:RandomLabel.taxonomy;RandomLabel -facets sor\
tedset:Month;Month -facets sortedset:DayOfYear;DayOfYear -facets sortedset:RandomLabel.sortedset;RandomLabel -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/tasks/wikimedium.10M.nostopwords.tasks -searchThr\
eadCount 6 -taskRepeatCount 20 -field body -tasksPerCat 1 -staticSeed -291966 -seed -5054409 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs.nightly/prefix.new-mmap.19 -topN 10\
-pk
13.1 s
base:
log: /l/logs.nightly/prefix.base.19 + stdout
run: /opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/\
python/profiling.jfc,filename=/l/util/bench-search-prefix-base-19.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/\
core/build/classes/java/test:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/ja\
va/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/classes/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/tru\
nk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/home/mike/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.0/f\
cc952fb6d378266b943bef9f15e67a4d45cfa88/hppc-0.9.0.jar:/l/util/lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/wikimediumall.trunk.facets.taxonomy:Date.taxonomy:Mon\
th.taxonomy:DayOfYear.taxonomy:RandomLabel.taxonomy.sortedset:Month.sortedset:DayOfYear.sortedset:RandomLabel.sortedset.Lucene90.Lucene90.dvfields.nd27.625M -facets taxonomy:Date;Date -facets taxonomy:Month;Month -fa\
cets taxonomy:DayOfYear;DayOfYear -facets taxonomy:RandomLabel.taxonomy;RandomLabel -facets sortedset:Month;Month -facets sortedset:DayOfYear;DayOfYear -facets sortedset:RandomLabel.sortedset;RandomLabel -analyzer St\
andardAnalyzerNoStopWords -taskSource /l/util/tasks/wikimedium.10M.nostopwords.tasks -searchThreadCount 6 -taskRepeatCount 20 -field body -tasksPerCat 1 -staticSeed -291966 -seed -5054409 -similarity BM25Similarity -\
commit multi -hiliteImpl FastVectorHighlighter -log /l/logs.nightly/prefix.base.19 -topN 10 -pk
13.7 s
Hi @dweiss, I'm so sorry. I just merged yesterday but didn't run the tests. The test is obsolete on the branch.
No worries - just tweak whatever you want. I just didn't want jenkins to complain over and over.
I corrected the test to assert the jdk.incubator.foreign module. But as this won't compile without it, the test is in fact obsolete. For easier merging in the future I left it in. I also modified the previous JDK 17 and JDK 16 pull requests.
Next step is to merge up the changes from yesterday.
I re-ran benchmarks with more iterations per task (200 vs 20 before) to let hotspot have more time to optimize in each of the 20 JVMs:
Task QPS base StdDevQPS new-mmap StdDev Pct diff p-value
BrowseMonthTaxoFacets 4.71 (6.2%) 4.39 (3.3%) -6.8% ( -15% - 2%) 0.000
BrowseMonthSSDVFacets 7.86 (14.8%) 7.34 (10.9%) -6.7% ( -28% - 22%) 0.104
HighTermMonthSort 59.25 (12.5%) 57.42 (11.8%) -3.1% ( -24% - 24%) 0.420
PKLookup 138.18 (1.1%) 135.03 (1.6%) -2.3% ( -4% - 0%) 0.000
Fuzzy1 45.86 (1.3%) 45.15 (1.5%) -1.6% ( -4% - 1%) 0.000
HighSloppyPhrase 7.90 (5.1%) 7.78 (5.1%) -1.6% ( -11% - 9%) 0.332
Fuzzy2 38.89 (1.1%) 38.30 (1.3%) -1.5% ( -3% - 0%) 0.000
MedPhrase 14.90 (2.7%) 14.72 (3.6%) -1.2% ( -7% - 5%) 0.230
Respell 36.40 (1.2%) 36.05 (1.5%) -0.9% ( -3% - 1%) 0.029
LowPhrase 21.59 (2.1%) 21.43 (2.5%) -0.7% ( -5% - 3%) 0.304
BrowseDayOfYearTaxoFacets 4.58 (12.3%) 4.56 (10.4%) -0.6% ( -20% - 25%) 0.861
BrowseRandomLabelTaxoFacets 4.06 (8.1%) 4.04 (7.5%) -0.6% ( -14% - 16%) 0.817
BrowseDateTaxoFacets 4.57 (12.2%) 4.55 (10.5%) -0.4% ( -20% - 25%) 0.904
OrHighHigh 18.30 (3.4%) 18.22 (2.5%) -0.4% ( -6% - 5%) 0.663
Wildcard 42.52 (13.2%) 42.55 (12.7%) 0.1% ( -22% - 29%) 0.989
AndHighHighDayTaxoFacets 21.35 (2.1%) 21.36 (1.8%) 0.1% ( -3% - 4%) 0.926
OrHighMed 91.02 (3.1%) 91.16 (2.4%) 0.2% ( -5% - 5%) 0.860
MedTermDayTaxoFacets 22.74 (5.0%) 22.78 (3.0%) 0.2% ( -7% - 8%) 0.902
HighIntervalsOrdered 3.79 (2.9%) 3.80 (3.3%) 0.2% ( -5% - 6%) 0.814
IntNRQ 93.38 (8.2%) 93.60 (1.0%) 0.2% ( -8% - 10%) 0.897
HighTermTitleBDVSort 4.38 (5.9%) 4.40 (4.6%) 0.3% ( -9% - 11%) 0.841
OrHighMedDayTaxoFacets 8.05 (3.2%) 8.08 (4.2%) 0.3% ( -6% - 8%) 0.770
AndHighHigh 23.23 (2.8%) 23.31 (3.5%) 0.4% ( -5% - 6%) 0.705
AndHighMedDayTaxoFacets 42.68 (2.2%) 42.87 (1.6%) 0.5% ( -3% - 4%) 0.455
TermDTSort 667.66 (1.7%) 670.71 (1.3%) 0.5% ( -2% - 3%) 0.342
AndHighMed 98.30 (3.1%) 98.84 (3.3%) 0.5% ( -5% - 7%) 0.589
Prefix3 33.28 (3.4%) 33.47 (4.3%) 0.6% ( -6% - 8%) 0.645
OrHighLow 404.97 (3.6%) 407.82 (2.1%) 0.7% ( -4% - 6%) 0.446
LowSloppyPhrase 62.46 (2.6%) 62.91 (2.1%) 0.7% ( -3% - 5%) 0.334
OrNotHighHigh 713.35 (3.2%) 719.56 (2.5%) 0.9% ( -4% - 6%) 0.332
LowTerm 951.78 (1.9%) 960.95 (1.7%) 1.0% ( -2% - 4%) 0.093
MedSloppyPhrase 28.24 (2.4%) 28.57 (2.3%) 1.2% ( -3% - 5%) 0.115
MedTerm 1173.70 (2.1%) 1187.71 (2.7%) 1.2% ( -3% - 6%) 0.121
OrNotHighMed 576.27 (2.1%) 583.63 (1.7%) 1.3% ( -2% - 5%) 0.035
HighPhrase 201.44 (3.4%) 204.22 (1.7%) 1.4% ( -3% - 6%) 0.101
HighTermDayOfYearSort 1822.34 (2.1%) 1847.93 (1.5%) 1.4% ( -2% - 5%) 0.015
BrowseRandomLabelSSDVFacets 5.22 (7.4%) 5.30 (7.2%) 1.5% ( -12% - 17%) 0.515
OrHighNotLow 811.72 (3.1%) 824.52 (2.7%) 1.6% ( -4% - 7%) 0.086
HighTerm 963.95 (3.3%) 979.31 (3.6%) 1.6% ( -5% - 8%) 0.142
LowIntervalsOrdered 55.56 (3.7%) 56.49 (3.3%) 1.7% ( -5% - 8%) 0.133
MedIntervalsOrdered 12.72 (3.9%) 12.95 (3.7%) 1.8% ( -5% - 9%) 0.141
MedSpanNear 7.21 (3.7%) 7.37 (3.2%) 2.1% ( -4% - 9%) 0.049
OrHighNotHigh 781.73 (2.7%) 799.12 (2.7%) 2.2% ( -3% - 7%) 0.009
AndHighLow 514.79 (4.0%) 526.55 (1.3%) 2.3% ( -2% - 7%) 0.015
LowSpanNear 22.27 (2.6%) 22.80 (2.2%) 2.4% ( -2% - 7%) 0.002
OrNotHighLow 477.71 (5.1%) 489.15 (1.3%) 2.4% ( -3% - 9%) 0.042
OrHighNotMed 913.86 (2.3%) 935.89 (2.3%) 2.4% ( -2% - 7%) 0.001
HighSpanNear 14.42 (4.9%) 14.83 (3.9%) 2.8% ( -5% - 12%) 0.044
BrowseDayOfYearSSDVFacets 7.33 (21.1%) 7.79 (17.2%) 6.3% ( -26% - 56%) 0.301
Thanks Mike. It is much more stable now (std dev) and on average around 0% difference. We should figure out why it gets faster in some parts while slower in others.
What is different:
- some parts use direct access off-heap
- some parts mainly copy byte arrays between mmap and heap and do the work on-heap
From what I have learned, copy operations have high overhead because:
- they are not hot, so they aren't optimized as quickly
- when not optimized, the setup cost is high (lots of class checks to get the array type, the decision about swapping bytes). This is especially heavy for small arrays.
When discussing this with Robert, it looked like it might be better to just use a simple copy loop. This affects long[] arrays, as those are < 64 entries. We can test this easily by commenting out the copy methods for floats and longs, so the code falls back to the default implementation in IndexInput.
I just had no time to test this.
But in the long term we should do everything off-heap, especially the vector stuff. For that we need to change IndexInput to allow returning FloatVector or LongVector instances backed by off-heap memory. The default implementation would just copy as before and return a view.
But that needs to wait until vector API goes out of incubator and preview phases.
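To illustrate the bulk-copy vs. copy-loop trade-off discussed above, here is a minimal sketch. It uses the finalized java.lang.foreign names (JDK 21+) rather than the incubator API, and is not the actual MemorySegmentIndexInput code.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.util.Arrays;

public class SmallCopySketch {
  static final ValueLayout.OfLong LE_LONG =
      ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

  // bulk variant: a single copy call, but with a fixed per-call setup cost
  static void readLongsBulk(MemorySegment seg, long pos, long[] dst, int off, int len) {
    MemorySegment.copy(seg, LE_LONG, pos, dst, off, len);
  }

  // loop variant: trivial per-element gets; may win for very small len (e.g. < 64)
  static void readLongsLoop(MemorySegment seg, long pos, long[] dst, int off, int len) {
    for (int i = 0; i < len; i++) {
      dst[off + i] = seg.get(LE_LONG, pos + (long) i * Long.BYTES);
    }
  }

  public static void main(String[] args) {
    try (Arena arena = Arena.ofConfined()) {
      MemorySegment seg = arena.allocate(64L * Long.BYTES);
      long[] a = new long[8], b = new long[8];
      readLongsBulk(seg, 0, a, 0, a.length);
      readLongsLoop(seg, 0, b, 0, b.length);
      System.out.println(Arrays.equals(a, b)); // both strategies read the same bytes
    }
  }
}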
From what I have learned, copy operations have high overhead because:
- they are not hot, so aren't optimized so fast
- when not optimized, the setup cost is high (lots of class checks to get array type, decision for swapping bytes). This is especially heavy for small arrays.
Hi, I'm not sure as to why copy operations should be slower in the memory access API than with the ByteBuffer API. I would expect most of the checks to be similar (except for the liveness tests of the segment involved). I do recall that the ByteBuffer API does optimize bulk copy for very small buffers (I don't recall what the limit is, but it was very very low, like 4 elements or something).
In principle, this JVM fix (in JDK 18) should help too: https://bugs.openjdk.java.net/browse/JDK-8269119
I'm working on a similar approach for my data store, but I'm currently not sure if it's a good idea for multiple readers plus a single reader/writer to map a segment for each reader. I guess the OS will then share the mapped regions/pages between the mapped memory segments? I'm not sure if it's the same approach in Lucene, i.e. whether you'd create multiple IndexInputs for multiple index readers, because you also seem to have a clone method (but it will fail once the segments are closed by one reader).
On another note, what's your take on this (Andy and Victor are real geniuses regarding database systems)? http://cidrdb.org/cidr2022/papers/p13-crotty.pdf
I'm working on a similar approach for my data store, but I'm currently not sure if it's a good idea for multiple readers plus a single reader/writer to map a segment for each reader. I guess the OS will then share the mapped regions/pages between the mapped memory segments? I'm not sure if it's the same approach in Lucene, i.e. whether you'd create multiple IndexInputs for multiple index readers, because you also seem to have a clone method (but it will fail once the segments are closed by one reader).
This PR does not change anything in Lucene's current behaviour. The code using MappedByteBuffer behaves the same way, and there are also no multiple mappings. If a user opens several IndexReaders on the same index, that's not our fault; well-behaved Lucene code only opens a single IndexReader.
The clone() method is used for several threads. There is no remapping; we only refcount the ResourceScope with Panama. If you close the main index, the clones used by different threads really should fail then - that's the improvement here.
On another note, what's your take on this (Andy and Victor are real geniuses regarding database systems)? http://cidrdb.org/cidr2022/papers/p13-crotty.pdf
We don't agree with that for Lucene:
- the model behind Lucene is different: all files are write-once, so there are no updates to files that were written before. mmap is only used on files that are never changed anymore, and paging works very well with those.
- we do not write with mmap; Lucene index files are written with standard output streams
So, if the writer adds something to the Lucene index (not via mmap), new index readers will create a new IndexInput with new mapped memory segments, plus clones with the same segments for different threads, right? Isn't it a valid use case here to have multiple index readers, or are you supposed to close the index reader and the clones first?
In my case (SirixDB) you're supposed to create multiple read-only transactions bound to a specific revision if you want to use different threads, for instance (currently I share a memory segment). However, the writer appends to the data file, and a new read-only trx thus has to get a new mapping of the file (to read the most recently committed revision). Thus, either a new memory segment must be set for all readers, guarded with a Semaphore for instance, or I'll have to use multiple memory segments, but I guess that's a bad approach.
Either way it would be similar (multiple readers and a single writer which only ever appends data). In your case you're cloning the IndexInput and appending to the data file(s) without a memory mapping. However, it's not clear to me if it's a valid use case for Lucene to have multiple index readers, but you suggest that it's not the approach to use. But how do you make sure that new index readers (cloned or not) will see all the changes?
quoting from Uwe:
All files are write-once so there are no updates to files which were written before.
This is the key piece that I think you are missing. We write files once, that's it. No appending to them after the fact or anything like that. Any new changes will be written to new, different files.
Oh right, thanks. That's the big difference.
Thanks @rmuir for the clarification.
To add to this, because @mcimadamore also asked: we use shared segments because we only allocate and map each segment once. It is then used by multiple threads. The IndexReader opens every new file only once and then mmaps it. Several search threads may access the mapped files concurrently (very small files are kept on heap until the next commit; this is what NRTCachingDirectory does). Every search thread may use a clone of the IndexInput, because the IndexInput has per-thread state (like the read position), but the underlying memory segments are reused, and the IndexReader only closes the "main" IndexInput. Any clones then become invalid.
On changes to the index and after the final commit, new files are written to disk and fsynced (including the directory metadata). The IndexReader gets reopened, mmaps the new files it sees, and releases old, no longer used ones by closing them.
Any thread that still accesses already closed files will get an AlreadyClosedException (previously such access may have segfaulted due to the forceful unmapping of MappedByteBuffer). With MMapDirectory using MemorySegments, this is detected as an IllegalStateException, transformed into AlreadyClosedException, and seen by the search threads. So all is sane.
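The exception translation described above boils down to a pattern like the following (an illustrative sketch, not the exact MemorySegmentIndexInput code):

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import org.apache.lucene.store.AlreadyClosedException;

final class GuardedReadSketch {
  private final MemorySegment segment;

  GuardedReadSketch(MemorySegment segment) {
    this.segment = segment;
  }

  byte readByte(long pos) {
    try {
      return segment.get(ValueLayout.JAVA_BYTE, pos);
    } catch (IllegalStateException e) {
      // the shared scope/arena owning the segment was closed by the "main" IndexInput
      throw new AlreadyClosedException("Already closed: this IndexInput cannot be used anymore");
    }
  }
}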
Thanks for your great explanation. Makes a lot of sense.
Do you know if the MAP_SHARED flag is set for mapped memory segments? I guess this means that even if I open a few mapped memory segments on the same file (even if they only have overlapping regions and the segments grow in size as the file grows), the virtual address space and the loaded pages will be shared when opened in the same FileChannel.MapMode (READ_ONLY), for instance. I also think setting madvise for random index accesses would be an advantage in my case, as the main trie index access pattern might be random.
MAP_SHARED
Share this mapping. Updates to the mapping are visible to other
processes mapping the same region, and (in the case of file-
backed mappings) are carried through to the underlying file.
(To precisely control when updates are carried through to the
underlying file requires the use of msync(2).)
Hi, IIRC the SHARED flag should be set - that said, with the foreign API it is also possible to define custom memory-mapped segments, if some of the defaults picked by the JDK are not suitable. A few months ago I put together a Gist [1] to illustrate that; I have updated it to reflect the new API changes. Perhaps Lucene might, one day, take advantage of this.
[1] - https://gist.github.com/mcimadamore/128ee904157bb6c729a10596e69edffd
Closing this as the JDK 19 impl was merged (#912).