Initial rewrite of MMapDirectory for JDK-18 preview (incubating) Panama APIs (>= JDK-18-ea-b26)
INFO: This is a follow-up of #177: it's the same code base, but with the API changes from JDK 18 applied.
This is just a draft PR for a first insight on memory mapping improvements in JDK 18+.
Some background information: starting with JDK 14, there is a new incubating module "jdk.incubator.foreign" that provides a new, not yet stable API for accessing off-heap memory (and later it will also support calling functions located in native libraries like .so or .dll files through classical MethodHandles). This incubator module has gone through several versions:
- first version (JDK 14): https://openjdk.java.net/jeps/370 (slow, very buggy, and thread-confined, making it unusable with Lucene)
- second version (JDK 15): https://openjdk.java.net/jeps/383 (still thread-confined, but now allows transfer of "ownership" to other threads; this is still impossible to use with Lucene)
- third version in JDK 16: https://openjdk.java.net/jeps/393 (this version added "support for shared segments"). This finally allows us to safely use the same mmapped memory from different threads and also to unmap it! This was implemented in the previous pull request #173.
- fourth version in JDK 17: https://openjdk.java.net/jeps/412. This mainly changes the API around scopes. Instead of explicitly making segments "shared", we assign them to a resource scope that controls their behaviour. The resource scope is created once per IndexInput instance (not per clone) and owns all segments. When the resource scope is closed, all segments become invalid and we throw AlreadyClosedException. The big problem is slowness due to heavy allocation of new instances just to copy memory between segments and the Java heap, which drives the garbage collector crazy. This was implemented in the previous PR #177.
- fifth version in JDK 18, included in build 26: https://openjdk.java.net/jeps/419 (the version used here). This mainly cleans up the API. From Lucene's perspective, the MemorySegment API now has System.arraycopy()-like APIs to copy memory between heap and memory segments, which improves speed, and it handles byte-swapping automatically. This version of the PR also uses ValueLayout instead of var handles, as that makes the code more readable and type-safe.
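To make the copy/dereference discussion above concrete, here is a rough, self-contained sketch of reading from a memory-mapped segment. Note it is written against the finalized java.lang.foreign API (JDK 21+), whose names are close to, but not identical with, the jdk.incubator.foreign names used in this PR; the file path argument is only a placeholder.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MemorySegmentReadSketch {
  // little-endian long layout without alignment requirement (Lucene files are byte-packed)
  static final ValueLayout.OfLong LE_LONG =
      ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

  public static void main(String[] args) throws Exception {
    Path file = Path.of(args[0]); // placeholder: any existing, non-empty file
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
        Arena arena = Arena.ofShared()) {
      // map the whole file as one shared segment, usable from multiple threads
      MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);

      // single-value reads via ValueLayout (instead of raw var handles)
      byte firstByte = seg.get(ValueLayout.JAVA_BYTE, 0);
      long firstLong = seg.get(LE_LONG, 0);

      // System.arraycopy()-like bulk copy from the segment onto the Java heap
      byte[] onHeap = new byte[(int) Math.min(seg.byteSize(), 1024)];
      MemorySegment.copy(seg, ValueLayout.JAVA_BYTE, 0, onHeap, 0, onHeap.length);

      System.out.println(firstByte + " " + firstLong + " copied=" + onHeap.length);
    } // closing the arena unmaps the segment deterministically
  }
}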
This module more or less overcomes several problems:
- The ByteBuffer API is limited to 32-bit addressing (in fact MMapDirectory has to chunk files into 1 GiB portions)
- There is no official way to unmap ByteBuffers when the file is no longer used. There is a way to use sun.misc.Unsafe and forcefully unmap segments, but any IndexInput accessing the file from another thread will crash the JVM with SIGSEGV or SIGBUS. We learned to live with that and we happily apply the unsafe unmapping, but that's the main issue.
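For reference, the forceful-unmapping hack mentioned above boils down to something like the following sketch (the real MMapDirectory code does this more defensively, but the effect is the same):

import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import sun.misc.Unsafe;

public class ForcefulUnmapSketch {
  // Forcefully releases a mapping right away instead of waiting for GC.
  // If any other thread still reads from the buffer afterwards, the JVM
  // crashes hard with SIGSEGV/SIGBUS - exactly the risk described above.
  static void unmap(MappedByteBuffer buffer) throws ReflectiveOperationException {
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe unsafe = (Unsafe) f.get(null);
    unsafe.invokeCleaner(buffer); // available since JDK 9, lives in jdk.unsupported
  }
}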
@uschindler had many discussions with the team at OpenJDK, and finally, with the third incubator, we have an API that works for Lucene. These were very fruitful discussions (thanks to @mcimadamore!)
With the third incubator we are now finally able to do some tests (especially performance). As this is an incubating module, this PR first changes the build system a bit:
- disable -Werror for :lucene:core
- add the incubating module to the compiler of :lucene:core and enable it for all test builds. This is important, as you have to pass --add-modules jdk.incubator.foreign also at runtime!
The code basically just modifies MMapDirectory to use long instead of int for the chunk size parameter. In addition it adds MemorySegmentIndexInput, which is a copy of our ByteBufferIndexInput (still there, but unused) that uses MemorySegment instead of ByteBuffer behind the scenes. It works in exactly the same way; only the try/catch blocks for supporting EOFException or moving to another segment were rewritten.
It passes all tests and it looks like you can use it to read indexes. The default chunk size is now 16 GiB (but you can raise or lower it as you like; tests are doing this). Of course you can set it to Long.MAX_VALUE, in which case every index file is always mapped into one big memory mapping. My testing on Windows 10 has shown that this is not a good idea! Huge mappings fragment the address space over time, and as we can only use around 43 to 46 bits of it (depending on the OS), the fragmentation will at some point kill you. So 16 GiB looks like a good compromise: most files will be smaller than 6 GiB anyway (unless you optimize your index down to one huge segment), so for most Lucene installations the number of memory-mapped segments will equal the number of open files, and heavy consumers like Elasticsearch users will be very happy. The sysctl max_map_count may not need to be touched anymore.
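As a hypothetical usage sketch (assuming the constructor overload with a long chunk size that this PR describes), choosing a custom chunk size would look roughly like this:

import java.nio.file.Path;
import org.apache.lucene.store.MMapDirectory;

public class ChunkSizeSketch {
  public static void main(String[] args) throws Exception {
    // default in this PR is 16 GiB; here we ask for 32 GiB chunks instead
    long maxChunkSize = 32L * 1024 * 1024 * 1024;
    try (MMapDirectory dir = new MMapDirectory(Path.of(args[0]), maxChunkSize)) {
      // open IndexReader / IndexSearcher on "dir" as usual
      System.out.println("opened " + dir);
    }
  }
}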
In addition, this implements readLongs in a better way than @jpountz did (no caching or arbitrary objects). The new foreign-vector APIs will in the future also be written with MemorySegment in focus, so you can allocate a vector view on a MemorySegment and let the vectorizer work fully outside the Java heap, directly inside our mmapped files! :-)
It would be good if you could check out this branch and try it in production.
According to speed tests it should be as fast as the current MMapDirectory, partially even faster, because less switching between byte buffers is needed. With recent optimizations, long-based absolute access in loops should also be faster.
But be aware:
- You need JDK 11 or JDK 17 to run Gradle (set JAVA_HOME to it)
- You need JDK 18-ea-b26 (set RUNTIME_JAVA_HOME to it)
- The lucene-core.jar will consist of JDK 18 class files and requires JDK 18 to execute
- You also need to add --add-modules jdk.incubator.foreign to the command line of your Java program/Solr server/Elasticsearch server
It would be good to get some benchmarks, especially by @rmuir or @mikemccand. Take your time and enjoy the complexity of setting this up! ;-)
My plan is the following:
- report any bugs or slowness, especially with Hotspot optimizations. The last time I talked to Maurizio, he talked about Hotspot not being able to fully optimize for-loops with long instead of int, so it may take some time until the full performance is there.
- wait until the final version of project PANAMA-foreign goes into Java's core library (java.base, no module needed anymore)
- add an MR-JAR for lucene-core.jar and compile MemorySegmentIndexInput and maybe some helper classes with JDK 18/19 (hopefully?)
- add a self-standing, JDK-18-compiled module as an external JAR. This can be added to the classpath or module-path and be used by Elasticsearch or Solr. I will work on a Lucene-external project to do this.
In contrast to previous drafts, the branch was squashed into one commit. This makes review easier.
Take your time and enjoy the complexity of setting this up! ;-)
LOL! OK I will try to test this @uschindler :)
OK, thank you @uschindler and @rmuir for helping me debug the tricky setup! I ran this perf.py using luceneutil:
import sys
sys.path.insert(0, '/l/util/src/python')
import competition

if __name__ == '__main__':
    sourceData = competition.sourceData()
    comp = competition.Competition()
    checkout = 'trunk'
    checkoutNewMMap = 'trunk-new-mmap'
    index = comp.newIndex(checkout, sourceData, numThreads=12, addDVFields=True, verbose=True,
                          grouping=False, useCMS=True,
                          javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp',
                          analyzer = 'StandardAnalyzerNoStopWords',
                          facets = (('taxonomy:Date', 'Date'),
                                    ('taxonomy:Month', 'Month'),
                                    ('taxonomy:DayOfYear', 'DayOfYear'),
                                    ('taxonomy:RandomLabel.taxonomy', 'RandomLabel'),
                                    ('sortedset:Month', 'Month'),
                                    ('sortedset:DayOfYear', 'DayOfYear'),
                                    ('sortedset:RandomLabel.sortedset', 'RandomLabel')))
    comp.competitor('base', checkout, index=index,
                    javacCommand='/opt/jdk-18-ea-28/bin/javac',
                    javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')
    comp.competitor('new-mmap', checkoutNewMMap, index=index,
                    javacCommand='/opt/jdk-18-ea-28/bin/javac',
                    javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')
    comp.benchmark('new-mmap')
I set my JAVA_HOME to JDK 17 (17.0.1+12-LTS-39) and RUNTIME_JAVA_HOME to JDK 18-ea-b28 (18-ea+28-1975). I used git commit 119c7c29ae697a52c91116f2414f973509830267 from Lucene main, and then @uschindler's branch behind this PR.
Here's the results after 20 JVM iterations:
Task QPS base StdDevQPS new-mmap StdDev Pct diff p-value
BrowseMonthSSDVFacets 8.07 (12.6%) 7.18 (13.4%) -11.0% ( -32% - 17%) 0.008
BrowseMonthTaxoFacets 4.67 (5.7%) 4.33 (2.6%) -7.2% ( -14% - 1%) 0.000
BrowseRandomLabelSSDVFacets 5.34 (6.6%) 5.08 (6.4%) -4.9% ( -16% - 8%) 0.017
IntNRQ 49.91 (7.0%) 48.07 (2.3%) -3.7% ( -12% - 6%) 0.026
PKLookup 126.62 (4.6%) 122.06 (3.4%) -3.6% ( -11% - 4%) 0.005
BrowseDayOfYearSSDVFacets 7.46 (12.8%) 7.28 (16.8%) -2.5% ( -28% - 31%) 0.598
Respell 25.49 (1.1%) 24.97 (1.2%) -2.1% ( -4% - 0%) 0.000
Fuzzy1 40.18 (1.5%) 39.52 (1.4%) -1.7% ( -4% - 1%) 0.000
Fuzzy2 31.18 (1.8%) 30.67 (1.5%) -1.6% ( -4% - 1%) 0.002
HighSloppyPhrase 19.11 (5.7%) 18.99 (5.2%) -0.6% ( -10% - 10%) 0.710
Wildcard 59.01 (6.8%) 58.89 (6.9%) -0.2% ( -13% - 14%) 0.926
LowSloppyPhrase 14.92 (3.7%) 14.92 (3.4%) 0.0% ( -6% - 7%) 0.978
MedSloppyPhrase 117.00 (3.7%) 117.28 (3.2%) 0.2% ( -6% - 7%) 0.829
MedTermDayTaxoFacets 22.39 (3.3%) 22.51 (4.2%) 0.5% ( -6% - 8%) 0.649
Prefix3 62.59 (5.3%) 62.99 (5.8%) 0.6% ( -9% - 12%) 0.713
BrowseRandomLabelTaxoFacets 3.93 (3.9%) 3.95 (6.3%) 0.7% ( -9% - 11%) 0.669
LowTerm 678.95 (3.2%) 684.44 (4.4%) 0.8% ( -6% - 8%) 0.505
OrHighMed 61.65 (2.9%) 62.22 (2.1%) 0.9% ( -3% - 6%) 0.252
AndHighHighDayTaxoFacets 5.64 (4.5%) 5.70 (4.1%) 1.0% ( -7% - 10%) 0.450
OrHighHigh 16.45 (3.1%) 16.63 (2.3%) 1.1% ( -4% - 6%) 0.220
MedPhrase 157.72 (2.1%) 159.52 (2.5%) 1.1% ( -3% - 5%) 0.117
HighPhrase 110.71 (3.9%) 112.10 (2.7%) 1.3% ( -5% - 8%) 0.237
OrHighLow 270.14 (3.2%) 274.07 (3.0%) 1.5% ( -4% - 7%) 0.135
HighTermTitleBDVSort 7.37 (3.7%) 7.49 (3.2%) 1.5% ( -5% - 8%) 0.170
AndHighHigh 44.95 (5.4%) 45.63 (4.6%) 1.5% ( -7% - 12%) 0.336
HighSpanNear 7.27 (6.4%) 7.39 (5.2%) 1.6% ( -9% - 14%) 0.390
BrowseDayOfYearTaxoFacets 4.37 (7.5%) 4.45 (9.8%) 1.8% ( -14% - 20%) 0.512
AndHighMedDayTaxoFacets 63.88 (2.6%) 65.05 (1.3%) 1.8% ( -2% - 5%) 0.005
BrowseDateTaxoFacets 4.37 (7.6%) 4.45 (10.0%) 1.8% ( -14% - 20%) 0.513
TermDTSort 379.61 (2.6%) 386.94 (2.2%) 1.9% ( -2% - 6%) 0.011
OrHighMedDayTaxoFacets 5.48 (3.4%) 5.59 (4.5%) 2.0% ( -5% - 10%) 0.113
MedSpanNear 3.79 (2.3%) 3.86 (3.7%) 2.0% ( -3% - 8%) 0.042
HighTermDayOfYearSort 1151.05 (4.4%) 1174.57 (6.2%) 2.0% ( -8% - 13%) 0.227
AndHighMed 56.38 (5.3%) 57.64 (5.9%) 2.2% ( -8% - 14%) 0.208
HighTerm 976.99 (6.7%) 1002.21 (6.8%) 2.6% ( -10% - 17%) 0.225
LowIntervalsOrdered 12.43 (4.8%) 12.77 (5.2%) 2.8% ( -6% - 13%) 0.079
LowSpanNear 9.60 (2.4%) 9.87 (1.4%) 2.8% ( 0% - 6%) 0.000
OrHighNotMed 598.12 (4.1%) 614.79 (4.2%) 2.8% ( -5% - 11%) 0.034
HighTermMonthSort 42.77 (14.2%) 44.03 (19.5%) 3.0% ( -26% - 42%) 0.584
MedIntervalsOrdered 29.73 (4.0%) 30.68 (4.5%) 3.2% ( -5% - 12%) 0.017
OrNotHighHigh 555.82 (3.9%) 573.67 (4.3%) 3.2% ( -4% - 11%) 0.013
HighIntervalsOrdered 4.36 (6.5%) 4.50 (5.9%) 3.3% ( -8% - 16%) 0.094
OrHighNotLow 699.58 (5.0%) 723.40 (5.0%) 3.4% ( -6% - 14%) 0.031
OrNotHighMed 511.29 (3.9%) 529.02 (3.6%) 3.5% ( -3% - 11%) 0.004
OrNotHighLow 419.51 (3.9%) 434.62 (2.6%) 3.6% ( -2% - 10%) 0.000
LowPhrase 241.42 (3.2%) 250.97 (2.1%) 4.0% ( -1% - 9%) 0.000
OrHighNotHigh 562.96 (3.9%) 585.87 (3.9%) 4.1% ( -3% - 12%) 0.001
AndHighLow 293.83 (5.5%) 306.09 (1.8%) 4.2% ( -2% - 12%) 0.001
MedTerm 1022.47 (6.6%) 1066.29 (4.4%) 4.3% ( -6% - 16%) 0.015
SSDV and Taxo facets maybe got a bit slower, and lots of queries got a bit faster.
This was the merged CPU profile results for this new mmap impl:
PROFILE SUMMARY from 894683 events (total: 894683)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
4.27% 38211 org.apache.lucene.index.SingletonSortedNumericDocValues#nextDoc()
4.15% 37164 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
3.56% 31835 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$20#ordValue()
2.93% 26214 org.apache.lucene.util.packed.DirectReader$DirectPackedReader20#get()
2.87% 25641 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$4#longValue()
2.47% 22090 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
2.43% 21784 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
2.17% 19392 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$3#longValue()
2.10% 18801 org.apache.lucene.search.ConjunctionDISI#doNext()
2.10% 18781 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
1.97% 17597 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#skipPositions()
1.93% 17238 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBoundsSmall()
1.85% 16576 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
1.81% 16231 org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
1.74% 15561 org.apache.lucene.queries.intervals.OrderedIntervalsSource$OrderedIntervalIterator#nextInterval()
1.73% 15498 org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#readByte()
1.53% 13721 jdk.internal.misc.ScopedMemoryAccess#getIntUnalignedInternal()
1.49% 13317 jdk.internal.foreign.AbstractMemorySegmentImpl#isSet()
1.38% 12362 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
1.34% 12016 org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
1.16% 10395 org.apache.lucene.search.TermScorer#score()
1.16% 10338 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()
1.10% 9856 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
1.01% 9014 org.apache.lucene.queries.intervals.IntervalFilter#nextInterval()
0.96% 8580 jdk.internal.foreign.SharedScope#checkValidState()
0.93% 8349 org.apache.lucene.index.SingletonSortedSetDocValues#getValueCount()
0.90% 8020 org.apache.lucene.search.ScoreCachingWrappingScorer#score()
0.86% 7654 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$DenseNumericDocValues#advance()
0.82% 7361 org.apache.lucene.queries.spans.SpanScorer#setFreqCurrentDoc()
0.82% 7328 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
versus baseline CPU JFR profiler results:
PROFILE SUMMARY from 894453 events (total: 894453)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
5.93% 53070 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$20#ordValue()
4.26% 38078 org.apache.lucene.index.SingletonSortedNumericDocValues#nextDoc()
3.84% 34318 org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
3.65% 32685 jdk.internal.misc.Unsafe#convEndian()
2.86% 25554 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$4#longValue()
2.74% 24483 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#nextPosition()
2.64% 23617 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
2.18% 19515 org.apache.lucene.search.ConjunctionDISI#doNext()
2.17% 19373 org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
2.12% 18958 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$3#longValue()
1.93% 17298 org.apache.lucene.util.packed.DirectReader$DirectPackedReader20#get()
1.93% 17258 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#advance()
1.82% 16284 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$EverythingEnum#skipPositions()
1.75% 15647 org.apache.lucene.search.TermScorer#score()
1.71% 15292 org.apache.lucene.codecs.lucene90.ForUtil#expand8()
1.67% 14979 org.apache.lucene.queries.intervals.OrderedIntervalsSource$OrderedIntervalIterator#nextInterval()
1.65% 14744 org.apache.lucene.store.ByteBufferGuard#ensureValid()
1.57% 14061 org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
1.15% 10247 org.apache.lucene.queries.spans.TermSpans#nextStartPosition()
1.14% 10222 java.util.Objects#checkIndex()
1.12% 9990 java.nio.Buffer#scope()
1.06% 9459 org.apache.lucene.store.ByteBufferGuard#getByte()
0.98% 8724 org.apache.lucene.queries.intervals.IntervalFilter#nextInterval()
0.91% 8179 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
0.88% 7906 org.apache.lucene.search.ScoreCachingWrappingScorer#score()
0.88% 7867 org.apache.lucene.store.ByteBufferIndexInput#buildSlice()
0.87% 7823 org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$DenseNumericDocValues#advance()
0.87% 7789 org.apache.lucene.store.ByteBufferGuard#getInt()
0.84% 7518 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#increment()
0.74% 6639 org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
It's curious how costly SingletonSortedNumericDocValues#nextDoc is. I think these facet fields are dense.
Thanks Mike, I saw similar results a month ago.
It is very important to do the following:
- don't disable tiered compilation and don't enable batch
- use large index and long lifetime of a single JVM
Reason: Panama uses many modern language features like var handles that involve costly runtime library bootstraps (such as dynamically generated bytecode). If you disable tiered compilation, the JIT can't adapt to this, because the optimization work is split between the runtime library and Hotspot. This only works well with tiered compilation enabled.
Also, here are the heap JFR results for base:
PROFILE SUMMARY from 2423 events (total: 94219M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
9.76% 9191M org.apache.lucene.util.FixedBitSet#<init>()
6.93% 6527M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
6.07% 5721M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
5.77% 5437M org.apache.lucene.facet.FacetsConfig#stringToPath()
5.20% 4896M java.util.concurrent.locks.AbstractQueuedSynchronizer#acquire()
3.65% 3436M perf.StatisticsHelper#startStatistics()
3.65% 3436M java.util.concurrent.CopyOnWriteArrayList#iterator()
3.46% 3255M java.lang.StringUTF16#compress()
3.39% 3195M java.util.ArrayList#grow()
2.97% 2800M org.apache.lucene.util.BytesRef#utf8ToString()
2.91% 2743M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#<init>()
2.69% 2534M jdk.internal.misc.Unsafe#allocateUninitializedArray()
2.54% 2396M java.util.AbstractList#iterator()
2.33% 2199M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
2.20% 2070M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
2.01% 1889M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
1.82% 1718M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
1.81% 1709M org.apache.lucene.util.BytesRef#<init>()
1.48% 1392M org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()
1.29% 1219M org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
1.25% 1176M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
1.23% 1159M org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$TermsDict#decompressBlock()
1.20% 1133M jdk.internal.foreign.MappedMemorySegmentImpl#dup()
1.14% 1073M org.apache.lucene.search.BooleanScorer#<init>()
1.03% 970M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
1.03% 970M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
0.97% 910M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getArc()
0.88% 833M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
0.88% 833M org.apache.lucene.codecs.lucene90.ForUtil#<init>()
0.87% 816M
and for new-mmap:
PROFILE SUMMARY from 2424 events (total: 96906M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
10.17% 9854M org.apache.lucene.util.FixedBitSet#<init>()
6.96% 6743M org.apache.lucene.facet.FacetsConfig#stringToPath()
6.84% 6631M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegment()
5.16% 4999M java.util.concurrent.locks.AbstractQueuedSynchronizer#acquire()
4.18% 4054M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
3.58% 3470M org.apache.lucene.util.BytesRef#utf8ToString()
3.55% 3436M perf.StatisticsHelper#startStatistics()
3.55% 3436M java.util.concurrent.CopyOnWriteArrayList#iterator()
3.49% 3379M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#<init>()
2.77% 2688M java.util.ArrayList#grow()
2.71% 2628M java.lang.StringUTF16#compress()
2.54% 2465M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
2.15% 2087M java.util.AbstractList#iterator()
2.05% 1984M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
1.94% 1881M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
1.84% 1786M jdk.internal.misc.Unsafe#allocateUninitializedArray()
1.74% 1683M java.nio.DirectByteBufferR#duplicate()
1.65% 1598M org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()
1.47% 1426M org.apache.lucene.util.BytesRef#<init>()
1.40% 1357M java.nio.DirectByteBufferR#asLongBuffer()
1.27% 1228M org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
1.24% 1202M org.apache.lucene.queryparser.charstream.FastCharStream#refill()
1.24% 1202M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
1.15% 1116M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
1.13% 1090M org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
1.10% 1065M org.apache.lucene.store.ByteBufferIndexInput#newCloneInstance()
1.05% 1013M java.nio.DirectByteBufferR#slice()
0.91% 884M org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
0.88% 850M org.apache.lucene.search.BooleanScorer#<init>()
0.86% 833M
It is very important to do the following:
- don't disable tiered compilation and don't enable batch
- use large index and long lifetime of a single JVM
Thanks @uschindler -- I think I am not disabling tiered compilation.
I run 20 iterations of each task, but it is relatively quick (~13-14 seconds per JVM instance). I can retry with more per-task iterations to see if the tiered compilation can improve things later on.
Here's the full JVM command of base and competitor:
iter 19
new-mmap:
log: /l/logs.nightly/prefix.new-mmap.19 + stdout
run: /opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/\
python/profiling.jfc,filename=/l/util/bench-search-prefix-new-mmap-19.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk-new-mmap/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/\
trunk-new-mmap/lucene/core/build/classes/java/test:/l/trunk-new-mmap/lucene/sandbox/build/classes/java/main:/l/trunk-new-mmap/lucene/misc/build/classes/java/main:/l/trunk-new-mmap/lucene/facet/build/classes/java/main\
:/l/trunk-new-mmap/lucene/analysis/common/build/classes/java/main:/l/trunk-new-mmap/lucene/analysis/icu/build/classes/java/main:/l/trunk-new-mmap/lucene/queryparser/build/classes/java/main:/l/trunk-new-mmap/lucene/gr\
ouping/build/classes/java/main:/l/trunk-new-mmap/lucene/suggest/build/classes/java/main:/l/trunk-new-mmap/lucene/highlighter/build/classes/java/main:/l/trunk-new-mmap/lucene/codecs/build/classes/java/main:/l/trunk-ne\
w-mmap/lucene/queries/build/classes/java/main:/home/mike/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.0/fcc952fb6d378266b943bef9f15e67a4d45cfa88/hppc-0.9.0.jar:/l/util/lib/HdrHistogram.jar:/l/util/bui\
ld perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/wikimediumall.trunk.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.taxonomy:RandomLabel.taxonomy.sortedset:Month.sortedset:DayOfYear.sorteds\
et:RandomLabel.sortedset.Lucene90.Lucene90.dvfields.nd27.625M -facets taxonomy:Date;Date -facets taxonomy:Month;Month -facets taxonomy:DayOfYear;DayOfYear -facets taxonomy:RandomLabel.taxonomy;RandomLabel -facets sor\
tedset:Month;Month -facets sortedset:DayOfYear;DayOfYear -facets sortedset:RandomLabel.sortedset;RandomLabel -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/tasks/wikimedium.10M.nostopwords.tasks -searchThr\
eadCount 6 -taskRepeatCount 20 -field body -tasksPerCat 1 -staticSeed -291966 -seed -5054409 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs.nightly/prefix.new-mmap.19 -topN 10\
-pk
13.1 s
base:
log: /l/logs.nightly/prefix.base.19 + stdout
run: /opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/\
python/profiling.jfc,filename=/l/util/bench-search-prefix-base-19.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/\
core/build/classes/java/test:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/ja\
va/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/classes/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/tru\
nk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/home/mike/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.0/f\
cc952fb6d378266b943bef9f15e67a4d45cfa88/hppc-0.9.0.jar:/l/util/lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/wikimediumall.trunk.facets.taxonomy:Date.taxonomy:Mon\
th.taxonomy:DayOfYear.taxonomy:RandomLabel.taxonomy.sortedset:Month.sortedset:DayOfYear.sortedset:RandomLabel.sortedset.Lucene90.Lucene90.dvfields.nd27.625M -facets taxonomy:Date;Date -facets taxonomy:Month;Month -fa\
cets taxonomy:DayOfYear;DayOfYear -facets taxonomy:RandomLabel.taxonomy;RandomLabel -facets sortedset:Month;Month -facets sortedset:DayOfYear;DayOfYear -facets sortedset:RandomLabel.sortedset;RandomLabel -analyzer St\
andardAnalyzerNoStopWords -taskSource /l/util/tasks/wikimedium.10M.nostopwords.tasks -searchThreadCount 6 -taskRepeatCount 20 -field body -tasksPerCat 1 -staticSeed -291966 -seed -5054409 -similarity BM25Similarity -\
commit multi -hiliteImpl FastVectorHighlighter -log /l/logs.nightly/prefix.base.19 -topN 10 -pk
13.7 s
Hi @dweiss, I'm so sorry. I just merged yesterday but didn't run the tests. The test is obsolete on the branch.
No worries - just tweak whatever you want. I just didn't want jenkins to complain over and over.
I corrected the test to assert the jdk.incubator.foreign module. But as this won't compile without it, the test is in fact obsolete. For easier merging in the future I left it in. I also modified the previous JDK 17 and JDK 16 pull requests.
Next step is to merge up the changes from yesterday.
I re-ran benchmarks with more iterations per task (200 vs 20 before) to let hotspot have more time to optimize in each of the 20 JVMs:
Task QPS base StdDevQPS new-mmap StdDev Pct diff p-value
BrowseMonthTaxoFacets 4.71 (6.2%) 4.39 (3.3%) -6.8% ( -15% - 2%) 0.000
BrowseMonthSSDVFacets 7.86 (14.8%) 7.34 (10.9%) -6.7% ( -28% - 22%) 0.104
HighTermMonthSort 59.25 (12.5%) 57.42 (11.8%) -3.1% ( -24% - 24%) 0.420
PKLookup 138.18 (1.1%) 135.03 (1.6%) -2.3% ( -4% - 0%) 0.000
Fuzzy1 45.86 (1.3%) 45.15 (1.5%) -1.6% ( -4% - 1%) 0.000
HighSloppyPhrase 7.90 (5.1%) 7.78 (5.1%) -1.6% ( -11% - 9%) 0.332
Fuzzy2 38.89 (1.1%) 38.30 (1.3%) -1.5% ( -3% - 0%) 0.000
MedPhrase 14.90 (2.7%) 14.72 (3.6%) -1.2% ( -7% - 5%) 0.230
Respell 36.40 (1.2%) 36.05 (1.5%) -0.9% ( -3% - 1%) 0.029
LowPhrase 21.59 (2.1%) 21.43 (2.5%) -0.7% ( -5% - 3%) 0.304
BrowseDayOfYearTaxoFacets 4.58 (12.3%) 4.56 (10.4%) -0.6% ( -20% - 25%) 0.861
BrowseRandomLabelTaxoFacets 4.06 (8.1%) 4.04 (7.5%) -0.6% ( -14% - 16%) 0.817
BrowseDateTaxoFacets 4.57 (12.2%) 4.55 (10.5%) -0.4% ( -20% - 25%) 0.904
OrHighHigh 18.30 (3.4%) 18.22 (2.5%) -0.4% ( -6% - 5%) 0.663
Wildcard 42.52 (13.2%) 42.55 (12.7%) 0.1% ( -22% - 29%) 0.989
AndHighHighDayTaxoFacets 21.35 (2.1%) 21.36 (1.8%) 0.1% ( -3% - 4%) 0.926
OrHighMed 91.02 (3.1%) 91.16 (2.4%) 0.2% ( -5% - 5%) 0.860
MedTermDayTaxoFacets 22.74 (5.0%) 22.78 (3.0%) 0.2% ( -7% - 8%) 0.902
HighIntervalsOrdered 3.79 (2.9%) 3.80 (3.3%) 0.2% ( -5% - 6%) 0.814
IntNRQ 93.38 (8.2%) 93.60 (1.0%) 0.2% ( -8% - 10%) 0.897
HighTermTitleBDVSort 4.38 (5.9%) 4.40 (4.6%) 0.3% ( -9% - 11%) 0.841
OrHighMedDayTaxoFacets 8.05 (3.2%) 8.08 (4.2%) 0.3% ( -6% - 8%) 0.770
AndHighHigh 23.23 (2.8%) 23.31 (3.5%) 0.4% ( -5% - 6%) 0.705
AndHighMedDayTaxoFacets 42.68 (2.2%) 42.87 (1.6%) 0.5% ( -3% - 4%) 0.455
TermDTSort 667.66 (1.7%) 670.71 (1.3%) 0.5% ( -2% - 3%) 0.342
AndHighMed 98.30 (3.1%) 98.84 (3.3%) 0.5% ( -5% - 7%) 0.589
Prefix3 33.28 (3.4%) 33.47 (4.3%) 0.6% ( -6% - 8%) 0.645
OrHighLow 404.97 (3.6%) 407.82 (2.1%) 0.7% ( -4% - 6%) 0.446
LowSloppyPhrase 62.46 (2.6%) 62.91 (2.1%) 0.7% ( -3% - 5%) 0.334
OrNotHighHigh 713.35 (3.2%) 719.56 (2.5%) 0.9% ( -4% - 6%) 0.332
LowTerm 951.78 (1.9%) 960.95 (1.7%) 1.0% ( -2% - 4%) 0.093
MedSloppyPhrase 28.24 (2.4%) 28.57 (2.3%) 1.2% ( -3% - 5%) 0.115
MedTerm 1173.70 (2.1%) 1187.71 (2.7%) 1.2% ( -3% - 6%) 0.121
OrNotHighMed 576.27 (2.1%) 583.63 (1.7%) 1.3% ( -2% - 5%) 0.035
HighPhrase 201.44 (3.4%) 204.22 (1.7%) 1.4% ( -3% - 6%) 0.101
HighTermDayOfYearSort 1822.34 (2.1%) 1847.93 (1.5%) 1.4% ( -2% - 5%) 0.015
BrowseRandomLabelSSDVFacets 5.22 (7.4%) 5.30 (7.2%) 1.5% ( -12% - 17%) 0.515
OrHighNotLow 811.72 (3.1%) 824.52 (2.7%) 1.6% ( -4% - 7%) 0.086
HighTerm 963.95 (3.3%) 979.31 (3.6%) 1.6% ( -5% - 8%) 0.142
LowIntervalsOrdered 55.56 (3.7%) 56.49 (3.3%) 1.7% ( -5% - 8%) 0.133
MedIntervalsOrdered 12.72 (3.9%) 12.95 (3.7%) 1.8% ( -5% - 9%) 0.141
MedSpanNear 7.21 (3.7%) 7.37 (3.2%) 2.1% ( -4% - 9%) 0.049
OrHighNotHigh 781.73 (2.7%) 799.12 (2.7%) 2.2% ( -3% - 7%) 0.009
AndHighLow 514.79 (4.0%) 526.55 (1.3%) 2.3% ( -2% - 7%) 0.015
LowSpanNear 22.27 (2.6%) 22.80 (2.2%) 2.4% ( -2% - 7%) 0.002
OrNotHighLow 477.71 (5.1%) 489.15 (1.3%) 2.4% ( -3% - 9%) 0.042
OrHighNotMed 913.86 (2.3%) 935.89 (2.3%) 2.4% ( -2% - 7%) 0.001
HighSpanNear 14.42 (4.9%) 14.83 (3.9%) 2.8% ( -5% - 12%) 0.044
BrowseDayOfYearSSDVFacets 7.33 (21.1%) 7.79 (17.2%) 6.3% ( -26% - 56%) 0.301
Thanks Mike. It is much more stable now (std dev) and on average around 0% difference. We should figure out why it gets faster in some parts while slower in others.
What is different:
- some parts use direct access off-heap
- some parts mainly copy byte arrays between mmap and heap and do the work on-heap
From what I have learned, copy operations have high overhead because:
- they are not hot, so they aren't optimized as quickly
- when not optimized, the setup cost is high (lots of class checks to get the array type, the decision about swapping bytes). This is especially heavy for small arrays.
When discussing this with Robert, it looked like it might be better to just use a simple copy loop. This affects long[] arrays, as those are < 64 entries. We can test this easily by commenting out the copy methods for floats and longs, so the code falls back to the default implementation in IndexInput.
I just had no time to test this.
But in the long term we should do everything off-heap, especially the vector stuff. For that we need to change IndexInput to allow returning FloatVector or LongVector instances backed by off-heap memory. The default implementation would just copy as before and return a view.
But that needs to wait until vector API goes out of incubator and preview phases.
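To illustrate the bulk-copy vs. copy-loop trade-off discussed above, here is a minimal sketch. It uses the finalized java.lang.foreign names (JDK 21+) rather than the incubator API, and is not the actual MemorySegmentIndexInput code.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.util.Arrays;

public class SmallCopySketch {
  static final ValueLayout.OfLong LE_LONG =
      ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

  // bulk variant: a single copy call, but with a fixed per-call setup cost
  static void readLongsBulk(MemorySegment seg, long pos, long[] dst, int off, int len) {
    MemorySegment.copy(seg, LE_LONG, pos, dst, off, len);
  }

  // loop variant: trivial per-element gets; may win for very small len (e.g. < 64)
  static void readLongsLoop(MemorySegment seg, long pos, long[] dst, int off, int len) {
    for (int i = 0; i < len; i++) {
      dst[off + i] = seg.get(LE_LONG, pos + (long) i * Long.BYTES);
    }
  }

  public static void main(String[] args) {
    try (Arena arena = Arena.ofConfined()) {
      MemorySegment seg = arena.allocate(64L * Long.BYTES);
      long[] a = new long[8], b = new long[8];
      readLongsBulk(seg, 0, a, 0, a.length);
      readLongsLoop(seg, 0, b, 0, b.length);
      System.out.println(Arrays.equals(a, b)); // both strategies read the same bytes
    }
  }
}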
From what I have learned, copy operations have high overhead because:
- they are not hot, so aren't optimized so fast
- when not optimized, the setup cost is high (lots of class checks to get array type, decision for swapping bytes). This is especially heavy for small arrays.
Hi, I'm not sure as to why copy operations should be slower in the memory access API than with the ByteBuffer API. I would expect most of the checks to be similar (except for the liveness tests of the segment involved). I do recall that the ByteBuffer API does optimize bulk copy for very small buffers (I don't recall what the limit is, but it was very very low, like 4 elements or something).
In principle, this JVM fix (in JDK 18) should help too: https://bugs.openjdk.java.net/browse/JDK-8269119
I'm working on a similar approach for my data store, but I'm currently not sure if it's a good idea for multiple readers plus a single reader/writer to map a segment for each reader. I guess the OS will then share the mapped regions/pages between the mapped memory segments? I'm not sure if it's the same approach in Lucene, i.e. whether you'd create multiple IndexInputs for multiple index readers, because you also seem to have a clone method (but it will fail once the segments are closed by one reader).
On another note, what's your take on this (Andy and Victor are real geniuses regarding database systems)? http://cidrdb.org/cidr2022/papers/p13-crotty.pdf
I'm working on a similar approach for my data store, but I'm currently not sure if it's a good idea for multiple readers plus a single reader/writer to map a segment for each reader. I guess the OS will then share the mapped regions/pages between the mapped memory segments? I'm not sure if it's the same approach in Lucene, i.e. whether you'd create multiple IndexInputs for multiple index readers, because you also seem to have a clone method (but it will fail once the segments are closed by one reader).
This PR does not change anything in Lucene's current behaviour. The code using MappedByteBuffer behaves the same way, and there are also no multiple mappings. If a user opens several IndexReaders on the same index, that's not our fault; well-behaved Lucene code only opens a single IndexReader.
The clone() method is used for several threads. There is no remapping; we only refcount the ResourceScope with Panama. If you close the main index, the clones used by different threads really should fail then - that's the improvement here.
On another note, what's your take on this (Andy and Victor are real geniuses regarding database systems)? http://cidrdb.org/cidr2022/papers/p13-crotty.pdf
We don't agree with that for Lucene:
- the model behind Lucene is different: all files are write-once, so there are no updates to files that were written before. mmap is only used on files that are never changed anymore, and paging works very well with those.
- we do not write with mmap; Lucene index files are written with standard output streams
So, if the writer adds something to the Lucene index (not via mmap), new index readers will create a new IndexInput with new mapped memory segments, plus clones with the same segments for different threads, right? Isn't it a valid use case here to have multiple index readers, or are you supposed to close the index reader and the clones first?
In my case (SirixDB) you're supposed to create multiple read-only transactions bound to a specific revision if you want to use different threads, for instance (currently I share a memory segment). However, the writer appends to the data file, and a new read-only trx thus has to get a new mapping of the file (to read the most recently committed revision). Thus, either a new memory segment must be set for all readers, guarded with a Semaphore for instance, or I'll have to use multiple memory segments, but I guess that's a bad approach.
Either way it would be similar (multiple readers and a single writer which only ever appends data). In your case you're cloning the IndexInput and appending to the data file(s) without a memory mapping. However, it's not clear to me if it's a valid use case for Lucene to have multiple index readers, but you suggest that it's not the approach to use. But how do you make sure that new index readers (cloned or not) will see all the changes?
quoting from Uwe:
All files are write-once so there are no updates to files which were written before.
This is the key piece that I think you are missing. We write files once, that's it. No appending to them after the fact or anything like that. Any new changes will be written to new, different files.
Oh right, thanks. That's the big difference.
Thanks @rmuir for the clarification.
To add to this, because @mcimadamore also asked: we use shared segments because we only allocate and map each segment once. It is then used by multiple threads. The IndexReader opens every new file only once and then mmaps it. Several search threads may access the mapped files concurrently (very small files are kept on heap until the next commit; this is what NRTCachingDirectory does). Every search thread may use a clone of the IndexInput, because the IndexInput has per-thread state (like the read position), but the underlying memory segments are reused, and the IndexReader only closes the "main" IndexInput. Any clones then become invalid.
On changes to the index and after the final commit, new files are written to disk and fsynced (including the directory metadata). The IndexReader gets reopened, mmaps the new files it sees, and releases old, no longer used ones by closing them.
Any thread that still accesses already closed files will get an AlreadyClosedException (previously such access may have segfaulted due to the forceful unmapping of MappedByteBuffer). With MMapDirectory using MemorySegments, this is detected as an IllegalStateException, transformed into AlreadyClosedException, and seen by the search threads. So all is sane.
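The exception translation described above boils down to a pattern like the following (an illustrative sketch, not the exact MemorySegmentIndexInput code):

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import org.apache.lucene.store.AlreadyClosedException;

final class GuardedReadSketch {
  private final MemorySegment segment;

  GuardedReadSketch(MemorySegment segment) {
    this.segment = segment;
  }

  byte readByte(long pos) {
    try {
      return segment.get(ValueLayout.JAVA_BYTE, pos);
    } catch (IllegalStateException e) {
      // the shared scope/arena owning the segment was closed by the "main" IndexInput
      throw new AlreadyClosedException("Already closed: this IndexInput cannot be used anymore");
    }
  }
}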
Thanks for your great explanation. Makes a lot of sense.
Do you know if the MAP_SHARED flag is set for mapped memory segments? I guess this means that even if I open a few mapped memory segments on the same file (even if they only have overlapping regions and the segments grow in size as the file grows), the virtual address space and the loaded pages will be shared when opened in the same FileChannel.MapMode (READ_ONLY), for instance. I also think setting madvise for random index accesses would be an advantage in my case, as the main trie index access pattern might be random.
MAP_SHARED
Share this mapping. Updates to the mapping are visible to other
processes mapping the same region, and (in the case of file-
backed mappings) are carried through to the underlying file.
(To precisely control when updates are carried through to the
underlying file requires the use of msync(2).)
Hi, IIRC the SHARED flag should be set - that said, with the foreign API it is also possible to define custom memory-mapped segments, if some of the defaults picked by the JDK are not suitable. A few months ago I put together a Gist [1] to illustrate that; I have updated it to reflect the new API changes. Perhaps Lucene might, one day, take advantage of this.
[1] - https://gist.github.com/mcimadamore/128ee904157bb6c729a10596e69edffd
Closing this as the JDK 19 impl was merged (#912).