MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)
INFO: This is a follow-up of #518: it's the same code base, but with the API changes from JDK 19 applied.
This is just a draft PR for a first look at memory-mapping improvements in JDK 19+.
Some background information: Starting with JDK 14, there is a new incubating module "jdk.incubator.foreign" with a new, not yet stable API for accessing off-heap memory (later it will also support calling functions located in libraries like .so or .dll files using classical MethodHandles). This incubator module has several versions:
- first version: https://openjdk.java.net/jeps/370 (slow, very buggy, and thread-confined, making it unusable for Lucene)
- second version: https://openjdk.java.net/jeps/383 (still thread-confined, but now allows transfer of "ownership" to other threads; this is still impossible to use with Lucene)
- third version in JDK 16: https://openjdk.java.net/jeps/393 (this version included "support for shared segments"). This finally allows us to safely use the same mmapped memory from different threads and also to unmap it! This was implemented in the previous pull request #173.
- fourth version in JDK 17: https://openjdk.java.net/jeps/412. This mainly changes the API around scopes. Instead of having segments explicitly made "shared", we assign them to a resource scope which controls their behaviour. The resource scope is created once per IndexInput instance (not per clone) and owns all segments. When the resource scope is closed, all segments become invalid - and we throw AlreadyClosedException. The big problem is slowness due to heavy use of new instances just to copy memory between segments and the Java heap. This drives the garbage collector crazy. This was implemented in previous PR #177.
- fifth version in JDK 18, included in build 26: https://openjdk.java.net/jeps/419. This mainly cleans up the API. From Lucene's perspective the MemorySegment API now has System.arraycopy()-like APIs to copy memory between heap and memory segments. This improves speed. It also handles byte-swapping automatically. This version of the PR also uses ValueLayout instead of var handles, as it makes the code more readable and type-safe.
- sixth version in JDK 19, included with build 23: https://openjdk.java.net/jeps/424 (current version). This version moves the whole incubation API to a stable "preview API". Some classes were renamed and bugs (e.g. on Windows with huge mappings) were resolved. A minimal sketch of this JDK 19 API is shown right after this list.
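To make the above concrete, here is a minimal, hedged sketch of the JDK 19 preview API (java.lang.foreign) as described in the fifth and sixth items: mapping a file into a MemorySegment owned by a shared MemorySession, reading a value through a ValueLayout (the byte order is part of the layout, so no manual swapping), and doing a System.arraycopy()-like bulk copy into a heap array. Class and method names follow the JDK 19 javadocs (they differ in earlier and later JDKs), and compiling it requires --enable-preview; PanamaReadSketch itself is just an illustrative name.

import java.io.IOException;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.MemorySession;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PanamaReadSketch {
  static void read(Path file) throws IOException {
    try (MemorySession session = MemorySession.openShared();
        FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
      // mmap the whole file into one segment owned by the shared session
      MemorySegment seg = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size(), session);

      // single-value access; byte order is part of the layout
      ValueLayout.OfLong leLong = ValueLayout.JAVA_LONG.withOrder(ByteOrder.LITTLE_ENDIAN);
      long first = seg.get(leLong, 0); // assumes the file has at least 8 bytes
      System.out.println("first long = " + first);

      // System.arraycopy()-like bulk copy from the segment into a heap array
      long[] dst = new long[16];
      MemorySegment.copy(seg, leLong, 0, dst, 0, dst.length); // assumes >= 128 bytes
    } // closing the session unmaps the file; later accesses throw an exception
  }
}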
This new preview API more or less overcomes several problems:
- The ByteBuffer API is limited to 32 bits (in fact MMapDirectory has to chunk files into 1 GiB portions).
- There is no official way to unmap ByteBuffers when the file is no longer used. There is a way to use sun.misc.Unsafe and forcefully unmap segments (see the sketch after this list), but any IndexInput accessing the file from another thread will then crash the JVM with SIGSEGV or SIGBUS. We learned to live with that and happily apply the unsafe unmapping, but that's the main issue.
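For illustration, this is a hedged sketch of the kind of forceful unmapping hack referred to above (not Lucene's exact code): obtain sun.misc.Unsafe reflectively and call its invokeCleaner(ByteBuffer) method (available since Java 9). It frees the mapping immediately, but any thread still reading the buffer afterwards may take down the JVM.

import java.lang.reflect.Field;
import java.nio.ByteBuffer;

final class UnsafeUnmapSketch {
  static void unmap(ByteBuffer mappedBuffer) throws ReflectiveOperationException {
    Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
    Field f = unsafeClass.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Object unsafe = f.get(null);
    // void invokeCleaner(ByteBuffer directBuffer) exists since Java 9; it frees the
    // mapping right away - concurrent readers of the buffer may SIGSEGV/SIGBUS
    unsafeClass.getMethod("invokeCleaner", ByteBuffer.class).invoke(unsafe, mappedBuffer);
  }
}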
@uschindler had many discussions with the team at OpenJDK and finally, with the third incubator, we have an API that works with Lucene. The discussions were very fruitful (thanks to @mcimadamore!)
With the third incubator we are now finally able to do some tests (especially performance).
The code basically just modifies MMapDirectory to use long instead of int for the chunk size parameter. In addition it adds MemorySegmentIndexInput, which is a copy of our ByteBufferIndexInput (still there, but unused), but uses MemorySegment instead of ByteBuffer behind the scenes. It works in exactly the same way, only the try/catch blocks for supporting EOFException or moving to another segment were rewritten.
It passes all tests and it looks like you can use it to read indexes. The default chunk size is now 16 GiB (but you can raise or lower it as you like; the tests do this). Of course you can set it to Long.MAX_VALUE, in which case every index file is always mapped as one big memory mapping - my testing on Windows 10 has shown that this is not a good idea!!! Huge mappings fragment the address space over time and, as we can only use about 43 or 46 bits (depending on the OS), the fragmentation will at some point kill you. So 16 GiB looks like a good compromise: most files will be smaller than 6 GiB anyway (unless you optimize your index down to one huge segment), so for most Lucene installations the number of mapped segments will equal the number of open files. Heavy Elasticsearch users will be very happy, and the sysctl max_map_count may not need to be touched anymore. A short usage sketch follows below.
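A usage sketch, under the assumption from this PR that the maxChunkSize constructor parameter is now a long (the class name ChunkSizeSketch is just illustrative):

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.store.MMapDirectory;

class ChunkSizeSketch {
  static MMapDirectory open(Path indexPath) throws IOException {
    // default chunk size (16 GiB with the MemorySegment provider, per this PR)
    return new MMapDirectory(indexPath);
    // an explicit chunk size would be: new MMapDirectory(indexPath, 1L << 30) for 1 GiB;
    // Long.MAX_VALUE maps every file in one piece, which fragments the address space
  }
}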
In addition, this implements readLongs in a better way than @jpountz's version (no caching or arbitrary objects). The new foreign-vector APIs will in the future also be written with MemorySegment as their focus, so you can allocate a vector view on a MemorySegment and let the vectorizer work fully outside the Java heap, inside our mmapped files! :-)
It would be good if you could check out this branch and try it in production.
According to speed tests it should be as fast as the current MMapDirectory, and partially faster because less switching between byte buffers is needed. With recent optimizations, long-based absolute access in loops should also be faster.
But be aware:
- You need JDK 17 to run Gradle (set JAVA_HOME to it).
- You need JDK-19-ea+23 to compile (set JAVA19_HOME to it).
- You need JDK-19-ea+23 to test (set RUNTIME_JAVA_HOME to it). You can still test with any older version, so RUNTIME_JAVA_HOME is optional, but then it will use MappedByteBuffer.
- The JAR file is a multi-release JAR (MR-JAR).
- You also need to add --enable-preview to the command line of your Java program/Solr server/Elasticsearch server. If you do not add this option, Lucene will silently use the old byte buffers and print a warning. At the moment, when Java 19 preview mode is detected, it also prints a log message stating so.
It would be good to get some benchmarks, especially by @rmuir or @mikemccand. Take your time and enjoy the complexity of setting this up! ;-)
I updated the PR to use an MR-JAR (multi-release JAR) to implement the new functionality. It also uses Gradle's toolchain support to automatically download the JDK 19 release just to compile the MR part. This is commented out at the moment, because the Gradle toolchain API does not support EA releases.
The cool thing: once OpenJDK 19 is out of the door in September, we can enable the toolchain support and then also backport this patch to Lucene 9.x.
The MR-JAR part works the following way:
- The MappedByteBuffer part and all the Java 9+ hacks with Unsafe are moved to a new pkg-private interface MMapIndexInputProvider.
- In the MR-JAR part for Java 19 there's a separate implementation of MMapIndexInputProvider, compiled with --enable-preview.
- The main MMapDirectory has a static initializer that tries to load and link the Java 19 MemorySegmentIndexInputProvider. It uses method handles for that because reflection would need setAccessible(). If loading fails, it logs a warning if Java 19 is used but --enable-preview is missing on Java's command line. In all cases it falls back to the default MappedByteBufferIndexInputProvider implementation in the main part of the JAR file.
The detection logic is here: https://github.com/apache/lucene/blob/a74055870de529cb14793a863ae84fd53795d517/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L325-L364
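For readers who don't want to click through, here is a hedged sketch of the lookup pattern described above (the real code is behind the link and differs in detail): load the Java 19 provider class by name, instantiate it via MethodHandles instead of reflection, and fall back on any failure. ProviderLookupSketch and the plain Object return type are stand-ins.

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class ProviderLookupSketch {
  static Object lookupProvider() {
    try {
      // the class only loads if its preview-flagged class file is accepted,
      // i.e. we are on Java 19 with --enable-preview
      Class<?> cls =
          Class.forName("org.apache.lucene.store.MemorySegmentIndexInputProvider");
      // method handles instead of reflection, so no setAccessible() is needed
      // (the real lookup runs from MMapDirectory, which is in the same package)
      MethodHandle constr =
          MethodHandles.lookup().findConstructor(cls, MethodType.methodType(void.class));
      return constr.invoke();
    } catch (Throwable t) {
      // typical causes: class absent (JDK < 19) or UnsupportedClassVersionError
      // because --enable-preview is missing; log a warning and fall back to the
      // MappedByteBuffer-based provider (not shown here)
      return null;
    }
  }
}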
I also added a distribution test that checks that the built JAR file is a correct MR-JAR and that it also works with the module system.
The current PR still requires Java 19 as RUNTIME_JAVA_HOME because of missing toolchain support. To test the "legacy" byte buffer implementation, one can comment out --enable-preview in the test's JVM args.
Hi @dweiss: Could you have a look at what I am planning, especially the MR-JAR and how I intend to use the Gradle toolchain API to compile against Java 19 in the future?
If you know a solution for how to use EA releases with the toolchain API, tell me. I gave up on this. The URLs for AdoptOpenJDK are hardcoded in Gradle and there's no way to select EA releases in the JvmSpec, see: https://github.com/gradle/gradle/issues/14515 and https://github.com/gradle/gradle/issues/14814
It is now possible to compile and run tests with different JVMs:
If RUNTIME_JAVA_HOME does not point to exactly Java 19, it won't compile and will give an error message. This is needed until we have toolchain support.
To compile, you may pass JAVA19_HOME (this is what Jenkins does). You can then still randomize and run the tests against Java 17 (the default, no RUNTIME_JAVA_HOME) or any other Java version. The compile JVM must be exactly a Java 19 version, otherwise compilation will fail, too.
With this, Policeman Jenkins is now doing a full test of all combinations and makes sure the MR-JAR and the provider approach work on all platforms.
Hi Uwe. I recall I had problems with toolchains - some things just refused to work (or we couldn't customize them? Can't remember precisely what the problem was). I looked at the code you wrote and it seems all right at a glance. Some things will be awkward, like "rewinding" the javaHome set in other places... perhaps it'd be good to not set it there at all (move the toolchain declaration and runtime java selection into a single script) - can't say what will turn out to be better.
Hi Dawid, thanks for the feedback. Indeed the toolchain API is very limited and in my honest opinion far from useful for Lucene's usages. There's no way to configure anything inside the Gradle files. Things like auto-download, provider URLs, or locations of preinstalled JDKs like on Jenkins cannot be put into the build files. It is also not possible to tell the system to download EA releases. It is all a disaster. I will start a Twitter thread about this tomorrow.
The toolchain API should be flexible and allow plugins to configure their own repository types. In general, toolchains should be standard configurations and dependencies in Gradle. The way it is now is a dead end: inflexible, buggy, and it looks like it was written by a student as a bachelor thesis in isolation.
I would wish to use a configuration "toolkit" like any other and add the dependency on the Java version there. Compile tasks would then use it automatically. Plugins could implement repositories in the same way as Maven or Ivy, e.g. one plugin for AdoptOpenJDK/Eclipse and another one for EA releases of OpenJDK.
About your comments: I wanted to keep the toolchain JVM hacks in one place because they currently only affect Lucene Core. Once we have a release of Java 19 we can simplify that (see the code commented out). Tasks other than JavaCompile and Jar are not affected. Javadocs are not built from custom source sets, and ecj/forbiddenapis/errorprone do not work at all, so I disabled them. It is just 2 classes to compile, which are package private. If we get more MR-JAR stuff in the future, we may think of better ways. The current approach is a separate "singleton" in the whole build. Maybe it could be made generic, but the combination with EA releases makes this impossible at the moment.
On top of that, preview APIs are another big problem. The generated class files have a special minor version in the class file format marking them as preview. This makes the class loader picky. The code here won't run with Java 20; we will need a separate MR-JAR section for 20 and then one for the final release of Panama in LTS 21 with unmarked class files. If the APIs do not change, we may simplify this by patching the class file version to remove the preview flag and only keep the Java 19 class files (it's a single byte at a fixed position, not even ASM is needed); a sketch of this idea follows below.
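A hedged sketch of that patching idea (a hypothetical helper, not part of this PR): preview class files are marked by the minor_version field right after the 0xCAFEBABE magic being 0xFFFF (see JEP 12); clearing it would turn them into regular Java 19 class files. This is only safe if the final API is identical to the preview API.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class StripPreviewFlagSketch {
  static void strip(Path classFile) throws IOException {
    byte[] b = Files.readAllBytes(classFile);
    // class file layout: bytes 0-3 magic 0xCAFEBABE, bytes 4-5 minor_version,
    // bytes 6-7 major_version; minor_version 0xFFFF marks a preview class file
    if ((b[4] & 0xFF) == 0xFF && (b[5] & 0xFF) == 0xFF) {
      b[4] = 0;
      b[5] = 0;
      Files.write(classFile, b);
    }
  }
}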
You may read more about preview APIs in the jep, it's a mess to maintain: https://openjdk.java.net/jeps/12
At the moment I have also tried this PR with Luke:
- If you start Luke with Java 17, all works as before; it won't see the new classes and loads the byte buffer impl.
- If you start with Java 19 (without --enable-preview), it will find the new classes, but the class loader will reject loading them. Lucene will print a warning (you see it in Luke's log window and console). It will use the byte buffer impl then, so all stays as is.
- If you start Luke with Java 19 and --enable-preview, it will use the new impl. So this works only with opt-in, which is good for preview APIs.
- If you start with a possible future Java 20, with or without preview enabled, it will reject the class files and fall back to our old byte buffer impl. In that case it logs a warning to update Lucene. 😉
So this looks like it is working well, and people can optionally test the new mmap with Java 19 if they opt in (Solr or Elasticsearch).
Thanks for the explanation, Uwe. In the meantime I recalled what it was about toolchains that was a showstopper - indeed it was the possibility to specify a toolchain pointing at a particular (local) installation - such as a custom-compiled JDK. I couldn't find a way to do it. If you take a look at alternative-jdk-support, it even has a comment about it:
// I failed to set it up leveraging Gradle's toolchains because
// a toolchain spec is not flexible enough to provide an exact location of the JVM to be used;
// if you have two identical JVM lang. versions in auto-discovered JVMs, an arbitrary one is used (?).
// This situation is not uncommon when debugging low-level stuff (hand-compiled JVM binaries).
Eh.
Thanks. Here we only need to compile against a specific exact version (19), no testing needed (because the user has JDK 17 or 11 when building Lucene). So for the MR-JAR task this should be fine with auto-provisioning of JDK 19.
Testing of the Java 19 parts is done by Jenkins. Normal developers won't need that (unless they work directly on the MR-JAR parts). We may think about running some very basic tests of MMapDirectory with the auto-provisioned JVM, but for global tests: it's Jenkins' turn!
Testing MR-JAR files is still a complex issue.
Thanks @mocobeta, I just need to not forget to re-enable the GitHub jobs in September when this gets merged (at that time, we can also use auto-provisioning of JDK 19).
JDK 19 was released; I am working on the toolchain support for compiling the MR-JAR. At the moment, the commented-out code does not yet work, as AdoptOpenJDK / Temurin have not done a release yet: https://github.com/adoptium/adoptium/issues/171, https://github.com/adoptium/adoptium/issues/170
Current output:
Starting a Gradle Daemon (subsequent builds will be faster)
Directory 'C:\Users\Uwe Schindler\.gradle\daemon\7.3.3\(custom paths)' (system property 'org.gradle.java.installations.paths') used for java installations does not exist
> Task :errorProneSkipped
WARNING: errorprone disabled (skipped on builds not running inside CI environments, pass -Pvalidation.errorprone=true to enable)
> Task :lucene:core:compileMain19Java FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':lucene:core:compileMain19Java'.
> Error while evaluating property 'javaCompiler' of task ':lucene:core:compileMain19Java'
> Failed to calculate the value of task ':lucene:core:compileMain19Java' property 'javaCompiler'.
> Unable to download toolchain matching these requirements: {languageVersion=19, vendor=ADOPTOPENJDK, implementation=vendor-specific}
> Unable to download toolchain. This might indicate that the combination (version, architecture, release/early access, ...) for the requested JDK is not available.
> Could not read 'https://api.adoptopenjdk.net/v3/binary/latest/19/ga/windows/x64/jdk/hotspot/normal/adoptopenjdk' as it does not exist.
OpenJDK 19 was released by Eclipse Temurin: https://adoptium.net/de/temurin/releases?version=19
Thanks @gdams and @adoptium.
@msokolov: I will merge this to main, 9.x and 9.4 after adding the CHANGES.txt entry. When building the release you may need to remove your gradle.properties file in Lucene's directory. I will post a message.
@uschindler Just curious - did you get any feedback re. performance? Thanks
Hi @mcimadamore
@uschindler Just curious - did you get any feedback re. performance? Thanks
No official feedback yet. But Elasticsearch and OpenSearch seem to be benchmarking. Unfortunately end users have not used it on their side yet, because no release of Elasticsearch/OpenSearch using Lucene 9.4 has been done yet. So we may only get feedback from pure Lucene users, and there are not many.
For Apache Solr, the latest release was also without Lucene 9.4. In former times this was easier, because at least Lucene and Solr were released together, but that's no longer the case. So we have to wait.
I will ask @mikemccand to at least enable --enable-preview on the nightly pure Lucene benchmark by default (and use JDK 19).
I will ask @mikemccand to at least enable --enable-preview on the nightly pure Lucene benchmark by default (and use JDK 19).
Ack -- I'll enable this starting from tonite's run.
@mikemccand Maybe do this in a 2-step process:
- First update to Java 19, so we get a clean state
- Then add --enable-preview
If we do both at the same time, we won't see a difference between the old and new Lucene MMAP (on the same version). A JDK upgrade may also change other performance numbers.
If we do both at same time, we won't see a difference between old and new Lucene MMAP (on same version). A JDK upgrade may also change other performance numbers.
Ack -- I turned on just the JDK 19 upgrade last night!
PKLookup looks a bit happy about the upgrade, but otherwise I don't see other tasks showing much impact.
I'll leave this for a few days, then turn on --enable-preview.
If we do both at same time, we won't see a difference between old and new Lucene MMAP (on same version). A JDK upgrade may also change other performance numbers.
Ack -- I turned on just the JDK 19 upgrade last night! PKLookup looks a bit happy about the upgrade, but otherwise I don't see other tasks showing much impact. I'll leave this for a few days, then turn on --enable-preview.
OK, looks fine thanks!
Once you switch --enable-preview on, you should no longer see stack traces like this in the profile:
1.84% 209556 jdk.internal.misc.Unsafe#convEndian()
at jdk.internal.misc.Unsafe#getIntUnaligned()
at jdk.internal.misc.ScopedMemoryAccess#getIntUnalignedInternal()
at jdk.internal.misc.ScopedMemoryAccess#getIntUnaligned()
at java.nio.DirectByteBuffer#getInt()
at java.nio.DirectByteBuffer#getInt()
at org.apache.lucene.store.ByteBufferGuard#getInt()
at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl#readInt()
They should instead be visible as calls from MemorySegmentIndexInput to MemorySegment.get(); ByteBufferIndexInput calling DirectByteBuffer and so on should disappear.
Please do not wonder why "convEndian" is always on top of the call stack: the method is always called, even when the endianness already matches. It is just a no-op on x86.
After Mike switched to preview mode the results look good. The speed with MemorySegmentIndexInput is similar to the old ByteBuffer code.
https://home.apache.org/~mikemccand/lucenebench/
See change EZ. Of course what we can't test is highly concurrent reopening of the index and therefore closing shared MemorySessions at a high rate.
CC @mcimadamore
@uschindler FYI it looks like this change made indexing ~12% faster with BEST_SPEED on the stored fields benchmark: http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
@uschindler FYI it looks like this change made indexing ~12% faster with BEST_SPEED on the stored fields benchmark: http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
That is interesting. I did not look at indexing, because we write there. I think this might be because bulk merging was faster, am I correct? The new code is possibly cheaper when copying large parts from/to heap.
I haven't dug, but your reasoning could explain why we are seeing a similar speedup across both BEST_SPEED and BEST_COMPRESSION, since they benefit from the bulk merging optimization in a similar way.
It does not fully explain it, because the copy of memory between file and heap uses Unsafe#copyMemory in both cases. The only difference is that for ByteBuffer there is some hardcoded limit of 8 or 16 bytes (not exactly sure, I have to look it up): if the byte[] is not larger than that, it uses a copy loop. MemorySegment does not have this loop. This may explain why some vector stuff got a bit slower: when you copy a very short long[] from off-heap to heap, the ByteBuffer code will just read those longs in a sequence and not use Unsafe#copyMemory, which has overhead for small arrays.
In the future it would be great if Panama provided a method in OutputStream like stream.writeBytes(MemorySegment segment, long offset, long count). This would allow us to not copy at all and use a dd-like approach. We could then add IndexInput#copyTo to our API. I think we worked on something similar, but for copying, a completely off-heap approach would be best.
One issue with "fancy" copies (e.g. for bulk merging) is that we can't make them super-fancy anyway (e.g. zero-copy from fd to fd), as we still have to calculate the CRC32 checksum.
Yes you're right!
Maybe at some point the JDK will allow calling Checksum#update(MemorySegment, ...). Then we wouldn't need to copy multiple times; the checksum would be calculated on the off-heap slice directly. Until then, something like the sketch below is needed.
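A hedged sketch of what has to happen today without such an API (SegmentChecksumSketch is an illustrative name): checksumming an off-heap segment either copies into heap arrays or goes through ByteBuffer views, which are limited to 2 GiB and therefore require a slicing loop.

import java.lang.foreign.MemorySegment;
import java.util.zip.CRC32;

final class SegmentChecksumSketch {
  static long crc32(MemorySegment segment) {
    CRC32 crc = new CRC32();
    long remaining = segment.byteSize();
    long offset = 0;
    while (remaining > 0) {
      long len = Math.min(remaining, Integer.MAX_VALUE);
      // asByteBuffer() can only view up to 2 GiB, hence the slicing loop
      crc.update(segment.asSlice(offset, len).asByteBuffer());
      offset += len;
      remaining -= len;
    }
    return crc.getValue();
  }
}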
I have the feeling that the JDK people are planning to add MemorySegment in more places in Java's public API, at least everywhere I/O is involved and bulk processing of memory is required. They have already started for vectors.
@uschindler It looks like this benchmark has a noticeable slowdown with this change? https://home.apache.org/~mikemccand/lucenebench/MedTermDayTaxoFacets.html
Could be. It may be caused by too much copying of very small arrays (especially float[] and long[]) between off-heap and heap. For an explanation see above.
I don't think this is a huge problem because it mainly comes from missing optimizations in the JVM and not our code. It could look different in Java 20. My only guess where it might come from: the overhead of copyMemory is huge for small arrays. ByteBuffer, LongBuffer, and FloatBuffer internally loop for very short arrays. We could try to call super.readLongs() if the array is short (same for floats). But before doing this I would like to figure out what exactly is slow.
It would help if I could get a profile only from this test, or if somebody could explain which MMapDirectory operations are mainly used in those facets.
Code similar to this: https://github.com/openjdk/jdk/blob/37848a9ca2ab3021e7b3b2e112bab4631fbe1d99/src/java.base/share/classes/java/nio/X-Buffer.java.template#L929
The if statement uses a simple copy loop to read the values if the length is too short. We can do the same in our code by calling super to read into the array. It would be worth a try.
I was hoping that the JDK people would do it the same way for MemorySegment. A sketch of the threshold idea follows below.
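A hedged sketch of that threshold idea (hypothetical names and a made-up threshold, not the PR's code): read very short arrays value by value and only delegate larger reads to the bulk copy that ultimately uses Unsafe#copyMemory. Alignment and segment-boundary handling are elided.

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

final class ShortArrayReadSketch {
  private static final ValueLayout.OfLong LE_LONG =
      ValueLayout.JAVA_LONG.withOrder(ByteOrder.LITTLE_ENDIAN);
  private static final int BULK_THRESHOLD = 16; // made-up value, needs benchmarking

  static void readLongs(MemorySegment seg, long pos, long[] dst, int offset, int len) {
    if (len < BULK_THRESHOLD) {
      // simple copy loop for tiny arrays, like the ByteBuffer code does
      for (int i = 0; i < len; i++) {
        dst[offset + i] = seg.get(LE_LONG, pos + (long) i * Long.BYTES);
      }
    } else {
      // bulk copy, which ultimately uses Unsafe#copyMemory
      MemorySegment.copy(seg, LE_LONG, pos, dst, offset, len);
    }
  }
}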