MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)
INFO: This is a follow-up of #518: it's the same code base, but with the API changes from JDK 19 applied.
This is just a draft PR for a first look at memory-mapping improvements in JDK 19+.
Some background information: Starting with JDK 14, there is a new incubating module "jdk.incubator.foreign" with a new, not yet stable API for accessing off-heap memory (later it will also support calling functions located in libraries like .so or .dll files using classical MethodHandles). This incubator module has several versions:
- first version: https://openjdk.java.net/jeps/370 (slow, very buggy, and thread-confined, making it unusable for Lucene)
- second version: https://openjdk.java.net/jeps/383 (still thread-confined, but now allows transfer of "ownership" to other threads; this is still impossible to use with Lucene)
- third version in JDK 16: https://openjdk.java.net/jeps/393 (this version included "support for shared segments"). This finally allows us to safely use the same mmapped memory from different threads and also to unmap it! This was implemented in the previous pull request #173.
- fourth version in JDK 17: https://openjdk.java.net/jeps/412. This mainly changes the API around scopes. Instead of having segments explicitly made "shared", we assign them to a resource scope which controls their behaviour. The resource scope is created once per IndexInput instance (not per clone) and owns all segments. When the resource scope is closed, all segments become invalid - and we throw AlreadyClosedException. The big problem is slowness due to heavy use of new instances just to copy memory between segments and the Java heap. This drives the garbage collector crazy. This was implemented in previous PR #177.
- fifth version in JDK 18, included in build 26: https://openjdk.java.net/jeps/419. This mainly cleans up the API. From Lucene's perspective the MemorySegment API now has System.arraycopy()-like APIs to copy memory between heap and memory segments. This improves speed. It also handles byte-swapping automatically. This version of the PR also uses ValueLayout instead of var handles, as it makes the code more readable and type-safe.
- sixth version in JDK 19, included with build 23: https://openjdk.java.net/jeps/424 (current version). This version moves the whole incubation API to a stable "preview API". Some classes were renamed and bugs (e.g. on Windows with huge mappings) were resolved. A minimal sketch of this JDK 19 API is shown right after this list.
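To make the above concrete, here is a minimal, hedged sketch of the JDK 19 preview API (java.lang.foreign) as described in the fifth and sixth items: mapping a file into a MemorySegment owned by a shared MemorySession, reading a value through a ValueLayout (the byte order is part of the layout, so no manual swapping), and doing a System.arraycopy()-like bulk copy into a heap array. Class and method names follow the JDK 19 javadocs (they differ in earlier and later JDKs), and compiling it requires --enable-preview; PanamaReadSketch itself is just an illustrative name.

import java.io.IOException;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.MemorySession;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PanamaReadSketch {
  static void read(Path file) throws IOException {
    try (MemorySession session = MemorySession.openShared();
        FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
      // mmap the whole file into one segment owned by the shared session
      MemorySegment seg = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size(), session);

      // single-value access; byte order is part of the layout
      ValueLayout.OfLong leLong = ValueLayout.JAVA_LONG.withOrder(ByteOrder.LITTLE_ENDIAN);
      long first = seg.get(leLong, 0); // assumes the file has at least 8 bytes
      System.out.println("first long = " + first);

      // System.arraycopy()-like bulk copy from the segment into a heap array
      long[] dst = new long[16];
      MemorySegment.copy(seg, leLong, 0, dst, 0, dst.length); // assumes >= 128 bytes
    } // closing the session unmaps the file; later accesses throw an exception
  }
}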
This new preview API more or less overcomes several problems:
- The ByteBuffer API is limited to 32 bits (in fact MMapDirectory has to chunk files into 1 GiB portions).
- There is no official way to unmap ByteBuffers when the file is no longer used. There is a way to use sun.misc.Unsafe and forcefully unmap segments (see the sketch after this list), but any IndexInput accessing the file from another thread will then crash the JVM with SIGSEGV or SIGBUS. We learned to live with that and happily apply the unsafe unmapping, but that's the main issue.
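For illustration, this is a hedged sketch of the kind of forceful unmapping hack referred to above (not Lucene's exact code): obtain sun.misc.Unsafe reflectively and call its invokeCleaner(ByteBuffer) method (available since Java 9). It frees the mapping immediately, but any thread still reading the buffer afterwards may take down the JVM.

import java.lang.reflect.Field;
import java.nio.ByteBuffer;

final class UnsafeUnmapSketch {
  static void unmap(ByteBuffer mappedBuffer) throws ReflectiveOperationException {
    Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
    Field f = unsafeClass.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Object unsafe = f.get(null);
    // void invokeCleaner(ByteBuffer directBuffer) exists since Java 9; it frees the
    // mapping right away - concurrent readers of the buffer may SIGSEGV/SIGBUS
    unsafeClass.getMethod("invokeCleaner", ByteBuffer.class).invoke(unsafe, mappedBuffer);
  }
}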
@uschindler had many discussions with the team at OpenJDK and finally, with the third incubator, we have an API that works with Lucene. The discussions were very fruitful (thanks to @mcimadamore!)
With the third incubator we are now finally able to do some tests (especially performance).
The code basically just modifies MMapDirectory to use long instead of int for the chunk size parameter. In addition it adds MemorySegmentIndexInput, which is a copy of our ByteBufferIndexInput (still there, but unused), but uses MemorySegment instead of ByteBuffer behind the scenes. It works in exactly the same way, only the try/catch blocks for supporting EOFException or moving to another segment were rewritten.
It passes all tests and it looks like you can use it to read indexes. The default chunk size is now 16 GiB (but you can raise or lower it as you like; the tests do this). Of course you can set it to Long.MAX_VALUE, in which case every index file is always mapped as one big memory mapping - my testing on Windows 10 has shown that this is not a good idea!!! Huge mappings fragment the address space over time and, as we can only use about 43 or 46 bits (depending on the OS), the fragmentation will at some point kill you. So 16 GiB looks like a good compromise: most files will be smaller than 6 GiB anyway (unless you optimize your index down to one huge segment), so for most Lucene installations the number of mapped segments will equal the number of open files. Heavy Elasticsearch users will be very happy, and the sysctl max_map_count may not need to be touched anymore. A short usage sketch follows below.
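A usage sketch, under the assumption from this PR that the maxChunkSize constructor parameter is now a long (the class name ChunkSizeSketch is just illustrative):

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.store.MMapDirectory;

class ChunkSizeSketch {
  static MMapDirectory open(Path indexPath) throws IOException {
    // default chunk size (16 GiB with the MemorySegment provider, per this PR)
    return new MMapDirectory(indexPath);
    // an explicit chunk size would be: new MMapDirectory(indexPath, 1L << 30) for 1 GiB;
    // Long.MAX_VALUE maps every file in one piece, which fragments the address space
  }
}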
In addition, this implements readLongs in a better way than @jpountz's version (no caching or arbitrary objects). The new foreign-vector APIs will in the future also be written with MemorySegment as their focus, so you can allocate a vector view on a MemorySegment and let the vectorizer work fully outside the Java heap, inside our mmapped files! :-)
It would be good if you could check out this branch and try it in production.
According to speed tests it should be as fast as the current MMapDirectory, and partially faster because less switching between byte buffers is needed. With recent optimizations, long-based absolute access in loops should also be faster.
But be aware:
- You need JDK 17 to run Gradle (set JAVA_HOME to it).
- You need JDK-19-ea+23 to compile (set JAVA19_HOME to it).
- You need JDK-19-ea+23 to test (set RUNTIME_JAVA_HOME to it). You can still test with any older version, so RUNTIME_JAVA_HOME is optional, but then it will use MappedByteBuffer.
- The JAR file is a multi-release JAR (MR-JAR).
- You also need to add --enable-preview to the command line of your Java program/Solr server/Elasticsearch server. If you do not add this option, Lucene will silently use the old byte buffers and print a warning. At the moment, when Java 19 preview mode is detected, it also prints a log message stating so.
It would be good to get some benchmarks, especially by @rmuir or @mikemccand. Take your time and enjoy the complexity of setting this up! ;-)
I updated the PR to use an MR-JAR (multi-release JAR) to implement the new functionality. It also uses Gradle's toolchain support to automatically download the JDK 19 release just to compile the MR part. This is commented out at the moment, because the Gradle toolchain API does not support EA releases.
The cool thing: once OpenJDK 19 is out of the door in September, we can enable the toolchain support and then also backport this patch to Lucene 9.x.
The MR-JAR part works the following way:
- The MappedByteBuffer part and all the Java 9+ hacks with Unsafe are moved to a new pkg-private interface MMapIndexInputProvider.
- In the MR-JAR part for Java 19 there's a separate implementation of MMapIndexInputProvider, compiled with --enable-preview.
- The main MMapDirectory has a static initializer that tries to load and link the Java 19 MemorySegmentIndexInputProvider. It uses method handles for that because reflection would need setAccessible(). If loading fails, it logs a warning if Java 19 is used but --enable-preview is missing on Java's command line. In all cases it falls back to the default MappedByteBufferIndexInputProvider implementation in the main part of the JAR file.
The detection logic is here: https://github.com/apache/lucene/blob/a74055870de529cb14793a863ae84fd53795d517/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L325-L364
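For readers who don't want to click through, here is a hedged sketch of the lookup pattern described above (the real code is behind the link and differs in detail): load the Java 19 provider class by name, instantiate it via MethodHandles instead of reflection, and fall back on any failure. ProviderLookupSketch and the plain Object return type are stand-ins.

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class ProviderLookupSketch {
  static Object lookupProvider() {
    try {
      // the class only loads if its preview-flagged class file is accepted,
      // i.e. we are on Java 19 with --enable-preview
      Class<?> cls =
          Class.forName("org.apache.lucene.store.MemorySegmentIndexInputProvider");
      // method handles instead of reflection, so no setAccessible() is needed
      // (the real lookup runs from MMapDirectory, which is in the same package)
      MethodHandle constr =
          MethodHandles.lookup().findConstructor(cls, MethodType.methodType(void.class));
      return constr.invoke();
    } catch (Throwable t) {
      // typical causes: class absent (JDK < 19) or UnsupportedClassVersionError
      // because --enable-preview is missing; log a warning and fall back to the
      // MappedByteBuffer-based provider (not shown here)
      return null;
    }
  }
}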
I also added a distribution test that checks that the built JAR file is a correct MR-JAR and that it also works with the module system.
The current PR still requires Java 19 as RUNTIME_JAVA_HOME because of missing toolchain support. To test the "legacy" byte buffer implementation, one can comment out --enable-preview in the test's JVM args.
Hi @dweiss: Could you have a look at what I am planning, especially the MR-JAR and how I intend to use the Gradle toolchain API to compile against Java 19 in the future?
If you know a solution for how to use EA releases with the toolchain API, tell me. I gave up on this. The URLs for AdoptOpenJDK are hardcoded in Gradle and there's no way to select EA releases in the JvmSpec, see: https://github.com/gradle/gradle/issues/14515 and https://github.com/gradle/gradle/issues/14814
It is now possible to compile and run tests with different JVMs:
If RUNTIME_JAVA_HOME does not point to exactly Java 19, it won't compile and will give an error message. This is needed until we have toolchain support.
To compile, you may pass JAVA19_HOME (this is what Jenkins does). You can then still randomize and run the tests against Java 17 (the default, no RUNTIME_JAVA_HOME) or any other Java version. The compile JVM must be exactly a Java 19 version, otherwise compilation will fail, too.
With this, Policeman Jenkins is now doing a full test of all combinations and makes sure the MR-JAR and the provider approach work on all platforms.
Hi Uwe. I recall I had problems with toolchains - some things just refused to work (or we couldn't customize them? Can't remember precisely what the problem was). I looked at the code you wrote and it seems all right at a glance. Some things will be awkward, like "rewinding" the javaHome set in other places... perhaps it'd be good to not set it there at all (move the toolchain declaration and runtime java selection into a single script) - can't say what will turn out to be better.
Hi Dawid, thanks for the feedback. Indeed the toolchain API is very limited and in my honest opinion far from useful for Lucene's usages. There's no way to configure anything inside the Gradle files. Things like auto-download, provider URLs, or locations of preinstalled JDKs like on Jenkins cannot be put into the build files. It is also not possible to tell the system to download EA releases. It is all a disaster. I will start a Twitter thread about this tomorrow.
The toolchain API should be flexible and allow plugins to configure their own repository types. In general, toolchains should be standard configurations and dependencies in Gradle. The way it is now is a dead end: inflexible, buggy, and it looks like it was written by a student as a bachelor thesis in isolation.
I would wish to use a configuration "toolkit" like any other and add the dependency on the Java version there. Compile tasks would then use it automatically. Plugins could implement repositories in the same way as Maven or Ivy, e.g. one plugin for AdoptOpenJDK/Eclipse and another one for EA releases of OpenJDK.
About your comments: I wanted to keep the toolchain JVM hacks in one place because they currently only affect Lucene Core. Once we have a release of Java 19 we can simplify that (see the code commented out). Tasks other than JavaCompile and Jar are not affected. Javadocs are not built from custom source sets, and ecj/forbiddenapis/errorprone do not work at all, so I disabled them. It is just 2 classes to compile, which are package private. If we get more MR-JAR stuff in the future, we may think of better ways. The current approach is a separate "singleton" in the whole build. Maybe it could be made generic, but the combination with EA releases makes this impossible at the moment.
On top of that, preview APIs are another big problem. The generated class files have a special minor version in the class file format marking them as preview. This makes the class loader picky. The code here won't run with Java 20; we will need a separate MR-JAR section for 20 and then one for the final release of Panama in LTS 21 with unmarked class files. If the APIs do not change, we may simplify this by patching the class file version to remove the preview flag and only keep the Java 19 class files (it's a single byte at a fixed position, not even ASM is needed); a sketch of this idea follows below.
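A hedged sketch of that patching idea (a hypothetical helper, not part of this PR): preview class files are marked by the minor_version field right after the 0xCAFEBABE magic being 0xFFFF (see JEP 12); clearing it would turn them into regular Java 19 class files. This is only safe if the final API is identical to the preview API.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class StripPreviewFlagSketch {
  static void strip(Path classFile) throws IOException {
    byte[] b = Files.readAllBytes(classFile);
    // class file layout: bytes 0-3 magic 0xCAFEBABE, bytes 4-5 minor_version,
    // bytes 6-7 major_version; minor_version 0xFFFF marks a preview class file
    if ((b[4] & 0xFF) == 0xFF && (b[5] & 0xFF) == 0xFF) {
      b[4] = 0;
      b[5] = 0;
      Files.write(classFile, b);
    }
  }
}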
You may read more about preview APIs in the jep, it's a mess to maintain: https://openjdk.java.net/jeps/12
At the moment I have also tried this PR with Luke:
- If you start Luke with Java 17, all works as before; it won't see the new classes and loads the byte buffer impl.
- If you start with Java 19 (without --enable-preview), it will find the new classes, but the class loader will reject loading them. Lucene will print a warning (you see it in Luke's log window and console). It will use the byte buffer impl then, so all stays as is.
- If you start Luke with Java 19 and --enable-preview, it will use the new impl. So this works only with opt-in, which is good for preview APIs.
- If you start with a possible future Java 20, with or without preview enabled, it will reject the class files and fall back to our old byte buffer impl. In that case it logs a warning to update Lucene. 😉
So this looks like it is working well, and people can optionally test the new mmap with Java 19 if they opt in (Solr or Elasticsearch).
Thanks for the explanation, Uwe. In the meantime I recalled what it was about toolchains that was a showstopper - indeed it was the possibility to specify a toolchain pointing at a particular (local) installation - such as a custom-compiled JDK. I couldn't find a way to do it. If you take a look at alternative-jdk-support, it even has a comment about it:
// I failed to set it up leveraging Gradle's toolchains because
// a toolchain spec is not flexible enough to provide an exact location of the JVM to be used;
// if you have two identical JVM lang. versions in auto-discovered JVMs, an arbitrary one is used (?).
// This situation is not uncommon when debugging low-level stuff (hand-compiled JVM binaries).
Eh.
Thanks. Here we only need to compile against a specific exact version (19), no testing needed (because the user has JDK 17 or 11 when building Lucene). So for the MR-JAR task this should be fine with auto-provisioning of JDK 19.
Testing of the Java 19 parts is done by Jenkins. Normal developers won't need that (unless they work directly on the MR-JAR parts). We may think about running some very basic tests of MMapDirectory with the auto-provisioned JVM, but for global tests: it's Jenkins' turn!
Testing MR-JAR files is still a complex issue.
Thanks @mocobeta, I just need to not forget to re-enable the GitHub jobs in September when this gets merged (at that time, we can also use auto-provisioning of JDK 19).
JDK 19 was released; I am working on the toolchain support for compiling the MR-JAR. At the moment, the commented-out code does not yet work, as AdoptOpenJDK / Temurin have not done a release yet: https://github.com/adoptium/adoptium/issues/171, https://github.com/adoptium/adoptium/issues/170
Current output:
Starting a Gradle Daemon (subsequent builds will be faster)
Directory 'C:\Users\Uwe Schindler\.gradle\daemon\7.3.3\(custom paths)' (system property 'org.gradle.java.installations.paths') used for java installations does not exist
> Task :errorProneSkipped
WARNING: errorprone disabled (skipped on builds not running inside CI environments, pass -Pvalidation.errorprone=true to enable)
> Task :lucene:core:compileMain19Java FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':lucene:core:compileMain19Java'.
> Error while evaluating property 'javaCompiler' of task ':lucene:core:compileMain19Java'
> Failed to calculate the value of task ':lucene:core:compileMain19Java' property 'javaCompiler'.
> Unable to download toolchain matching these requirements: {languageVersion=19, vendor=ADOPTOPENJDK, implementation=vendor-specific}
> Unable to download toolchain. This might indicate that the combination (version, architecture, release/early access, ...) for the requested JDK is not available.
> Could not read 'https://api.adoptopenjdk.net/v3/binary/latest/19/ga/windows/x64/jdk/hotspot/normal/adoptopenjdk' as it does not exist.
OpenJDK 19 was released by Eclipse Temurin: https://adoptium.net/de/temurin/releases?version=19
Thanks @gdams and @adoptium.
@msokolov: I will merge this to main, 9.x and 9.4 after adding the CHANGES.txt entry. When building the release you may need to remove your gradle.properties file in Lucene's directory. I will post a message.
@uschindler Just curious - did you get any feedback re. performance? Thanks
Hi @mcimadamore
@uschindler Just curious - did you get any feedback re. performance? Thanks
No official feedback yet. But Elasticsearch and OpenSearch seem to be benchmarking. Unfortunately end users have not used it on their side yet, because no release of Elasticsearch/OpenSearch using Lucene 9.4 has been done yet. So we may only get feedback from pure Lucene users, and there are not many.
For Apache Solr, the latest release was also without Lucene 9.4. In former times this was easier, because at least Lucene and Solr were released together, but that's no longer the case. So we have to wait.
I will ask @mikemccand to at least enable --enable-preview on the nightly pure Lucene benchmark by default (and use JDK 19).
I will ask @mikemccand to at least enable --enable-preview on the nightly pure Lucene benchmark by default (and use JDK 19).
Ack -- I'll enable this starting from tonite's run.
@mikemccand Maybe do this in a 2-step process:
- First update to Java 19, so we get a clean state
- Then add --enable-preview
If we do both at the same time, we won't see a difference between the old and new Lucene MMAP (on the same version). A JDK upgrade may also change other performance numbers.
If we do both at same time, we won't see a difference between old and new Lucene MMAP (on same version). A JDK upgrade may also change other performance numbers.
Ack -- I turned on just the JDK 19 upgrade last night!
PKLookup looks a bit happy about the upgrade, but otherwise I don't see other tasks showing much impact.
I'll leave this for a few days, then turn on --enable-preview.
If we do both at same time, we won't see a difference between old and new Lucene MMAP (on same version). A JDK upgrade may also change other performance numbers.
Ack -- I turned on just the JDK 19 upgrade last night! PKLookup looks a bit happy about the upgrade, but otherwise I don't see other tasks showing much impact. I'll leave this for a few days, then turn on --enable-preview.
OK, looks fine thanks!
Once you switch --enable-preview on, you should no longer see stack traces like this in the profile:
1.84% 209556 jdk.internal.misc.Unsafe#convEndian()
at jdk.internal.misc.Unsafe#getIntUnaligned()
at jdk.internal.misc.ScopedMemoryAccess#getIntUnalignedInternal()
at jdk.internal.misc.ScopedMemoryAccess#getIntUnaligned()
at java.nio.DirectByteBuffer#getInt()
at java.nio.DirectByteBuffer#getInt()
at org.apache.lucene.store.ByteBufferGuard#getInt()
at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl#readInt()
They should instead be visible as calls from MemorySegmentIndexInput to MemorySegment.get(); ByteBufferIndexInput calling DirectByteBuffer and so on should disappear.
Please do not wonder why "convEndian" is always on top of the call stack: the method is always called, even when the endianness already matches. It is just a no-op on x86.
After Mike switched to preview mode the results look good. The speed with MemorySegmentIndexInput is similar to the old ByteBuffer code.
https://home.apache.org/~mikemccand/lucenebench/
See change EZ. Of course what we can't test is highly concurrent reopening of the index and therefore closing shared MemorySessions at a high rate.
CC @mcimadamore
@uschindler FYI it looks like this change made indexing ~12% faster with BEST_SPEED on the stored fields benchmark: http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
@uschindler FYI it looks like this change made indexing ~12% faster with BEST_SPEED on the stored fields benchmark: http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
That is interesting. I did not look at indexing, because we write there. I think this might be because bulk merging was faster, am I correct? The new code is possibly cheaper when copying large parts from/to heap.
I haven't dug, but your reasoning could explain why we are seeing a similar speedup across both BEST_SPEED and BEST_COMPRESSION, since they benefit from the bulk merging optimization in a similar way.
It does not fully explain it, because the copy of memory between file and heap uses Unsafe#copyMemory in both cases. The only difference is that for ByteBuffer there is some hardcoded limit of 8 or 16 bytes (not exactly sure, I have to look it up): if the byte[] is not larger than that, it uses a copy loop. MemorySegment does not have this loop. This may explain why some vector stuff got a bit slower: when you copy a very short long[] from off-heap to heap, the ByteBuffer code will just read those longs in a sequence and not use Unsafe#copyMemory, which has overhead for small arrays.
In the future it would be great if Panama provided a method in OutputStream like stream.writeBytes(MemorySegment segment, long offset, long count). This would allow us to not copy at all and use a dd-like approach. We could then add IndexInput#copyTo to our API. I think we worked on something similar, but for copying, a completely off-heap approach would be best.
One issue with "fancy" copies (e.g. for bulk merging) is that we can't make them super-fancy anyway (e.g. zero-copy from fd to fd), as we still have to calculate the CRC32 checksum.
Yes you're right!
Maybe at some point the JDK will allow calling Checksum#update(MemorySegment, ...). Then we wouldn't need to copy multiple times; the checksum would be calculated on the off-heap slice directly. Until then, something like the sketch below is needed.
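A hedged sketch of what has to happen today without such an API (SegmentChecksumSketch is an illustrative name): checksumming an off-heap segment either copies into heap arrays or goes through ByteBuffer views, which are limited to 2 GiB and therefore require a slicing loop.

import java.lang.foreign.MemorySegment;
import java.util.zip.CRC32;

final class SegmentChecksumSketch {
  static long crc32(MemorySegment segment) {
    CRC32 crc = new CRC32();
    long remaining = segment.byteSize();
    long offset = 0;
    while (remaining > 0) {
      long len = Math.min(remaining, Integer.MAX_VALUE);
      // asByteBuffer() can only view up to 2 GiB, hence the slicing loop
      crc.update(segment.asSlice(offset, len).asByteBuffer());
      offset += len;
      remaining -= len;
    }
    return crc.getValue();
  }
}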
I have the feeling that the JDK people are planning to add MemorySegment in more places in Java's public API, at least everywhere I/O is involved and bulk processing of memory is required. They have already started for vectors.
@uschindler It looks like this benchmark has a noticeable slowdown with this change? https://home.apache.org/~mikemccand/lucenebench/MedTermDayTaxoFacets.html
Could be. It may be caused by too much copying of very small arrays (especially float[] and long[]) between off-heap and heap. For an explanation see above.
I don't think this is a huge problem because it mainly comes from missing optimizations in the JVM and not our code. It could look different in Java 20. My only guess where it might come from: the overhead of copyMemory is huge for small arrays. ByteBuffer, LongBuffer, and FloatBuffer internally loop for very short arrays. We could try to call super.readLongs() if the array is short (same for floats). But before doing this I would like to figure out what exactly is slow.
It would help if I could get a profile only from this test, or if somebody could explain which MMapDirectory operations are mainly used in those facets.
Code similar to this: https://github.com/openjdk/jdk/blob/37848a9ca2ab3021e7b3b2e112bab4631fbe1d99/src/java.base/share/classes/java/nio/X-Buffer.java.template#L929
The if statement uses a simple copy loop to read the values if the length is too short. We can do the same in our code by calling super to read into the array. It would be worth a try.
I was hoping that the JDK people would do it the same way for MemorySegment. A sketch of the threshold idea follows below.
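A hedged sketch of that threshold idea (hypothetical names and a made-up threshold, not the PR's code): read very short arrays value by value and only delegate larger reads to the bulk copy that ultimately uses Unsafe#copyMemory. Alignment and segment-boundary handling are elided.

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

final class ShortArrayReadSketch {
  private static final ValueLayout.OfLong LE_LONG =
      ValueLayout.JAVA_LONG.withOrder(ByteOrder.LITTLE_ENDIAN);
  private static final int BULK_THRESHOLD = 16; // made-up value, needs benchmarking

  static void readLongs(MemorySegment seg, long pos, long[] dst, int offset, int len) {
    if (len < BULK_THRESHOLD) {
      // simple copy loop for tiny arrays, like the ByteBuffer code does
      for (int i = 0; i < len; i++) {
        dst[offset + i] = seg.get(LE_LONG, pos + (long) i * Long.BYTES);
      }
    } else {
      // bulk copy, which ultimately uses Unsafe#copyMemory
      MemorySegment.copy(seg, LE_LONG, pos, dst, offset, len);
    }
  }
}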