Create Java version
We currently lack a Java version for Java users.
Hi @lemire Sure, I would love to work on it. The fact of the matter is that it's going to be much more productive if I do it for the entire Java community. So can you suggest exactly what you expect to be done? A little more explanation of your requirements would really help me create a Java wrapper for simdjson. I'll be happy to help.
@TkTech : can you share your thoughts? You are working on a couple of simdjson wrappers...?
Is anyone working on this?
@richardstartin Not to my knowledge. All I could find was this empty repo:
https://github.com/laingke/simdjson-java
I am working with @ioioioio who is busy on the Rust wrapper...
@lemire I don't see anything in "stage 2" (unified_machine) which can't be done as well in Java as it can natively, whereas stage 1 (find_structural_bits), as far as I can tell, is the brilliant part that couldn't be written in Java. I think it might make sense to wrap stage 1 with JNI, but port stage 2 to Java. Do you have any reservations about this approach?
@richardstartin
One concern regarding Java is the fact that it relies on UTF-16. Actual JSON content on the net is UTF-8. So there is a UTF-8 to UTF-16 bridge needed somewhere. (My understanding is that Swift started out UTF-16 but they have migrated to UTF-8.) That's not a big problem, of course.
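To make the bridge concrete, here is a minimal sketch (class and method names are mine, not from any existing wrapper) of the decode step a Java consumer would need before handing UTF-8 JSON text to String-based APIs; the constructor copies and transcodes the bytes:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bridge {
    // Decoding UTF-8 bytes into a Java String performs the UTF-8 -> UTF-16
    // (or Latin-1) bridge mentioned above, including a copy and a validation pass.
    public static String decode(byte[] utf8Json) {
        return new String(utf8Json, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] json = "{\"k\":\"v\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(json)); // prints {"k":"v"}
    }
}
```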
stage 1 (build_tape), as far as I can tell, is the brilliant part that couldn't be written in Java
It evidently can be written in Rust, as @Licenser showed. In Java, I would think not...
Note that we are in the process of making it fully cross-platform, with support for ARM NEON and such.
I don't see anything in "stage 2" (unified_machine) which can't be done as well in Java as it can natively
It is certainly interesting to just do stage 1, get back an index, and do the processing contained in stage 2 within Java. It is almost certain that the result of such work would be useful.
I would not sell Stage 2 short, however.
Stage 2 is a goto machine; it uses neat tricks for string parsing and number parsing.
Look at the string parsing...
https://github.com/lemire/simdjson/blob/master/include/simdjson/stringparsing.h
Number parsing...
https://github.com/lemire/simdjson/blob/master/include/simdjson/numberparsing.h
Look at how we check for 'true':
```cpp
bool is_valid_true_atom(const uint8_t *loc) {
  uint64_t tv = *reinterpret_cast<const uint64_t *>("true    ");
  uint64_t mask4 = 0x00000000ffffffff;
  uint32_t error = 0;
  uint64_t locval; // we want to avoid unaligned 64-bit loads (undefined in C/C++)
  // this can read up to 7 bytes beyond the buffer size, but we require
  // SIMDJSON_PADDING of padding
  static_assert(sizeof(uint64_t) - 1 <= SIMDJSON_PADDING);
  std::memcpy(&locval, loc, sizeof(uint64_t));
  error = (locval & mask4) ^ tv;
  error |= is_not_structural_or_whitespace(loc[4]);
  return error == 0;
}
```
So there are a lot of "easy in C++" tricks that are not so easy to get right in Java.
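For what it's worth, modern Java can express a similar trick with a little-endian `VarHandle` view over a byte array. The sketch below is mine, not simdjson code, and it simplifies the trailing `is_not_structural_or_whitespace` lookup table down to a small membership test:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class TrueAtom {
    private static final VarHandle LONG_LE =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // "true" sits in the low 32 bits of this little-endian 64-bit load.
    private static final long TRUE_BITS = (long) LONG_LE.get(
            "true    ".getBytes(StandardCharsets.US_ASCII), 0);
    private static final long MASK4 = 0x00000000FFFFFFFFL;

    // The caller must guarantee at least 7 bytes of padding after loc,
    // mirroring SIMDJSON_PADDING in the C++ code.
    public static boolean isValidTrueAtom(byte[] buf, int loc) {
        long locval = (long) LONG_LE.get(buf, loc);    // unaligned plain get is allowed
        boolean atomMatches = ((locval ^ TRUE_BITS) & MASK4) == 0;
        // Simplified stand-in for is_not_structural_or_whitespace(loc[4]).
        byte next = buf[loc + 4];
        boolean terminated = next == ',' || next == '}' || next == ']'
                || next == ' ' || next == '\t' || next == '\n' || next == '\r';
        return atomMatches && terminated;
    }
}
```

Plain (non-volatile) `VarHandle` access on a byte-array view tolerates unaligned indices, which is exactly what the `memcpy` in the C++ version is working around.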
@richardstartin But, yes, completing stage 1 and getting, in Java, an index into the document, with UTF-8 validation done, would be nice.
Hi, can I work on this issue? If yes, are we planning on rewriting the entire library in Java, or calling the functions from Java using Panama?
@lemire In Gson (written by Google), there is a class named "com.google.gson.stream.JsonReader". It has some basic functions for JSON like peek, beginArray, endArray, beginObject, endObject, hasNext, and so on. I overrode this class and used similar functions from simdjson to reimplement these basic operations. It works well, but I found it has no performance advantage on my test data. Deserializing JSON in Java requires reflection, which costs a lot, so I think a Java wrapper will not improve performance much. What do you think of my opinion?
@361442342 A better strategy might be to push down the computation to C++ (simdjson) and only recover, in Java, what you need.
@361442342 bottlenecking between Java and C++ is a real concern. There is overhead in each call across the boundary, and there is more overhead if you're translating objects between one and the other. Linking your code might help, though!
Constructing the data objects from the JSON must be completed inside JNI, otherwise there will be no gain. There are two methods: 1. A code generation tool that generates JNI C code based on the Java objects. 2. Traversing the JSON and looking up the matching Java object members in reverse.
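Method 2 above can be sketched in plain Java with hypothetical names (a `Map` stands in for the parsed JSON), which also shows where the reflection cost discussed earlier comes from:

```java
import java.lang.reflect.Field;
import java.util.Map;

public class ReflectiveBinder {
    // Walks the target class's fields and copies matching values out of the
    // parsed representation. Every field lookup and set goes through
    // reflection, which is the per-object overhead being discussed.
    public static <T> T bind(Class<T> cls, Map<String, Object> parsed) {
        try {
            T obj = cls.getDeclaredConstructor().newInstance();
            for (Field f : cls.getDeclaredFields()) {
                Object v = parsed.get(f.getName());
                if (v != null) {
                    f.setAccessible(true);
                    f.set(obj, v); // boxed values unbox into primitive fields
                }
            }
            return obj;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // Hypothetical target type for illustration.
    public static class Point {
        public long x;
        public long y;
    }
}
```

A code-generation approach (method 1) avoids this per-field reflective cost by emitting the field assignments at build time.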
The On Demand front-end is relevant here.
some thoughts:
regarding UTF-8
One concern regarding Java is the fact that it relies on UTF-16. Actual JSON content on the net is UTF-8. So there is a UTF-8 to UTF-16 bridge needed somewhere. (My understanding is that Swift started out UTF-16 but they have migrated to UTF-8.) That's not a big problem, of course.
The OpenJDK is currently working on migrating to UTF-8 by default; it's only a matter of time: https://bugs.openjdk.java.net/browse/JDK-8260266
regarding the binding overhead
bottlenecking between Java and C++ is a real concern. There is overhead in each call across the boundary, and there is more overhead if you're translating objects between one and the other. Linking your code might help, though!
The foreign linker API from OpenJDK 16 might reduce the overhead. I recently read a comment from the Panama lead architect about the performance of the new foreign linker API: https://github.com/bytedeco/javacpp/issues/453
Performance-wise, while the foreign linker does not speed up calls from Java to native, compared with JNI (**), there are some cases in which the new API fares quite well, and that's when native function accept/return parameter by value. In such cases, depending on how return/pass by value is emulated in JNI, the difference with the Foreign Linker API (which supports passing by value natively) can be non-negligible.
(**) in case of back to back native calls, or calls to native code that is very short-lived, the Foreign Linker API supports an invocation mode which avoids Java to native state transitions, which can be useful to reduce latency (but should be used with caution, as it has the potential to crash the VM - e.g. if the function calls back to Java). We are also investigating optimizations for upcalls (e.g. from native back to Java), and we're more confident there that there is more room for improvement (the code to go from Java to native is well optimized in JNI, but the same cannot be said for the other way around).
I have no idea if this could improve performance with the bindings, but it might be worth exploring. The lowest-overhead binding might be to use the sun.misc.Unsafe intrinsics directly, thereby bypassing JNI. This approach is used by JetBrains here: https://github.com/JetBrains/jsitter
Finally, if JNI overhead is acceptable, https://github.com/bytedeco/javacpp seems to be the state of the art for binding Java to C++. If the surface is only in C, the OpenJDK tool jextract could be used, which would automate the binding generation!
regarding an eventual Java-only implementation
here are some useful resources: https://github.com/bytedeco/javacpp/issues/402
Java 9 adds the @HotSpotIntrinsicCandidate annotation. Methods annotated with it have intrinsics: the JIT replaces the Java call with hand-written assembly inlined directly at the call site.
The key thing is that Java is getting SIMD support next month! https://openjdk.java.net/jeps/338 https://software.intel.com/content/www/us/en/develop/articles/vector-api-developer-program-for-java.html
tornadoVM is also quite fascinating https://github.com/beehive-lab/TornadoVM
A lot depends on what you want to support as the stable public API for the simdjson library. The cleanest language integration may be to make the tape formats part of the stable API, such that the integration has only to expose a C call that takes three buffers (raw json input bytes, plus two empty ones for the output tapes) and invokes the parser accordingly. The Java layer is then responsible for the buffer memory management and only one expensive JNI/panama call is needed per file/document, whilst the more numerous fine-grained operations to iterate the tape can stay entirely in fairly simple Java code and enjoy the usual JIT optimizations. Using the tape format as a language-neutral binary json encoding or cache format may also be an interesting direction.
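As a rough illustration of the fine-grained Java side of that design, assuming a simdjson-style tape where each 64-bit word carries a type code in its top byte and a 56-bit payload (per simdjson's tape documentation; the method names here are mine):

```java
public class TapeWalker {
    // Top byte of a tape word: node type, e.g. '{', '[', '"', 'l', 'd'.
    public static char type(long word) {
        return (char) (word >>> 56);
    }

    // Low 56 bits: an index, count, or offset depending on the node type.
    public static long payload(long word) {
        return word & 0x00FF_FFFF_FFFF_FFFFL;
    }

    // Example fine-grained operation: count string nodes on a tape. A real
    // wrapper would expose a cursor/iterator API, but per-word decoding is
    // this cheap and stays entirely in JIT-compiled Java.
    public static int countStrings(long[] tape) {
        int n = 0;
        for (long word : tape) {
            if (type(word) == '"') n++;
        }
        return n;
    }
}
```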
Keeping stage 1 in native code whilst stage 2 is in Java makes code maintenance and version synchronization more challenging, without any clear advantage. Perhaps you can reimplement stage 2 in Java, but why would you want to? It's bad enough to have to do it once :-) It only makes sense if you're aiming for a 100% Java reimplementation of the solution rather than an integration. The On Demand API is also architecturally harder, as it means crossing the JNI boundary more frequently. It may pay off, but my suspicion is the lazy approach is less compelling than it would be for C++ projects.
'UTF-8 by default' is a red herring. Java strings internally are either byte[] (if ascii, as it's more compact in memory) or char[] (UTF-16) but that's an implementation detail of java.lang.String. It has constructors for both, but crucially will always memcpy the origin data anyhow, so it's going to hurt equally either way from a performance perspective. The 'by default' part is just about what the assumed charset is if you supply a byte[] without an explicit encoding, which doesn't matter here as the integration code can always be explicit. If using the tape-API approach, you'd lazily create String or CharSequence instances over parts of the strings tape. In the former case unfortunately you pay for encoding validation again even though it's not needed, because the language runtime won't trust you already did it right.
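That revalidation cost is visible in plain Java: materializing a String from an already-validated UTF-8 slice still goes through the decoding constructor (a minimal illustration with made-up names, not wrapper code):

```java
import java.nio.charset.StandardCharsets;

public class StringTapeSlice {
    // Stand-in for a strings tape: raw UTF-8 bytes, with each string
    // identified by an (offset, length) pair recorded during parsing.
    public static String materialize(byte[] stringsTape, int offset, int length) {
        // This constructor copies the bytes and re-validates them as UTF-8,
        // even though the parser already validated the whole document.
        return new String(stringsTape, offset, length, StandardCharsets.UTF_8);
    }
}
```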
'@HotSpotIntrinsicCandidate' is likewise irrelevant, it's part of the internals of the standard libraries that ship with JVM runtime and is not applicable to user code. You can't supply your own intrinsic implementation, which conceptually would be similar to inlining assembler in C source code. That's a Good Thing for the platform overall, if somewhat frustrating.
JEP-338 is more interesting. It exposes SIMD functions through Vector and VectorMask objects, making the vector-parallel parts of the implementation possible from Java for the first time. However, there are boundary issues in getting the data in and out of that API efficiently, particularly with regard to treating results in a numeric rather than Object way, such that you can do the subsequent bit-parallel operations on them, even assuming the bit-parallel operation you want is available in Java (hint: it's not). The Mison approach of vector use for structural character search, followed by loop/branch processing, is made possible with the new Java vector API, but the current simdjson approach is not.
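On the missing bit-parallel operation: stage 1 computes its quote mask as a prefix XOR, implemented natively as a carry-less multiplication by an all-ones word (PCLMULQDQ), which Java does not expose. The same prefix XOR can be emulated with six shift-XOR steps over an ordinary long, at some cost (this sketch is mine, not simdjson code):

```java
public class PrefixXor {
    // Given a bitmask of quote positions, returns a mask whose bit i is the
    // XOR of input bits 0..i, i.e. the "inside a string" regions.
    // Hardware carry-less multiply does this in one instruction; here it
    // takes log2(64) = 6 shift-XOR steps.
    public static long prefixXor(long bits) {
        bits ^= bits << 1;
        bits ^= bits << 2;
        bits ^= bits << 4;
        bits ^= bits << 8;
        bits ^= bits << 16;
        bits ^= bits << 32;
        return bits;
    }
}
```

With quotes at bit positions 0 and 4, the result has bits 0 through 3 set: everything between the opening and closing quote is flagged as "inside a string".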
My feeling is that eventually some fast parsing (though not necessarily explicitly of JSON) using SIMD is likely to be an implementation detail of the JVM's string and number handling internals, implemented as intrinsics (i.e. C/C++) rather than using the Java vector API. The UTF-8 validation or float parsing techniques, for example, are fairly modular and could make their way into the standard library implementation, which would pay off for all use cases, coincidentally including existing Java JSON parsers.
@jhalliday I share your sentiment.
Porting the float parsing routine to Java is certainly something that is doable... https://github.com/fastfloat/fast_float/issues/58 It was ported to C# https://github.com/carlVerret/csFastFloat with good results. Having the code in your own language is always nicer, everything else being equal.
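For a sense of scale, the core fast path of such parsers (Clinger's observation, used by fast_float) is already expressible in ordinary Java: when the decimal mantissa fits in 53 bits and the power of ten is at most 10^22, both operands are exact doubles, so a single multiply or divide is correctly rounded. A hedged sketch (a full parser falls back to a slow path outside this range):

```java
public class FastPathFloat {
    // Powers of ten up to 1e22 are exactly representable as doubles.
    private static final double[] POW10 = {
            1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
            1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22
    };

    // Returns mantissa * 10^exp10. Because mantissa < 2^53 and the power of
    // ten are both exact, the single multiply/divide incurs only one rounding
    // and therefore yields the correctly rounded result.
    public static double fastPath(long mantissa, int exp10) {
        if (mantissa < 0 || mantissa >= (1L << 53) || Math.abs(exp10) > 22) {
            throw new IllegalArgumentException("outside the fast path");
        }
        double d = (double) mantissa;
        return exp10 >= 0 ? d * POW10[exp10] : d / POW10[-exp10];
    }
}
```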
Many of the "dirty tricks" like our stupidly fast UTF8 validation could be integrated within Java.
I've been thinking a lot about bindings lately ... given a large overhead for crossing runtime boundaries (as well as the impossibility of cross-boundary optimization), it might make the most sense to:
- use the simdjson tokenization backend (utf-8 parser, simd stuff)
- expose the array of indexes to Java
- implement the On Demand frontend in Java
The approach has proven itself really well in C++, and I think now that the trail is blazed building other bindings might not be terrible to write (though certainly harder than just naked calls to the backend, since you have to implement logic).
Most of the inlining and optimization opportunities for Java are in the frontend, anyway.
Depending on Java's StringBuffer API, I think we could avoid a double string allocation, too.
@jkeiser Though it is probably quite a bit of work, your scenario does sound entirely doable within a few weeks.
Note that it naturally extends to C#... @EgorBo ... and maybe other programming languages.
@AugustNagro Ported the UTF8 validation to pure Java with vectors...
https://github.com/AugustNagro/utf8.java
Their README is not entirely clear to me, but it seems that the result is negative for the time being (that is, the net result is slower). I have not looked at the code at all.
I am sure that @AugustNagro is a great hacker, but it is not impossible that a thorough code review could help.
cc @richardstartin @jkeiser
I wish it were faster too & welcome code review from anyone interested. There really isn't much code. Hopefully I can get away with saying it's the vector api's fault, and not mine.
I don't doubt that someone could implement the whole of simd-json with jdk.incubator.vector, but will it be fast? My loosely-held opinion is: probably not for a while.
@AugustNagro What you did is great.
If someone could just drop the JIT assembly output (asm code) then we could tell quickly for sure where the fault lies.
@lemire Paul Sandoz from Oracle is taking a look, and has made some interesting findings!
https://mail.openjdk.java.net/pipermail/panama-dev/2021-March/012355.html
@AugustNagro Thanks for sharing the link. It does seem that the issue might be in the compiler stack. That is to be expected given how new this technology is.
I ran @AugustNagro 's benchmark with the current build of the Vector API. The performance seems to have improved since then: https://github.com/AugustNagro/utf8.java/pull/3 . I am also exploring performance improvements at https://github.com/AugustNagro/utf8.java/pull/4 .
As of Java 18 UTF-8 is the default: https://inside.java/2021/10/04/the-default-charset-jep400/
@JohannesLichtenberger But Java will still store strings as UTF-16 though, right?
@cl4es
According to cl4es's blog:
All Strings that can be are represented internally in an ISO-8859-1 binary format (Latin-1).
If a string contains at least one non-ASCII character, it falls back to UTF-16 (for the whole string, not character-wise), if I understand correctly.
In JDK 18, cl4es significantly optimized the charset encoders, e.g. by making UTF-8 use a new intrinsic (for ASCII-compatible strings) which makes it nearly as fast as Latin-1 encoding (the fastest encoder). Unfortunately, it looks like UTF-16 encoding is among the slowest and has not been ported to use the new intrinsic?
So the problems are: 1) UTF-16 is not intrinsified; 2) for non-strictly-ASCII strings, UTF-16 is in itself slower than UTF-8 (and less memory-efficient, if I recall correctly); 3) the external world assumes UTF-8, and this is the direction of JEP 400, but internally the JDK will still needlessly transcode UTF-8 inputs into UTF-16 (if they cannot be transcoded into Latin-1)? Even the UTF-8 to Latin-1 translation seems like needless overhead. What are your thoughts on this @cl4es? BTW I have zero expertise in this topic and might be making multiple erroneous statements.
Yes, String remains effectively UTF-16-encoded, with an implementation detail to encode using ISO-8859-1 when possible.
Representing Strings using UTF-8 internally has been brought up a few times. Getting there would be a much larger endeavor than changing the default encoding as proposed by JEP-400. And while decoding/encoding to/from UTF-8 would be a bit (or possibly a lot) faster, it might not be as clear-cut from a performance point of view as you probably think.
First off you'd likely pessimize a number of String operations - such as charAt - which are unfortunately heavily depended on in some rather performance-sensitive places. A lot of internal and third-party APIs would have to be re-examined, or face heavy regressions as instant O(1) lookups turn into O(n) scans.
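To illustrate the charAt point: with UTF-8 storage, locating the n-th character requires a scan, because code points have variable width. A minimal sketch, assuming valid UTF-8 input:

```java
public class Utf8Index {
    // Byte offset of the n-th code point in valid UTF-8: an O(n) scan,
    // versus the O(1) array index that Latin-1/UTF-16 storage permits.
    public static int offsetOfCodePoint(byte[] utf8, int n) {
        int i = 0;
        while (n-- > 0) {
            int lead = utf8[i] & 0xFF;  // the leading byte determines the width
            i += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        }
        return i;
    }
}
```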
Had we started from a blank slate and a String that doesn't leak implementation details via indexed accessors (instead pushes users towards iterators or streams) then UTF-8 would have been a much more obviously good choice. With all the legacy code out there today I'm not convinced it'll be worth the disruption to re-imagine the internal representation yet again.
(UTF-16-encoded Strings do see a lot of help from various intrinsics in the OpenJDK. But they use more memory, which means microbenchmarks will show that throughput in many operations involving UTF-16 Strings stays roughly at pre-JDK 9 levels. There might be a couple of minor opportunities to improve handling of UTF-16 in the encode/decode area.)
@cl4es A derived project from simdjson is simdutf. It provides really fast UTF-8 to UTF-16 transcoding as well as a few other nice functions:
https://github.com/simdutf/simdutf
It is portable (NEON, AVX, SSE...):
- Daniel Lemire, Wojciech Muła, Transcoding Billions of Unicode Characters per Second with SIMD Instructions, Software: Practice and Experience
It is seriously tested and so forth. It could be used in other runtimes.
We do not have UTF-16 to latin1, but that's something we could do. Our code is really fast, please see our benchmarks.
You might also enjoy the following Java library:
https://github.com/AugustNagro/utf8.java
It implements the fast UTF-8 validator found in simdjson.
@lemire interesting, thanks. I don't have any experience integrating third-party code into the OpenJDK, but if you want to collaborate on contributing something I'd be happy to facilitate. I'm not sure we could make good use of the C++ implementation, though.
I'm also not sure the Java library is something we could integrate into String as it stands: The Vector API needs to be finalized first of all, and even after it goes final then having a direct dependency from String on the Vector API might open us up to various bootstrapping issues. Could perhaps be solved by having a mechanism to replace the code at runtime at some point after a primordial String impl and the Vector API has first been initialized, then let that go through JIT optimization. VM intrinsics implemented in Java..! We do code replacement elsewhere, e.g. when starting JFR. It might trigger some deoptimization and a bit of a drag on startup, but shouldn't be impossible with some time and effort..
@cl4es Sure, integrating this library or similar code through the Vector API would be easier, but it might require waiting for stabilization. However, it's always possible to port the UTF-8 algorithm to classical JVM intrinsics, as was recently done in java.util.Base64 by @asgibbons @sviswanathan
A classical JVM intrinsic could work great, yeah, but porting to HotSpot intrinsics will likely take significant effort. I don't have the expertise to do it, and those I know at Oracle who might have a rather busy schedule ahead. If someone is keen on getting this under consideration I think a community effort to prototype it might have the best chance to succeed. I'd be happy to help out as best I can with such an effort.
The base64 codec was obviously a lot of work (@asgibbons would know), but the UTF-8 validator is actually simpler... So it seems like it is within striking distance.
Doing UTF-8 <=> UTF-16 transcoding would be more work.
@LifeIsStrange Fun fact: the java.util.Base64 code by @asgibbons is based on a routine from our paper Base64 encoding and decoding at almost the speed of a memory copy (with @WojciechMula). It is the same @WojciechMula from the Transcoding Billions of Unicode Characters per Second with SIMD Instructions paper quoted above.
(This got me very excited because I did not expect anyone to care enough about doing base64 at crazy speeds... to do the hard work needed to push it inside the JDK!!! Very exciting.)
Maybe the Vector API also got much more stable in the meantime?