Create Java version

Open lemire opened this issue 7 years ago • 36 comments

We currently lack a Java version for Java users.

lemire avatar Mar 14 '19 19:03 lemire

Hi @lemire Sure, I would love to work on it. The fact of the matter is that it's going to be much more productive if I do it for the entire Java community. So can you suggest exactly what you are expecting to be done? A little more explanation of your requirements would really help me create a Java wrapper for simdjson. I'll be happy to help.

pra-bhu avatar Mar 27 '19 21:03 pra-bhu

@TkTech : can you share your thoughts? You are working on a couple of simdjson wrappers...?

lemire avatar Mar 27 '19 21:03 lemire

Is anyone working on this?

richardstartin avatar May 24 '19 18:05 richardstartin

@richardstartin Not to my knowledge. All I could find was this empty repo:

https://github.com/laingke/simdjson-java

I am working with @ioioioio who is busy on the Rust wrapper...

lemire avatar May 24 '19 19:05 lemire

@lemire I don't see anything in "stage 2" (unified_machine) which can't be done as well in Java as it can natively, whereas stage 1 (find_structural_bits), as far as I can tell, is the brilliant part that couldn't be written in Java. I think it might make sense to wrap stage 1 with JNI, but port stage 2 to Java. Do you have any reservations about this approach?

richardstartin avatar May 26 '19 15:05 richardstartin

@richardstartin

One concern regarding Java is the fact that it relies on UTF-16. Actual JSON content on the net is UTF-8. So there is a UTF-8 to UTF-16 bridge needed somewhere. (My understanding is that Swift started out UTF-16 but they have migrated to UTF-8.) That's not a big problem, of course.

stage 1 (find_structural_bits), as far as I can tell, is the brilliant part that couldn't be written in Java

It evidently can be written in Rust, as @Licenser showed. In Java, I would think not...

Note that we are in the process of making it fully cross-platform, with support for ARM NEON and such.

I don't see anything in "stage 2" (unified_machine) which can't be done as well in Java as it can natively

It is certainly interesting to just do stage 1, get back an index and do the processing that is contained in Stage 2 within Java. It is almost certain that the result of such work would be useful.

I would not sell Stage 2 short, however.

Stage 2 is a goto machine; it uses neat tricks for string parsing and number parsing.

Look at the string parsing...

https://github.com/lemire/simdjson/blob/master/include/simdjson/stringparsing.h

Number parsing...

https://github.com/lemire/simdjson/blob/master/include/simdjson/numberparsing.h

Look at how we check for 'true':

bool is_valid_true_atom(const uint8_t *loc) {
  uint64_t tv = *reinterpret_cast<const uint64_t *>("true    ");
  uint64_t mask4 = 0x00000000ffffffff;
  uint32_t error = 0;
  uint64_t locval; // we want to avoid unaligned 64-bit loads (undefined in C/C++)
  // this can read up to 7 bytes beyond the buffer size, but we require 
  // SIMDJSON_PADDING of padding
  static_assert(sizeof(uint64_t) - 1 <= SIMDJSON_PADDING);
  std::memcpy(&locval, loc, sizeof(uint64_t));
  error = (locval & mask4) ^ tv;
  error |= is_not_structural_or_whitespace(loc[4]);
  return error == 0;
}

So there are a lot of "easy in C++" tricks that are not so easy to get right in Java.
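A rough idea of how the 'true' check above might look in Java, as a sketch rather than simdjson code. The names `TrueAtom`, `isValidTrueAtom` and `isStructuralOrWhitespace` are made up for illustration, an explicit bounds check stands in for `SIMDJSON_PADDING`, and treating a zero byte as a terminator is an assumption.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class TrueAtom {
    // Reads a little-endian int from a byte[] at any (possibly unaligned) offset.
    private static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // The bytes 't','r','u','e' packed little-endian ('t' is the lowest byte).
    private static final int TRUE_LE = 't' | ('r' << 8) | ('u' << 16) | ('e' << 24);

    // Hypothetical helper: the byte after an atom must end the token.
    static boolean isStructuralOrWhitespace(byte b) {
        switch (b) {
            case ' ': case '\t': case '\n': case '\r':
            case ',': case ':': case '[': case ']': case '{': case '}':
            case 0: // assumption: zero padding after the document counts as a terminator
                return true;
            default:
                return false;
        }
    }

    // Compares four bytes at once, then checks the byte that follows.
    static boolean isValidTrueAtom(byte[] buf, int pos) {
        // Java has no padding guarantee, so check that 5 bytes are readable.
        if (pos < 0 || pos + 5 > buf.length) {
            return false;
        }
        int word = (int) INT_LE.get(buf, pos);
        return word == TRUE_LE && isStructuralOrWhitespace(buf[pos + 4]);
    }
}
```

The single 32-bit comparison is the part that carries over directly; whether the JIT compiles the `VarHandle` read down to one load is something to measure rather than assume.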

lemire avatar May 26 '19 16:05 lemire

@richardstartin But, yes, completing stage 1 and getting, in Java, an index into the document, with UTF-8 validation done, would be nice.

lemire avatar May 26 '19 16:05 lemire

Hi, can I work on this issue? If yes, are we planning on rewriting the entire library in Java or calling the functions from Java using Panama?

balashashanka avatar Apr 24 '20 05:04 balashashanka

@lemire In Gson (written by Google), there is a class named "com.google.gson.stream.JsonReader". It has basic JSON functions such as peek, beginArray, endArray, beginObject, endObject, hasNext and so on. I overrode this class and used the corresponding simdjson functions to reimplement these basic operations. It works well, but in my tests it showed no performance advantage. Deserializing JSON in Java relies on reflection, which is expensive, so I think a Java wrapper will not improve performance much. What do you think?

sunny-shu avatar Apr 24 '20 06:04 sunny-shu

@361442342 A better strategy might be to push down the computation to C++ (simdjson) and only recover, in Java, what you need.

lemire avatar Apr 24 '20 11:04 lemire

@361442342 bottlenecking between Java and C++ is a real concern. There is overhead in each call across the boundary, and there is more overhead if you're translating objects between one and the other. Linking your code might help, though!

jkeiser avatar Apr 24 '20 15:04 jkeiser

Building data objects from JSON must be done entirely on the JNI side; otherwise there will be no gain. There are two methods: 1. Use a code generation tool to generate JNI C code based on the Java objects. 2. Traverse the JSON and look up the matching Java object members in reverse.

szcnick avatar Apr 26 '20 03:04 szcnick

The On Demand front-end is relevant here.

lemire avatar Feb 01 '21 21:02 lemire

some thoughts:

regarding UTF-8

One concern regarding Java is the fact that it relies on UTF-16. Actual JSON content on the net is UTF-8. So there is a UTF-8 to UTF-16 bridge needed somewhere. (My understanding is that Swift started out UTF-16 but they have migrated to UTF-8.) That's not a big problem, of course.

OpenJDK is currently working on migrating to UTF-8 by default; it's only a matter of time: https://bugs.openjdk.java.net/browse/JDK-8260266

regarding the binding overhead

bottlenecking between Java and C++ is a real concern. There is overhead in each call across the boundary, and there is more overhead if you're translating objects between one and the other. Linking your code might help, though!

The foreign linker API from OpenJDK 16 might reduce the overhead: I recently read a comment from the Panama lead architect about the performance of the new foreign linker API https://github.com/bytedeco/javacpp/issues/453

Performance-wise, while the foreign linker does not speed up calls from Java to native, compared with JNI (**), there are some cases in which the new API fares quite well, and that's when native function accept/return parameter by value. In such cases, depending on how return/pass by value is emulated in JNI, the difference with the Foreign Linker API (which supports passing by value natively) can be non-negligible.

(**) in case of back to back native calls, or calls to native code that is very short-lived, the Foreign Linker API supports an invocation mode which avoids Java to native state transitions, which can be useful to reduce latency (but should be used with caution, as it has the potential to crash the VM - e.g. if the function calls back to Java). We are also investigating optimizations for upcalls (e.g. from native back to Java), and we're more confident there that there is more room for improvement (the code to go from Java to native is well optimized in JNI, but the same cannot be said for the other way around).

I have no idea if this could improve performance with the bindings but that might be worth exploring. The lowest overhead binding might be to use the sun.misc.Unsafe intrinsics directly hence bypassing the JNI. This approach is used by jetbrains here: https://github.com/JetBrains/jsitter

Finally, if JNI overhead is acceptable, https://github.com/bytedeco/javacpp seems to be the state of the art for binding Java to C++. If the surface is only in C, the OpenJDK tool jextract could be used, which could automate the binding generation!

regarding an eventual java only implementation

here are some useful resources: https://github.com/bytedeco/javacpp/issues/402

Java 9 adds the @HotSpotIntrinsicCandidate annotation. Methods carrying that annotation may have intrinsics provided for them. Intrinsic functions are functions whose Java calls are inlined as direct assembly.

The key thing is that java is getting SIMD support next month! https://openjdk.java.net/jeps/338 https://software.intel.com/content/www/us/en/develop/articles/vector-api-developer-program-for-java.html

tornadoVM is also quite fascinating https://github.com/beehive-lab/TornadoVM

LifeIsStrange avatar Feb 21 '21 04:02 LifeIsStrange

A lot depends on what you want to support as the stable public API for the simdjson library. The cleanest language integration may be to make the tape formats part of the stable API, such that the integration has only to expose a C call that takes three buffers (raw json input bytes, plus two empty ones for the output tapes) and invokes the parser accordingly. The Java layer is then responsible for the buffer memory management and only one expensive JNI/panama call is needed per file/document, whilst the more numerous fine-grained operations to iterate the tape can stay entirely in fairly simple Java code and enjoy the usual JIT optimizations. Using the tape format as a language-neutral binary json encoding or cache format may also be an interesting direction.
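To make the "iterate the tape in plain Java" part concrete, here is a minimal sketch. It assumes a simplified layout loosely modeled on simdjson's documented tape: 64-bit words with a type character in the top 8 bits, a 56-bit payload, and numbers stored in the word that follows their tag. `TapeWalker` and its methods are hypothetical names, and the tape would in practice be filled by the single native call, not built by hand.

```java
final class TapeWalker {
    static final int TYPE_SHIFT = 56;
    static final long PAYLOAD_MASK = (1L << TYPE_SHIFT) - 1;

    // Packs a type tag and a payload into one tape word (used here to build test tapes).
    static long word(char type, long payload) {
        return ((long) type << TYPE_SHIFT) | (payload & PAYLOAD_MASK);
    }

    static char typeOf(long w) {
        return (char) (w >>> TYPE_SHIFT);
    }

    static long payloadOf(long w) {
        return w & PAYLOAD_MASK;
    }

    // Sums every signed integer ('l') on the tape.
    // Assumes a well-formed tape: each number tag is followed by a value word.
    static long sumIntegers(long[] tape) {
        long sum = 0;
        for (int i = 0; i < tape.length; i++) {
            char t = typeOf(tape[i]);
            if (t == 'l') {
                sum += tape[++i];   // the next word holds the raw int64 value
            } else if (t == 'd' || t == 'u') {
                i++;                // skip the raw double / uint64 value word
            }
        }
        return sum;
    }
}
```

The loop touches only a `long[]`, so there are no boundary crossings per element and the JIT is free to optimize it like any other Java code.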

Keeping stage1 in native code whilst stage2 is in Java presents a more challenging approach for code maintenance and version synchronization, without any clear advantage. Perhaps you can reimplement stage2 in Java, but why would you want to? It's bad enough to have to do it once :-) Only makes sense if you're aiming for a 100% Java reimplementation of the solution rather than an integration. The on-demand API is also architecturally harder, as it means crossing the JNI boundary more frequently. It may pay off, but my suspicion is the lazy approach is less compelling than it would be for C++ projects.

'UTF-8 by default' is a red herring. Java strings internally are backed by either Latin-1 bytes (when every character fits, as it's more compact in memory) or UTF-16, but that's an implementation detail of java.lang.String. It has constructors for both byte[] and char[] input, but crucially will always memcpy the origin data anyhow, so it's going to hurt equally either way from a performance perspective. The 'by default' part is just about what the assumed charset is if you supply a byte[] without an explicit encoding, which doesn't matter here as the integration code can always be explicit. If using the tape-API approach, you'd lazily create String or CharSequence instances over parts of the strings tape. In the former case unfortunately you pay for encoding validation again even though it's not needed, because the language runtime won't trust you already did it right.
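A minimal sketch of the lazy approach, assuming the string tape is simply a `byte[]` of UTF-8 and that offset and length are already known from the parse. `LazyUtf8String` is a hypothetical class, not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class LazyUtf8String implements CharSequence {
    private final byte[] buf;
    private final int offset;
    private final int length;
    private String decoded; // filled in on first use

    LazyUtf8String(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    // Decoding (and the copy it implies) happens only when the value is actually read.
    @Override
    public String toString() {
        if (decoded == null) {
            // The charset is explicit, so the platform default never matters here.
            decoded = new String(buf, offset, length, StandardCharsets.UTF_8);
        }
        return decoded;
    }

    @Override
    public int length() {
        return toString().length();
    }

    @Override
    public char charAt(int index) {
        return toString().charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return toString().subSequence(start, end);
    }
}
```

Fields that are never touched cost nothing beyond the small wrapper object; fields that are touched still pay for the copy and the decoder's own validation, as described above.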

'@HotSpotIntrinsicCandidate' is likewise irrelevant, it's part of the internals of the standard libraries that ship with JVM runtime and is not applicable to user code. You can't supply your own intrinsic implementation, which conceptually would be similar to inlining assembler in C source code. That's a Good Thing for the platform overall, if somewhat frustrating.

JEP-338 is more interesting. It exposes SIMD functions through Vector and VectorMask objects, making the vector-parallel parts of the implementation possible from Java for the first time. However, there are boundary issues in getting the data in and out of that API efficiently, particularly with regard to treating results in a numeric rather than Object way, such that you can do the subsequent bit-parallel operations on them, even assuming the bit-parallel operation you want is available in Java (hint: it's not). The mison approach of vector use for structural character search, followed by loop/branch processing, is made possible with the new Java vector API, but the current simdjson approach is not.

My feeling is that eventually some fast parsing (though not necessarily explicitly of JSON) using SIMD is likely to be an implementation detail of the JVM's string and number handling internals and implemented as intrinsics (i.e. C/C++) rather than using the Java vector API. The UTF-8 validation or float parsing techniques, for example, are fairly modular and could make their way into the standard library implementation, which would pay off for all use cases, coincidentally including existing Java JSON parsers.

jhalliday avatar Mar 01 '21 15:03 jhalliday

@jhalliday I share your sentiment.

Porting the float parsing routine to Java is certainly something that is doable... https://github.com/fastfloat/fast_float/issues/58 It was ported to C# https://github.com/carlVerret/csFastFloat with good results. Having the code in your own language is always nicer, everything else being equal.

Many of the "dirty tricks" like our stupidly fast UTF8 validation could be integrated within Java.
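For comparison, this is roughly the baseline a faster validator would be measured against in Java today: strict validation through the JDK's own `CharsetDecoder`. It is a sketch of the scalar, allocation-heavy path, not the simdjson algorithm; `Utf8Validation` is an illustrative name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validation {
    // Returns true if the bytes are well-formed UTF-8 according to the JDK decoder.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Decodes into a throwaway CharBuffer; only success or failure matters.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that this decodes the whole input just to answer yes or no, which is exactly the kind of overhead a dedicated validator avoids.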

lemire avatar Mar 01 '21 15:03 lemire

I've been thinking a lot about bindings lately ... given a large overhead for crossing runtime boundaries (as well as the impossibility of cross-boundary optimization), it might make the most sense to:

  • use the simdjson tokenization backend (utf-8 parser, simd stuff)
  • expose the array of indexes to Java
  • implement the On Demand frontend in Java

The approach has proven itself really well in C++, and I think now that the trail is blazed building other bindings might not be terrible to write (though certainly harder than just naked calls to the backend, since you have to implement logic).

Most of the inlining and optimization opportunities for Java are in the frontend, anyway.
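A very small sketch of what a Java frontend over the index array could look like. It assumes the backend hands back an `int[]` holding the byte offset of each structural character and of the first byte of each scalar, and it only handles flat objects with unescaped keys; `FlatObjectCursor` and `findValue` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class FlatObjectCursor {
    // Returns the byte offset where the value of `key` starts, or -1 if absent.
    // Limitations of this sketch: no nested values, no escape sequences in keys.
    static int findValue(byte[] json, int[] structurals, String key) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        if (structurals.length == 0 || json[structurals[0]] != '{') {
            return -1;
        }
        int i = 1;
        // Each field contributes four entries: key, ':', value, then ',' or '}'.
        while (i + 2 < structurals.length && json[structurals[i]] == '"') {
            int keyStart = structurals[i] + 1; // skip the opening quote
            int keyEnd = keyStart + k.length;  // where the closing quote must be
            boolean match = keyEnd < json.length
                    && json[keyEnd] == '"'
                    && Arrays.equals(json, keyStart, keyEnd, k, 0, k.length);
            if (match && json[structurals[i + 1]] == ':') {
                return structurals[i + 2];
            }
            i += 4;
        }
        return -1;
    }
}
```

Only the requested key is ever compared against, and no value is parsed until the caller asks for it, which is the On Demand idea in miniature.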

jkeiser avatar Mar 02 '21 16:03 jkeiser

Depending on Java's StringBuffer API, I think we could avoid a double string allocation, too.

jkeiser avatar Mar 02 '21 16:03 jkeiser

@jkeiser Though it is probably quite a bit of work, your scenario does sound entirely doable in a few weeks of work.

Note that it naturally extends to C#... @EgorBo ... and maybe other programming languages.

lemire avatar Mar 02 '21 18:03 lemire

@AugustNagro Ported the UTF8 validation to pure Java with vectors...

https://github.com/AugustNagro/utf8.java

Their README is not entirely clear to me, but it seems that the result is negative for the time being (that is, the net result is slower). I have not looked at the code at all.

I am sure that @AugustNagro is a great hacker, but it is not impossible that a thorough code review could help.

cc @richardstartin @jkeiser

lemire avatar Mar 02 '21 22:03 lemire

I wish it were faster too & welcome code review from anyone interested. There really isn't much code. Hopefully I can get away with saying it's the vector api's fault, and not mine.

I don't doubt that someone could implement the whole of simd-json with jdk.incubator.vector, but will it be fast? My loosely-held opinion is: probably not for a while.

AugustNagro avatar Mar 03 '21 04:03 AugustNagro

@AugustNagro What you did is great.

If someone could just drop the JIT assembly output (asm code) then we could tell quickly for sure where the fault lies.

lemire avatar Mar 03 '21 18:03 lemire

@lemire Paul Sandoz from Oracle is taking a look, and has made some interesting findings!

https://mail.openjdk.java.net/pipermail/panama-dev/2021-March/012355.html

AugustNagro avatar Mar 08 '21 20:03 AugustNagro

@AugustNagro Thanks for sharing the link. It does seem that the issue might be in the compiler stack. That is to be expected given how new this technology is.

lemire avatar Mar 09 '21 00:03 lemire

I ran @AugustNagro's benchmark with the current build of the Vector API. The performance seems to have improved since then: https://github.com/AugustNagro/utf8.java/pull/3 . I am also exploring performance improvements at https://github.com/AugustNagro/utf8.java/pull/4 .

amCap1712 avatar Sep 10 '21 20:09 amCap1712

As of Java 18 UTF-8 is the default: https://inside.java/2021/10/04/the-default-charset-jep400/

JohannesLichtenberger avatar Nov 11 '21 14:11 JohannesLichtenberger

@JohannesLichtenberger But Java will still store strings as UTF-16 though, right?

lemire avatar Nov 11 '21 17:11 lemire

@cl4es

LifeIsStrange avatar Nov 11 '21 19:11 LifeIsStrange

According to Cl4es blog

all Strings that can be will be represented internally in a ISO-8859-1 binary format (latin-1).

If a string contains at least one character that cannot be represented in Latin-1, the whole string falls back to UTF-16 (not character-wise), if I understand correctly.

In JDK 18, cl4es significantly optimized charset encoders, by making e.g UTF 8 use a new intrinsic (for ASCII compatible strings) which make it nearly as fast as latin-1 encoding (the fastest encoder). Unfortunately, it looks like UTF 16 encoding is among the slowest and has not been ported to use the new intrinsic?

So the problems are: 1) UTF-16 is not intrinsified; 2) for strings that are not strictly ASCII, UTF-16 is in itself slower than UTF-8 (and less memory efficient, if I recall correctly); 3) the external world assumes UTF-8, and this is the direction of JEP 400, but internally the JDK will still needlessly transcode UTF-8 inputs into UTF-16 (if they cannot be transcoded into Latin-1)? Even the UTF-8 to Latin-1 translation seems like needless overhead. What are your thoughts on this, @cl4es? BTW, I have zero expertise in this topic and might be making multiple erroneous statements.

LifeIsStrange avatar Nov 11 '21 20:11 LifeIsStrange

Yes, String remains effectively UTF-16-encoded, with an implementation detail to encode using ISO-8859-1 when possible.
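The rule can be modeled in a few lines. This is only a sketch of the decision described above (one byte per character when everything fits in ISO-8859-1, two bytes per character otherwise); it does not inspect the JVM's actual storage, and `CompactStringModel` is an illustrative name.

```java
final class CompactStringModel {
    // True when every UTF-16 code unit fits in one Latin-1 byte (0x00..0xFF).
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Estimated size of the character payload only (object headers excluded).
    static int payloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }
}
```

A single character outside Latin-1 doubles the payload of the whole string, since the choice is made per string rather than per character.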

Representing Strings using UTF-8 internally has been brought up a few times. Getting there would be a much larger endeavor than changing the default encoding as proposed by JEP-400. And while decoding/encoding to/from UTF-8 would be a bit (or possibly a lot) faster, it might not be as clear-cut from a performance point-of-view as you probably think.

First off you'd likely pessimize a number of String operations - such as charAt - which are unfortunately heavily depended on in some rather performance-sensitive places. A lot of internal and third-party APIs would have to be re-examined, or face heavy regressions as instant O(1) lookups turn into O(n) scans.
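To illustrate why indexed access turns into a scan, here is a sketch of locating the n-th code point in a UTF-8 buffer. It assumes the input is already valid UTF-8, and `Utf8Index` is a made-up name. (Today's `charAt` indexes UTF-16 code units, so the two are not exactly equivalent, but the cost argument is the same.)

```java
final class Utf8Index {
    // Returns the byte offset of the n-th code point (0-based), or -1 if out of range.
    // Every lookup walks the buffer from the start: O(n) instead of O(1).
    static int byteOffsetOfCodePoint(byte[] utf8, int n) {
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            // Continuation bytes have the form 10xxxxxx; anything else starts a code point.
            if ((utf8[i] & 0xC0) != 0x80) {
                if (seen == n) {
                    return i;
                }
                seen++;
            }
        }
        return -1;
    }
}
```

Code that calls an indexed accessor inside a loop would therefore go from linear to quadratic unless the runtime caches positions somehow.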

Had we started from a blank slate and a String that doesn't leak implementation details via indexed accessors (instead pushes users towards iterators or streams) then UTF-8 would have been a much more obviously good choice. With all the legacy code out there today I'm not convinced it'll be worth the disruption to re-imagine the internal representation yet again.

(UTF-16-encoded Strings do see a lot of help from various intrinsics in the OpenJDK. But it does use more memory, which means microbenchmarks will show throughput in many operations involving UTF-16 Strings stay behind roughly at pre-JDK 9 levels. There might be a couple of minor opportunities to improve handling of UTF-16 in the encode/decode area.)

cl4es avatar Nov 11 '21 21:11 cl4es

@cl4es A derived project from simdjson is simdutf. It provides really fast UTF-8 to UTF-16 transcoding as well as a few other nice functions:

https://github.com/simdutf/simdutf

It is portable (NEON, AVX, SSE...).

It is seriously tested and so forth. It could be used in other runtimes.

We do not have UTF-16 to latin1, but that's something we could do. Our code is really fast, please see our benchmarks.

You might also enjoy the following Java library:

https://github.com/AugustNagro/utf8.java

It implements the fast UTF-8 validator found in simdjson.

lemire avatar Nov 11 '21 21:11 lemire

@lemire interesting, thanks. I don't have any experience integrating third party code into the OpenJDK, though, but if you want to collaborate on contributing something I'd be happy to facilitate. I'm not sure we could make good use of the C++ implementation, though.

I'm also not sure the Java library is something we could integrate into String as it stands: The Vector API needs to be finalized first of all, and even after it goes final then having a direct dependency from String on the Vector API might open us up to various bootstrapping issues. Could perhaps be solved by having a mechanism to replace the code at runtime at some point after a primordial String impl and the Vector API has first been initialized, then let that go through JIT optimization. VM intrinsics implemented in Java..! We do code replacement elsewhere, e.g. when starting JFR. It might trigger some deoptimization and a bit of a drag on startup, but shouldn't be impossible with some time and effort..

cl4es avatar Nov 12 '21 00:11 cl4es

@cl4es Sure, integrating this library or similar code through the Vector API would be easier, but it might require waiting for stabilization. However, it's always possible to port the UTF-8 algorithm to classical JVM intrinsics, as was recently done in Java/util/Base64 by @asgibbons @sviswanathan

LifeIsStrange avatar Nov 18 '21 20:11 LifeIsStrange

A classical JVM intrinsic could work great, yeah, but porting to HotSpot intrinsics will likely take significant effort. I don't have the expertise to do it, and those I know at Oracle who might have it have a rather busy schedule ahead. If someone is keen on getting this under consideration I think a community effort to prototype it might have the best chance to succeed. I'd be happy to help out as best I can with such an effort.

cl4es avatar Nov 18 '21 21:11 cl4es

The base64 codec was obviously a lot of work (@asgibbons would know), but the UTF-8 validator is actually simpler... So it seems like it is within striking distance.

Doing UTF-8 <=> UTF-16 transcoding would be more work.

@LifeIsStrange Fun fact: the Java/util/Base64 code by @asgibbons is based on a routine from our paper Base64 encoding and decoding at almost the speed of a memory copy (with @WojciechMula). It is the same @WojciechMula from the Transcoding Billions of Unicode Characters per Second with SIMD Instructions paper quoted above.

(This got me very excited because I did not expect anyone to care enough about doing base64 at crazy speeds... to do the hard work needed to push it inside the JDK!!! Very exciting.)

lemire avatar Nov 18 '21 21:11 lemire

Maybe the Vector API also got much more stable in the meantime?

JohannesLichtenberger avatar Aug 14 '22 21:08 JohannesLichtenberger