simdjson-java
Better AArch64 performance
Even though JEP 438 states that both the x64 and AArch64 architectures should benefit from the new Vector API, the performance of simdjson-java on an M1 Mac is currently far worse than that of other parsers:
```
Benchmark                                                                       Mode  Cnt     Score    Error  Units
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson            thrpt    5  1229.991 ± 39.538  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson             thrpt    5  1099.877 ±  9.560  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter            thrpt    5   607.902 ± 10.469  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala      thrpt    5  1930.694 ± 41.766  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson            thrpt    5    26.287 ±  0.295  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded      thrpt    5    26.516 ±  0.686  ops/s
```
This may be due to the use of 256-bit vectors. I have found a thread which states that:

> on AArch64 NEON, the max hardware vector size is 128 bits. So for 256 bits, we are not able to intrinsify to use SIMD directly, which will fall back to the Java implementation of those APIs
When running the benchmark with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics`, the following output can be observed, supporting this theory:
```
** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
```
Obviously, AArch64 support is not as important as x64, but it may be worthwhile to make the implementation flexible enough to support both architectures. Perhaps the C++ implementation can be used as a reference.
Anyway, great work so far on the Java port, the results on x64 are very impressive!
Thanks for researching that!
Would you mind adding and running the following test?

```java
@Test
public void printPreferableSpecies() {
    System.out.println(ByteVector.SPECIES_PREFERRED);
}
```
It would tell us what the preferred vector length is for your machine. Unfortunately, it's challenging to simply replace `ByteVector.SPECIES_256` with `ByteVector.SPECIES_PREFERRED` in the library. In several places we have to perform bitwise operations where knowing the length upfront makes things easier, for example by avoiding extra masks to extract the relevant part of a long.
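To illustrate the bit-manipulation point, here is a hedged sketch (the class and method names are made up, not the library's actual code): with 256-bit vectors, a 64-byte block yields two 32-bit comparison masks, while with 128-bit vectors the same block yields four 16-bit masks, so assembling the combined 64-bit bitmap requires different shift/mask logic for each vector length.

```java
// Hypothetical sketch: assembling a 64-bit match bitmap for a 64-byte block
// from per-vector comparison masks, at two different vector lengths.
final class MaskAssembly {

    // 256-bit vectors: two comparisons cover 64 bytes, each yielding 32 mask bits.
    static long fromTwo256BitMasks(long low, long high) {
        return (low & 0xFFFFFFFFL) | ((high & 0xFFFFFFFFL) << 32);
    }

    // 128-bit vectors: four comparisons are needed for the same 64 bytes,
    // each yielding only 16 mask bits, so more shifting and masking is required.
    static long fromFour128BitMasks(long m0, long m1, long m2, long m3) {
        return (m0 & 0xFFFFL)
             | ((m1 & 0xFFFFL) << 16)
             | ((m2 & 0xFFFFL) << 32)
             | ((m3 & 0xFFFFL) << 48);
    }
}
```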
In my opinion, to support different vector lengths, we would need to provide dedicated implementations for each length and then, based on `ByteVector.SPECIES_PREFERRED`, pick the one that best suits the machine the library runs on.
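A minimal sketch of that dispatch idea (the class names are hypothetical; in the real library the bit size would come from `ByteVector.SPECIES_PREFERRED.vectorBitSize()`, passed in here so the snippet has no incubator-module dependency):

```java
// Hypothetical sketch: selecting a length-specific indexer implementation
// based on the vector bit size preferred by the current machine.
final class IndexerDispatch {

    interface StructuralIndexer {
        int vectorBitSize();
    }

    // Placeholder implementations; the real ones would contain the
    // 256-bit and 128-bit indexing loops, respectively.
    static final class Indexer256 implements StructuralIndexer {
        public int vectorBitSize() { return 256; }
    }

    static final class Indexer128 implements StructuralIndexer {
        public int vectorBitSize() { return 128; }
    }

    // preferredBits would be ByteVector.SPECIES_PREFERRED.vectorBitSize():
    // typically 256 on AVX2-class x64, 128 on NEON (e.g. Apple M1).
    static StructuralIndexer pick(int preferredBits) {
        return preferredBits >= 256 ? new Indexer256() : new Indexer128();
    }
}
```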
No problem, the output is:

```
Species[byte, 16, S_128_BIT]
```
The approach you suggest sounds reasonable 👍🏻