
Better AArch64 performance

Open · moscicky opened this issue on Jul 21 '23 · 2 comments

Even though JEP 438 states that both x64 and AArch64 architectures should benefit from the new Vector API, the current performance of simdjson-java on an M1 Mac is far worse than that of other parsers:

Benchmark                                                                   Mode  Cnt     Score    Error  Units
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson        thrpt    5  1229.991 ± 39.538  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson         thrpt    5  1099.877 ±  9.560  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter        thrpt    5   607.902 ± 10.469  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala  thrpt    5  1930.694 ± 41.766  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson        thrpt    5    26.287 ±  0.295  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded  thrpt    5    26.516 ±  0.686  ops/s

This may be due to the use of 256-bit vectors. I have found a thread which states that:

on AArch64 NEON, the max hardware vector size is 128 bits. So for 256-bits, we are not able to intrinsify to use SIMD directly, which will fall back to Java implementation of those APIs

When running the benchmark with '-XX:+UnlockDiagnosticVMOptions' and '-XX:+PrintIntrinsics', the following output can be observed, which supports this theory:

** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
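
For reference, these flags can be passed to the forked benchmark JVM through JMH's Runner API. Below is a minimal sketch, assuming the benchmarks are run via JMH; the class name and include pattern are illustrative, not part of the repository:

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class PrintIntrinsicsRunner {
        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    // run only the benchmarks shown above (pattern is illustrative)
                    .include("ParseAndSelectBenchmark")
                    // report which Vector API calls the JVM fails to intrinsify
                    .jvmArgsAppend("-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintIntrinsics")
                    .build();
            new Runner(options).run();
        }
    }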

Obviously, AArch64 support is not as important as x64 support, but it may be worthwhile to make the implementation flexible enough to support both architectures. Perhaps the C++ implementation could be used as a reference.

Anyway, great work so far on the Java port; the results on x64 are very impressive!

moscicky avatar Jul 21 '23 19:07 moscicky

Thanks for researching that!

Would you mind adding and running the following test:

    // requires: import jdk.incubator.vector.ByteVector;
    @Test
    public void printPreferableSpecies() {
        System.out.println(ByteVector.SPECIES_PREFERRED);
    }

?

It would tell us what the preferred vector length is for your machine. Unfortunately, it's challenging to simply replace ByteVector.SPECIES_256 with ByteVector.SPECIES_PREFERRED in the library. In several places we perform bitwise operations where it's easier to know the vector length upfront, to avoid, for example, using extra masks to extract the relevant part of a long.
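
To illustrate the kind of code this refers to, here is a hypothetical, simplified sketch (not the library's actual code): with a fixed 256-bit species, two 32-lane comparison masks fill a 64-bit long exactly, so the packing can hard-code a single shift, whereas with ByteVector.SPECIES_PREFERRED the lane count is unknown upfront and the number of chunks and shift amounts would have to be parameterized or masked.

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorOperators;

    class FixedLengthExample {
        // Hypothetical, simplified example of length-dependent bit packing.
        // With SPECIES_256 (32 lanes), two chunks yield exactly 64 mask bits,
        // so they fit a single long with a hard-coded shift and no extra masks.
        static long quoteBits(byte[] buffer, int offset) {
            ByteVector chunk0 = ByteVector.fromArray(ByteVector.SPECIES_256, buffer, offset);
            ByteVector chunk1 = ByteVector.fromArray(ByteVector.SPECIES_256, buffer, offset + 32);
            return chunk0.compare(VectorOperators.EQ, (byte) '"').toLong()
                    | (chunk1.compare(VectorOperators.EQ, (byte) '"').toLong() << 32);
        }
        // With SPECIES_PREFERRED, the lane count (and therefore the number of
        // chunks per 64-byte block and the shift amounts) differs per machine.
    }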

In my opinion, to support different vector lengths, we would need to provide a dedicated implementation for each length and then, based on ByteVector.SPECIES_PREFERRED, pick the one that is best for the machine the library runs on.
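
A rough sketch of what that dispatch could look like (the type names are made up for illustration; none of this is an existing API of the library):

    import jdk.incubator.vector.ByteVector;

    public class IndexerDispatch {

        interface StructuralIndexer {
            void index(byte[] buffer, int length);
        }

        // Stubs standing in for dedicated 256-bit and 128-bit implementations.
        static final class Indexer256 implements StructuralIndexer {
            public void index(byte[] buffer, int length) { /* 256-bit code path */ }
        }

        static final class Indexer128 implements StructuralIndexer {
            public void index(byte[] buffer, int length) { /* 128-bit (NEON) code path */ }
        }

        // Chosen once, based on the preferred species of the machine we run on.
        static StructuralIndexer create() {
            return ByteVector.SPECIES_PREFERRED.vectorBitSize() >= 256
                    ? new Indexer256()
                    : new Indexer128();
        }
    }

The selection cost would be paid once at startup, so each hot path could keep working with a fixed, statically known vector length.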

piotrrzysko avatar Jul 22 '23 11:07 piotrrzysko

No problem, the output is:

Species[byte, 16, S_128_BIT]

The approach you suggest sounds reasonable 👍🏻

moscicky avatar Jul 22 '23 12:07 moscicky