simdjson-java

Support SPECIES_128

Open Squiry opened this issue 1 year ago • 7 comments
Should help with #9, the performance is still kind of low though (half of what jsoniter shows)

Squiry avatar Feb 29 '24 14:02 Squiry

Thanks for the contribution!

I'm a bit busy right now, working on a feature for the parser that I'll hopefully finish this month, so I can't promise when I'll be able to look at your PR, but I'll definitely do so. I believe that the most important thing is to make sure that this change doesn't affect the most common cases (256-bit and 512-bit registers).

piotrrzysko avatar Mar 03 '24 15:03 piotrrzysko

I've run the benchmarks on a machine with a Neoverse-N1 CPU:

Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                ARM
  Model name:             Neoverse-N1
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             r3p1
    BogoMIPS:             243.75
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid 

and the results are indeed unsatisfactory:

Benchmark                                                                              Mode  Cnt    Score   Error  Units
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson                   thrpt    5  436.897 ± 1.512  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson                    thrpt    5  380.908 ± 0.816  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson                   thrpt    5  197.846 ± 0.894  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded             thrpt    5  199.902 ± 0.545  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson        thrpt    5  626.115 ± 1.175  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson         thrpt    5  463.471 ± 0.881  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala  thrpt    5  871.302 ± 4.688  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson        thrpt    5  213.725 ± 0.452  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded  thrpt    5  216.995 ± 0.329  ops/s

I'd like to understand where the disparity between 256/512-bit and 128-bit vectors comes from (see the results in the README for Intel CPUs). Currently, I don't have the time to investigate this. Would you like to do it, or would you like me to come back to it when I have time?

piotrrzysko avatar Apr 30 '24 05:04 piotrrzysko

> I'd like to understand where the disparity between 256/512-bit and 128-bit vectors comes from

The way I've implemented this feature for 128-bit vectors is not the same as the arm64 implementation in the original repo. They take a slightly different approach there, but I don't think we need that level of detail here anyway.

Squiry avatar May 02 '24 15:05 Squiry

I think your code looks good. By the disparity between 256/512-bit and 128-bit vectors I meant the difference in performance. As you can see in the README, for the (SchemaBased)ParseAndSelectBenchmark simdjson-java is typically 3-4 times faster than other libraries. However, based on the results I shared in my previous comment, it appears that with 128-bit vectors the performance doesn't even match that of other libraries. I'm curious about the root cause of this difference. Could it simply be due to narrower registers? Or perhaps there's something else we're missing?

piotrrzysko avatar May 05 '24 16:05 piotrrzysko
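
To put the "narrower registers" hypothesis into numbers, here is a minimal sketch. It assumes the parser scans its input in 64-byte blocks, as the original simdjson does; whether simdjson-java does exactly the same is an assumption here, and the class and method names are hypothetical. It only counts how many vector loads one block needs at each register width and says nothing about the cost of each instruction.

```java
public class LoadsPerBlock {
    // Assumption: input is scanned in 64-byte blocks, as in the original simdjson.
    static final int BLOCK_BYTES = 64;

    // Number of vector loads needed to cover one block at the given register width.
    static int loadsPerBlock(int vectorBits) {
        int vectorBytes = vectorBits / 8;
        return BLOCK_BYTES / vectorBytes;
    }

    public static void main(String[] args) {
        // Register widths discussed in this thread: 128, 256 and 512 bits.
        for (int bits : new int[] {128, 256, 512}) {
            System.out.println(bits + "-bit: " + loadsPerBlock(bits) + " loads per block");
        }
    }
}
```

Under that assumption a 128-bit species needs twice as many loads per block as a 256-bit one, which on its own may not explain falling behind the other libraries.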

That's interesting. My MacBook with an M1 Max gives me a different result:

Benchmark                                                                              Mode  Cnt     Score    Error  Units
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson        thrpt    5  1874.904 ±  8.548  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson         thrpt    5  1044.073 ± 39.591  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala  thrpt    5  2153.209 ± 22.102  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson        thrpt    5  1120.909 ± 16.372  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded  thrpt    5  1131.995 ± 42.193  ops/s

It's still bad, but nowhere near that bad.

Squiry avatar May 05 '24 22:05 Squiry
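
For a rough side-by-side of the two machines, the sketch below divides the simdjson score by the jsoniter_scala score for SchemaBasedParseAndSelectBenchmark, using the numbers copied from the two result tables above. The class and method names are hypothetical, and the ratio ignores the reported error margins.

```java
public class ThroughputRatio {
    // Ratio of simdjson throughput to a baseline throughput (both in ops/s).
    static double ratio(double simdjsonOps, double baselineOps) {
        return simdjsonOps / baselineOps;
    }

    public static void main(String[] args) {
        // SchemaBasedParseAndSelectBenchmark scores copied from the tables above.
        double n1 = ratio(213.725, 871.302);    // Neoverse-N1: simdjson vs jsoniter_scala
        double m1 = ratio(1120.909, 2153.209);  // M1 Max: simdjson vs jsoniter_scala
        System.out.printf("Neoverse-N1: %.2f%n", n1);
        System.out.printf("M1 Max:      %.2f%n", m1);
    }
}
```

On these figures simdjson reaches roughly a quarter of jsoniter_scala's throughput on the Neoverse-N1 and roughly half on the M1 Max.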