simdjson-java
Support SPECIES_128
Should help with #9, though the performance is still kind of low (half of what jsoniter shows).
Thanks for the contribution!
I'm a bit busy right now, working on a feature for the parser that I'll hopefully finish this month, so I can't promise when I'll be able to look at your PR, but I'll definitely do so. I believe that the most important thing is to make sure that this change doesn't affect the most common cases (256-bit and 512-bit registers).
I've run the benchmarks on a machine with a Neoverse-N1 CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid
and the results are indeed unsatisfactory:
Benchmark Mode Cnt Score Error Units
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson thrpt 5 436.897 ± 1.512 ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson thrpt 5 380.908 ± 0.816 ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson thrpt 5 197.846 ± 0.894 ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded thrpt 5 199.902 ± 0.545 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson thrpt 5 626.115 ± 1.175 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson thrpt 5 463.471 ± 0.881 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala thrpt 5 871.302 ± 4.688 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson thrpt 5 213.725 ± 0.452 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded thrpt 5 216.995 ± 0.329 ops/s
I'd like to understand where the disparity between 256/512-bit and 128-bit vectors comes from (see the results in the README for Intel CPUs). Currently, I don't have the capacity to investigate this. Would you like to do it, or would you like me to come back to it when I have time?
> I'd like to understand where the disparity between 256/512-bit and 128-bit vectors comes from
The way I've implemented this feature for 128-bit vectors is not the same as the arm64 implementation in the original repo. They take a slightly different approach there, but I don't think we need that level of detail here anyway.
I think your code looks good. By the disparity between 256/512-bit and 128-bit vectors I meant the difference in performance. As you can see in the README, for the (SchemaBased)ParseAndSelectBenchmark simdjson-java is typically 3-4 times faster than other libraries. However, based on the results I shared in my previous comment, it appears that for 128-bit vectors the performance doesn't even match that of other libraries. I'm curious about the root cause of this difference. Could it simply be due to narrower registers? Or perhaps there's something else we're missing?
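To put the "narrower registers" hypothesis in concrete terms, here is a minimal sketch of the arithmetic. It assumes the input is consumed in 64-byte blocks, as the original simdjson does; the class and method names are hypothetical and not part of simdjson-java.

```java
public class VectorsPerBlock {
    // Assumption: input is processed in 64-byte blocks, as in the original simdjson.
    static final int BLOCK_BYTES = 64;

    // Number of byte lanes in a vector register of the given width.
    static int lanes(int vectorBits) {
        return vectorBits / 8;
    }

    // Number of vector loads needed to cover one block.
    static int vectorsPerBlock(int vectorBits) {
        return BLOCK_BYTES / lanes(vectorBits);
    }

    public static void main(String[] args) {
        for (int bits : new int[] {128, 256, 512}) {
            System.out.printf("%d-bit: %d byte lanes, %d vectors per %d-byte block%n",
                    bits, lanes(bits), vectorsPerBlock(bits), BLOCK_BYTES);
        }
    }
}
```

Under that assumption a 128-bit species needs four vector loads per block where a 256-bit species needs two and a 512-bit species needs one, so some slowdown from width alone is plausible. Whether that accounts for the whole gap is exactly the open question.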
That's interesting. My MacBook with an M1 Max gives me different results:
Benchmark Mode Cnt Score Error Units
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson thrpt 5 1874.904 ± 8.548 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson thrpt 5 1044.073 ± 39.591 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala thrpt 5 2153.209 ± 22.102 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson thrpt 5 1120.909 ± 16.372 ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded thrpt 5 1131.995 ± 42.193 ops/s
It's still bad, but nowhere near that bad.
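For reference, the relative throughput can be computed directly from the SchemaBasedParseAndSelectBenchmark results posted in this thread. This is only a sketch of the arithmetic; the class and method names are made up for illustration, and the scores are copied from the two result listings above.

```java
public class RelativeThroughput {
    // Ratio of two JMH throughput scores (ops/s).
    static double ratio(double score, double baseline) {
        return score / baseline;
    }

    public static void main(String[] args) {
        // SchemaBasedParseAndSelectBenchmark scores (ops/s) copied from this thread.
        double n1Simdjson = 213.725, n1Jsoniter = 871.302;   // Neoverse-N1
        double m1Simdjson = 1120.909, m1Jsoniter = 2153.209; // M1 Max

        System.out.printf("Neoverse-N1: simdjson / jsoniter_scala = %.2f%n",
                ratio(n1Simdjson, n1Jsoniter));
        System.out.printf("M1 Max:      simdjson / jsoniter_scala = %.2f%n",
                ratio(m1Simdjson, m1Jsoniter));
    }
}
```

This gives roughly 0.25 on the Neoverse-N1 and roughly 0.52 on the M1 Max, the latter matching the "half of what jsoniter shows" estimate from the PR description.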