Fast OnDemand parsing for Neoverse
This PR uses the same approach as x86 for OnDemand parsing on ARM. On an NVIDIA Grace CPU this gives roughly a 5x speedup on the twitter benchmark and roughly 2.6x on citm_catalog.
We use the simdjson simd8x64 type to obtain a 64-bit mask that lets us operate on 64 characters at a time. Although building the bitmask is expensive and requires several NEON instructions, it lets us process 64 characters per instruction once we are working on the bitmaps. If we instead used the shrn instruction, we would only be able to process 16 characters per instruction.
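As an illustration of the bitmask idea above, here is a minimal sketch rather than the code in this PR. The helper names `eq_mask64` and `first_match` are hypothetical. The NEON path follows the pairwise-add reduction commonly used to build such masks, and a scalar fallback is included so the sketch also compiles on non-ARM hosts. `__builtin_ctzll` assumes GCC or Clang.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: bit i of the result is set iff block[i] == c.
// `block` must point to at least 64 readable bytes.
inline uint64_t eq_mask64(const uint8_t* block, uint8_t c) {
#if defined(__aarch64__)
  // Each lane keeps a single bit that encodes its position within its 8-byte group.
  static const uint8_t kBits[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                    1, 2, 4, 8, 16, 32, 64, 128};
  const uint8x16_t bits = vld1q_u8(kBits);
  const uint8x16_t needle = vdupq_n_u8(c);
  // Four 16-byte compares cover the 64-byte block.
  const uint8x16_t m0 = vandq_u8(vceqq_u8(vld1q_u8(block), needle), bits);
  const uint8x16_t m1 = vandq_u8(vceqq_u8(vld1q_u8(block + 16), needle), bits);
  const uint8x16_t m2 = vandq_u8(vceqq_u8(vld1q_u8(block + 32), needle), bits);
  const uint8x16_t m3 = vandq_u8(vceqq_u8(vld1q_u8(block + 48), needle), bits);
  // Three rounds of pairwise adds fold 64 lanes into 8 bytes (64 bits).
  uint8x16_t s0 = vpaddq_u8(m0, m1);
  const uint8x16_t s1 = vpaddq_u8(m2, m3);
  s0 = vpaddq_u8(s0, s1);
  s0 = vpaddq_u8(s0, s0);
  return vgetq_lane_u64(vreinterpretq_u64_u8(s0), 0);
#else
  // Scalar fallback with the same semantics.
  uint64_t mask = 0;
  for (int i = 0; i < 64; ++i) {
    mask |= static_cast<uint64_t>(block[i] == c) << i;
  }
  return mask;
#endif
}

// Hypothetical helper: index of the first byte equal to c, or 64 if none.
// One count-trailing-zeros on the mask answers the question for all 64 bytes.
inline int first_match(const uint8_t* block, uint8_t c) {
  const uint64_t mask = eq_mask64(block, c);
  return mask ? __builtin_ctzll(mask) : 64;
}
```

Once the mask is built, further steps such as finding the next quote or clearing escaped positions become single 64-bit integer operations, which is where the up-front cost of the reduction is recovered.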
This patch also applies this approach in the SVE code, but using NEON instructions. According to the Neoverse V2 optimization guide, the SVE comparison operation has a latency of 4 cycles and a throughput of 1 instruction per cycle, while the NEON equivalent has a latency of 2 cycles and a throughput of 4 instructions per cycle.
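To make the throughput argument concrete, here is a back-of-envelope sketch, not a measurement. It assumes 128-bit vectors for both NEON and SVE, so a 64-byte block needs four compares in either case, and it counts only the cycles needed to issue those compares using the throughput figures quoted above. Latency, the mask reduction, and everything else in the pipeline are ignored. All names are hypothetical.

```cpp
#include <cstdint>

// Cycles needed just to issue n independent ops at a given per-cycle throughput.
constexpr unsigned issue_cycles(unsigned n_ops, unsigned ops_per_cycle) {
  return (n_ops + ops_per_cycle - 1) / ops_per_cycle;  // ceiling division
}

constexpr unsigned kBlockBytes = 64;
constexpr unsigned kVectorBytes = 16;  // assumption: 128-bit vectors for NEON and SVE
constexpr unsigned kCompares = kBlockBytes / kVectorBytes;  // 4 compares per block

// Throughput figures as quoted from the optimization guide in the text above.
constexpr unsigned kNeonIssue = issue_cycles(kCompares, 4);  // NEON: 4 per cycle
constexpr unsigned kSveIssue = issue_cycles(kCompares, 1);   // SVE: 1 per cycle

static_assert(kNeonIssue == 1, "four NEON compares can issue in one cycle");
static_assert(kSveIssue == 4, "four SVE compares need four issue cycles");
```

Under these assumptions the compare step alone costs about four times more issue cycles with SVE than with NEON, which is consistent with the choice to use NEON instructions in the SVE code path.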
Benchmark results, from build/benchmark/bench --benchmark_filter=SonicOnDema:
Master branch
twitter/SonicOnDemand_Normal 111149 ns 111152 ns 6297 bytes_per_second=2.21522Gi/s Normal
citm_catalog/SonicOnDemand_Fronter 33629 ns 33630 ns 20804 bytes_per_second=47.8316Gi/s Fronter
twitter/SonicOnDemand_NotFound 111161 ns 111165 ns 6298 bytes_per_second=2.21496Gi/s NotFound
This PR
twitter/SonicOnDemand_Normal 22625 ns 22624 ns 30718 bytes_per_second=10.8832Gi/s Normal
citm_catalog/SonicOnDemand_Fronter 12861 ns 12862 ns 54399 bytes_per_second=125.067Gi/s Fronter
twitter/SonicOnDemand_NotFound 22423 ns 22422 ns 31349 bytes_per_second=10.9814Gi/s NotFound
This PR is contributed by NVIDIA.