Fast OnDemand parsing for Neoverse

Open emcastillo opened this issue 1 year ago • 0 comments

This PR uses the same approach as the x86 code path for OnDemand parsing on ARM. On an NVIDIA Grace CPU this yields a speedup of about 4.9x on the twitter benchmark and about 2.6x on citm_catalog.

We use the simdjson simd8x64 type to obtain a 64-bit mask that lets us operate on 64 characters at a time. Although obtaining the bitmask is expensive and requires several NEON instructions, it lets us process 64 characters per instruction when working on the bitmaps. If we used the shrn instruction instead, we could process only 16 characters per instruction.
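To illustrate the bitmask idea, here is a minimal portable sketch. It is not the code in this PR: it uses no NEON intrinsics, and the names eq_mask64, trailing_zeros and find_all are hypothetical helpers made up for this example. It only shows how a 64-bit mask, with one bit per input character, lets a scanner jump straight to the matching positions of a 64-character block instead of testing every character.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: bit i of the result is set when block[i] == target.
// A SIMD backend would produce the same value with vector compares;
// this scalar loop only emulates the result.
inline uint64_t eq_mask64(const char* block, char target) {
  uint64_t mask = 0;
  for (int i = 0; i < 64; ++i) {
    mask |= static_cast<uint64_t>(block[i] == target) << i;
  }
  return mask;
}

// Index of the lowest set bit. The caller must pass a non-zero mask.
inline int trailing_zeros(uint64_t mask) {
  int n = 0;
  while ((mask & 1) == 0) {
    mask >>= 1;
    ++n;
  }
  return n;
}

// Returns every position of `target` in `text`, scanning 64 characters per step.
std::vector<size_t> find_all(const std::string& text, char target) {
  std::vector<size_t> hits;
  for (size_t base = 0; base < text.size(); base += 64) {
    char block[64] = {0};  // zero-padded copy so a short tail is safe to read
    const size_t len = std::min<size_t>(64, text.size() - base);
    std::memcpy(block, text.data() + base, len);

    uint64_t mask = eq_mask64(block, target);
    if (len < 64) {
      mask &= (uint64_t{1} << len) - 1;  // ignore bits that belong to the padding
    }
    while (mask != 0) {
      hits.push_back(base + trailing_zeros(mask));
      mask &= mask - 1;  // clear the lowest set bit and continue
    }
  }
  return hits;
}
```

In this sketch, a block with no matches costs a single comparison against zero, which is where the benefit of a wide mask comes from. A real implementation would replace eq_mask64 with vector compares and trailing_zeros with a hardware bit-scan.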

This patch also applies this approach in the SVE code path, but implements it with NEON instructions. According to the Neoverse V2 optimization guide, the SVE comparison operation has a latency of 4 cycles and a throughput of 1 instruction per cycle, whereas the NEON instructions have a latency of 2 cycles and a throughput of 4 instructions per cycle.

Benchmark results (build/benchmark/bench --benchmark_filter=SonicOnDema):

Master branch

twitter/SonicOnDemand_Normal           111149 ns       111152 ns         6297 bytes_per_second=2.21522Gi/s Normal
citm_catalog/SonicOnDemand_Fronter      33629 ns        33630 ns        20804 bytes_per_second=47.8316Gi/s Fronter
twitter/SonicOnDemand_NotFound         111161 ns       111165 ns         6298 bytes_per_second=2.21496Gi/s NotFound

This PR

twitter/SonicOnDemand_Normal            22625 ns        22624 ns        30718 bytes_per_second=10.8832Gi/s Normal
citm_catalog/SonicOnDemand_Fronter      12861 ns        12862 ns        54399 bytes_per_second=125.067Gi/s Fronter
twitter/SonicOnDemand_NotFound          22423 ns        22422 ns        31349 bytes_per_second=10.9814Gi/s NotFound

This PR is contributed by NVIDIA

Sep 11 '24 02:09 emcastillo