folly
Checksum performance is slow on Arm64
The checksum code in folly is not optimized for Arm64 with NEON, so checksum performance on that platform is quite slow.
./folly/hash/detail/ChecksumDetail.h
CacheLib relies heavily on folly for its checksums.
According to perf top, with CacheLib in a hybrid cache configuration, the checksum consumes a large share of CPU time and has become a bottleneck.
Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
$ checksum_benchmark --bm_min_usec=10000
============================================================================
folly/hash/test/ChecksumBenchmark.cpp           relative  time/iter  iters/s
============================================================================
crc32_512                                                   55.73ns   17.94M
crc32_1024                                                  85.15ns   11.74M
crc32_2048                                                 116.29ns    8.60M
crc32_4096                                                 191.03ns    5.23M
crc32_8192                                                 341.44ns    2.93M
crc32_16384                                                627.76ns    1.59M
crc32_32768                                                  1.21us  827.16K
============================================================================
Comparison:
============================================================================
[...]folly/hash/test/ChecksumBenchmark.cpp      relative  time/iter  iters/s
============================================================================
crc32_512                                                    1.80us  554.82K
crc32_1024                                                   3.58us  279.35K
crc32_2048                                                   7.14us  140.13K
crc32_4096                                                  14.25us   70.18K
crc32_8192                                                  28.47us   35.12K
crc32_16384                                                 56.93us   17.57K
crc32_32768                                                113.83us    8.79K
If checksum is the bottleneck, the first thing I'd recommend is moving away from crc32, which, even fully optimized on x86_64, is less than a quarter of the speed of hash algorithms designed for speed, such as XXH3. XXH3 in particular should be well optimized for AArch64.
It does appear that ARM has equivalent hardware instructions for CRC32 hashing; we just haven't implemented support for them yet since we haven't needed it.
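For anyone who wants to experiment, here is a rough sketch of what using those ARM instructions might look like through the ACLE intrinsics `__crc32d` and `__crc32b` from `<arm_acle.h>`. This is not folly code: the name `crc32_sketch` and its signature are made up for illustration, it assumes a little-endian AArch64 target built with the CRC extension enabled (for example `-march=armv8-a+crc`), and it falls back to a slow bitwise loop elsewhere. It follows the zlib-style convention (invert on entry and exit), which may differ from how folly seeds and finalizes its own crc32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>  // ACLE CRC32 intrinsics (__crc32b, __crc32d)
#endif

// Hypothetical helper, not part of folly. Computes a zlib-style CRC-32
// (reflected polynomial 0xEDB88320) over `len` bytes starting at `data`.
inline uint32_t crc32_sketch(const uint8_t* data, size_t len,
                             uint32_t seed = 0) {
  assert(data != nullptr || len == 0);
  uint32_t crc = ~seed;  // zlib convention: invert on entry
#if defined(__ARM_FEATURE_CRC32)
  // Hardware path: consume 8 bytes per instruction.
  // Assumes a little-endian target, so memcpy yields the expected byte order.
  while (len >= 8) {
    uint64_t chunk;
    std::memcpy(&chunk, data, sizeof(chunk));
    crc = __crc32d(crc, chunk);
    data += 8;
    len -= 8;
  }
  // Remaining tail bytes, one at a time.
  while (len > 0) {
    crc = __crc32b(crc, *data++);
    --len;
  }
#else
  // Portable bitwise fallback (slow; only here so the sketch builds anywhere).
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
#endif
  return ~crc;  // zlib convention: invert on exit
}
```

A real implementation would likely also interleave several independent CRC streams or use polynomial multiply (PMULL) folding to hide instruction latency; this sketch only shows the basic instruction usage.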