Optimize crc32 & crc32c on NVIDIA Grace
This pull request adds hardware-accelerated routines for CRC32 and CRC32C on Arm AArch64 CPUs. The changes have been tested on NVIDIA Grace. In detail, it contains:
- Routines for computing CRC32 and CRC32C checksums over a buffer using the CRC intrinsics. On Grace/Neoverse V2, this can process 8 bytes/cycle.
- A vectorized implementation of the `gf_multiply_crc32c_hw` and `gf_multiply_crc32_hw` functions used in the routines that merge partial CRC checksums. These functions are more or less a 1:1 translation of the x86 vectorized routines.
- I've introduced feature flags for the AES and SHA extensions on Arm CPUs. The feature checks for the vectorized functions are a bit messier than on x86 because CPUs can implement a subset of these extensions.
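For readers unfamiliar with the CRC intrinsics mentioned in the first bullet, here is a minimal sketch of the idea. It is not the code from this pull request. The function names `crc32c_update_sketch` and `crc32c_checksum_sketch` are hypothetical, and the sketch assumes a little-endian target and the standard CRC32C convention (initial value `0xFFFFFFFF`, final bitwise inversion). When compiled for AArch64 with the CRC extension it uses the ACLE intrinsics `__crc32cd` and `__crc32cb`; elsewhere it falls back to a slow bitwise loop so it still builds and runs.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__aarch64__) && defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#endif

// Hypothetical helper (not folly's API): feeds `len` bytes into a running
// CRC32C state. The caller owns the initial value and the final inversion.
inline uint32_t crc32c_update_sketch(
    uint32_t crc, const uint8_t* data, size_t len) {
#if defined(__aarch64__) && defined(__ARM_FEATURE_CRC32)
  // Hardware path: consume 8 bytes per CRC32CX instruction.
  // Assumes a little-endian target.
  while (len >= 8) {
    uint64_t word;
    std::memcpy(&word, data, sizeof(word)); // safe unaligned load
    crc = __crc32cd(crc, word);
    data += 8;
    len -= 8;
  }
  // Tail: one byte at a time.
  while (len--) {
    crc = __crc32cb(crc, *data++);
  }
#else
  // Portable bitwise fallback using the reflected CRC32C polynomial.
  while (len--) {
    crc ^= *data++;
    for (int k = 0; k < 8; ++k) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
#endif
  return crc;
}

// Hypothetical convenience wrapper: one-shot CRC32C of a buffer.
inline uint32_t crc32c_checksum_sketch(const uint8_t* data, size_t len) {
  return ~crc32c_update_sketch(0xFFFFFFFFu, data, len);
}
```

Calling `crc32c_checksum_sketch` on the nine ASCII bytes `"123456789"` should give `0xE3069283`, the standard CRC32C check value. A single dependent chain like this is limited by the latency of the CRC instruction, so reaching the quoted 8 bytes/cycle would presumably need several independent streams whose partial checksums are merged afterwards, which is where the `gf_multiply_*` routines come in.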
This should resolve issue #2027.
@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks for the review! I forgot to add that this should be compiled with the flags
python3 build/fbcode_builder/getdeps.py --allow-system-packages build --extra-cmake-defines '{"CMAKE_CXX_FLAGS": "-march=armv8.5-a+crc+crypto"}'
or similar (+crypto could be replaced by +aes+sha2?) to enable all required features.
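To check which of these extensions a given set of `-march` flags actually enables, one can inspect the ACLE feature macros at compile time. The sketch below is illustrative only and is not the feature-flag code added in this pull request. The constant and function names are hypothetical. It also reports only what the compiler was told to target, not what the CPU supports at run time.

```cpp
#include <string>

// Each constant mirrors one ACLE feature macro. Names are hypothetical.
#if defined(__ARM_FEATURE_CRC32)
constexpr bool kCompiledWithCrc32 = true;
#else
constexpr bool kCompiledWithCrc32 = false;
#endif

#if defined(__ARM_FEATURE_AES)
constexpr bool kCompiledWithAes = true;
#else
constexpr bool kCompiledWithAes = false;
#endif

#if defined(__ARM_FEATURE_SHA2)
constexpr bool kCompiledWithSha2 = true;
#else
constexpr bool kCompiledWithSha2 = false;
#endif

// Returns a one-line summary such as "crc32=1 aes=1 sha2=0".
inline std::string describeCompiledArmFeatures() {
  std::string out;
  out += "crc32=";
  out += kCompiledWithCrc32 ? "1" : "0";
  out += " aes=";
  out += kCompiledWithAes ? "1" : "0";
  out += " sha2=";
  out += kCompiledWithSha2 ? "1" : "0";
  return out;
}
```

Printing the result of `describeCompiledArmFeatures()` from a small test program built with the same `CMAKE_CXX_FLAGS` is a quick way to see whether `+crypto` and `+aes+sha2` behave the same on a particular compiler.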
@krenzland Hey, after internal discussions, we would like to ask you to move your contributions under folly/external/nvidia-crc32 so that the copyright lines are more clearly defined. You can still define them in the folly namespace.
AFAIU, CMake should automatically pick them up, as we have auto_source with recurse.
Thanks in advance
@r1mikey merged this pull request in facebook/folly@8fc0e33470c2611973faa5f3abe8e6bc9845aaab.