
Accelerate CRC16 with SIMD

xbasel opened this issue 7 months ago

Accelerate CRC16 with NEON and AVX

xbasel · Apr 30 '25

I have found a Rust implementation that we can try to translate to C.

  • https://github.com/TobiasBengtsson/crc-fast-rs/tree/master/crc16-xmodem-fast

It uses carry-less multiplication instructions, and on Arm it somehow uses AES instructions.

It's lengthy so I doubt it's fast for very short input. Keys are usually short, ~20 bytes.
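(Looking closer, the reason "aes" shows up is that on AArch64 the carry-less multiply instruction, PMULL, is grouped under the crypto/AES extension. A minimal sketch of just that building block, assuming a build flag along the lines of -march=armv8-a+crypto; this is not a full CRC16 kernel, and the helper name is mine:)

```c
#include <arm_neon.h>
#include <stdint.h>

/* PMULL: 64x64-bit carry-less multiply producing a 128-bit polynomial product.
 * Available on AArch64 when the crypto/AES extension is enabled
 * (e.g. -march=armv8-a+crypto). */
static inline poly128_t clmul_64x64(uint64_t a, uint64_t b) {
    return vmull_p64((poly64_t)a, (poly64_t)b);
}
```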

zuiderkwast · Apr 30 '25

Apparently, the Rust implementation falls back to the simple old table lookup implementation for strings shorter than 128 bytes. https://github.com/TobiasBengtsson/crc-fast-rs/blob/master/crc-fast-gen/src/lib.rs#L222
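For reference, the short-input baseline being discussed is the classic byte-wise table lookup. A sketch of that shape for CRC16-XMODEM (poly 0x1021, init 0, non-reflected); the real crc16.c ships a precomputed table, which is generated at startup here only to keep the sketch self-contained:

```c
#include <stdint.h>

/* Byte-wise table-lookup CRC16-XMODEM (poly 0x1021, init 0, no reflection). */
static uint16_t crc16tab[256];

static void crc16_init_table(void) {
    for (int i = 0; i < 256; i++) {
        uint16_t c = (uint16_t)(i << 8);
        for (int j = 0; j < 8; j++)
            c = (c & 0x8000) ? (uint16_t)((c << 1) ^ 0x1021) : (uint16_t)(c << 1);
        crc16tab[i] = c;
    }
}

static uint16_t crc16_table(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++)
        crc = (uint16_t)(crc << 8) ^ crc16tab[((crc >> 8) ^ (uint8_t)buf[i]) & 0xFF];
    return crc;
}
```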

I doubt that it's possible to optimize with SIMD for short strings.

@xbasel Do you want to optimize for long keynames? I don't think we should, because it's not a common use case.

zuiderkwast · May 07 '25

These? https://gcc.gnu.org/onlinedocs/gcc/CRC-Builtins.html
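If I read that page correctly, the per-byte update would look roughly like this. These builtins are new in GCC 15, and I haven't verified the exact names or argument order, so treat this as an assumption to check against the linked docs:

```c
#include <stdint.h>

/* Hypothetical use of GCC 15's CRC builtins for CRC16-XMODEM (poly 0x1021).
 * Name and argument order taken from my reading of the linked docs; verify
 * before relying on it. The compiler is supposed to pick an efficient
 * expansion for the target. */
static uint16_t crc16_builtin(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++)
        crc = __builtin_crc16_data8(crc, (uint8_t)buf[i], 0x1021);
    return crc;
}
```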

zuiderkwast · May 11 '25

Hi @xbasel and @zuiderkwast ,

I have implemented NEON acceleration of CRC32-C on ARM (particularly Armv7-A and Armv8-A 32-bit) with both a lookup table and carry-less multiplication (reference: https://github.com/DLTcollab/sse2neon/pull/627/files). If this issue is still active, I would like to handle it (~I will implement the ARM part first~ after reading the code, I think I should propose a full schedule). It would be great if you could assign it to me, if possible.

Cuda-Chen · Jun 28 '25

@Cuda-Chen Key names are typically only 20 bytes long. Is there any performance benefit for this length?

IIUC, these techniques only optimize payloads of length 128 or longer. Has your experience been different?

We should optimize this primarily for short keys, let's say up to 32 bytes. If we add branching that makes it slower for short keys, I don't think it's worth it, but I'd like to see some numbers and also the code complexity before deciding.

zuiderkwast · Jun 28 '25

Hi @zuiderkwast ,

IIUC, these techniques only optimize payloads of length 128 or longer. Has your experience been different?

These techniques optimize by exploiting both SIMD and the cache. With carry-less multiplication there is no need to store a lookup table at all, while the half-byte lookup table in the PR I linked fits entirely into a single L1 cache line (usually 64 bytes). For payloads of at least 128 bytes we can additionally exploit the fact that XOR is associative (you can see section "OPTION 12: 8-byte Hardware-accelerated" of this post), so that:

  1. We split the input message into segments.
  2. We calculate the CRC of each segment (this lets us exploit ILP, or Instruction Level Parallelism, on modern CPUs).
  3. We combine the per-segment CRCs to get the CRC of the whole message (a sketch follows below).
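A sketch of these steps for CRC16-XMODEM, just to make the associativity concrete. Here crc16() stands for the existing table-lookup function, and the bit-by-bit crc16_shift() is only illustrative; a real implementation would precompute x^(8k) mod P constants, apply them with carry-less multiplies, and interleave the segment loops in a single pass so the CPU can actually overlap them:

```c
#include <stdint.h>
#include <stddef.h>

/* CRC16-XMODEM is linear over GF(2):
 *   CRC(A || B) = shift(CRC(A), len(B)) ^ CRC(B),
 * where shift() multiplies the remainder by x^(8*len) modulo the polynomial. */
extern uint16_t crc16(const char *buf, int len); /* existing table-lookup CRC */

static uint16_t crc16_shift(uint16_t crc, size_t nbytes) {
    /* Naive multiply-by-x loop, for illustration only. */
    for (size_t i = 0; i < nbytes * 8; i++)
        crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
    return crc;
}

static uint16_t crc16_combine(uint16_t crc_a, uint16_t crc_b, size_t len_b) {
    return crc16_shift(crc_a, len_b) ^ crc_b;
}

/* Split into two segments, CRC them independently, then combine. */
static uint16_t crc16_2way(const char *buf, size_t len) {
    size_t half = len / 2;
    uint16_t a = crc16(buf, (int)half);
    uint16_t b = crc16(buf + half, (int)(len - half));
    return crc16_combine(a, b, len - half);
}
```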

Key names are typically only 20 bytes long. Is there any performance benefit for this length?

At first glance, the table in crc16.c needs 512 bytes (256 uint16_t entries of 2 bytes each), which obviously cannot fit into a single 64-byte cache line. Though it requires more operations and introduces instruction dependencies in the calculation, I think we can first try to reduce the lookup table size to 64 bytes.
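A sketch of that direction: a 16-entry (32-byte) half-byte table so the whole table sits in one cache line, at the cost of two lookups per input byte. Whether this actually wins on short keys needs benchmarking:

```c
#include <stdint.h>

/* Half-byte (nibble) table for CRC16-XMODEM: 16 x uint16_t = 32 bytes,
 * which fits in a single 64-byte cache line, vs. 512 bytes for the
 * byte-wise table. Each input byte now costs two lookups. */
static uint16_t crc16tab4[16];

static void crc16_nibble_init(void) {
    for (int i = 0; i < 16; i++) {
        uint16_t c = (uint16_t)(i << 12);
        for (int j = 0; j < 4; j++)
            c = (c & 0x8000) ? (uint16_t)((c << 1) ^ 0x1021) : (uint16_t)(c << 1);
        crc16tab4[i] = c;
    }
}

static uint16_t crc16_nibble(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++) {
        uint8_t b = (uint8_t)buf[i];
        crc = (uint16_t)(crc << 4) ^ crc16tab4[((crc >> 12) ^ (b >> 4)) & 0xF];
        crc = (uint16_t)(crc << 4) ^ crc16tab4[((crc >> 12) ^ (b & 0xF)) & 0xF];
    }
    return crc;
}
```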

Cuda-Chen · Jun 29 '25

Hi @zuiderkwast ,

Thanks for assigning this issue to me. I will list the tasks to make the progress easier to track. Before that, I would like to ask about the goals of this issue:

  1. Which SIMD instruction sets are we going to support?
  2. Which CPU architectures are we going to support? Given the fragmented state of the ARM instruction sets (I doubt anyone will run this project on an Armv7-A server, for instance), I suggest we first limit the list of supported CPU architectures, especially for ARM.

Cuda-Chen · Jun 29 '25

The main focus is modern amd64 (any SSE or AVX) and aarch64 (e.g. NEON), with GCC and Clang as compilers. We can even start with just one of them to keep it small.

Other platforms are not important to optimize, but let's keep a fallback to standard C so it still runs on all systems including old 32-bit, s390x, RISC-V and who knows what.
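A sketch of what that layering could look like with compile-time dispatch. The function names (crc16_clmul, crc16_neon, crc16_table) are hypothetical, and runtime detection (CPUID / getauxval) would work too if we prefer a single binary that adapts to the host:

```c
#include <stdint.h>

/* Hypothetical per-target dispatch; the fallback path is plain portable C.
 * Feature macros are assumed from GCC/Clang conventions; verify per compiler. */
extern uint16_t crc16_clmul(const char *buf, int len);
extern uint16_t crc16_neon(const char *buf, int len);
extern uint16_t crc16_table(const char *buf, int len);

uint16_t crc16_dispatch(const char *buf, int len) {
#if defined(__x86_64__) && defined(__PCLMUL__)
    return crc16_clmul(buf, len);   /* amd64: SSE/AVX + PCLMULQDQ path */
#elif defined(__aarch64__) && defined(__ARM_FEATURE_AES)
    return crc16_neon(buf, len);    /* aarch64: NEON + PMULL path */
#else
    return crc16_table(buf, len);   /* standard C fallback (32-bit, s390x, RISC-V, ...) */
#endif
}
```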

zuiderkwast · Jun 29 '25

Thanks for the help, @zuiderkwast. Here are the tasks for this issue:

  • [ ] Optimize CRC16 without any certain SIMD framework
    • Add unit tests as well.
    • ~Add compiler tests (GCC and Clang)~.
    • Some kind of SWAR or LUT method may be acceptable.
      • Update: the half-byte LUT with prefetching is less efficient (about a 5% regression in both RPS and latency).
  • [ ] Optimize CRC16 on amd64
    • TBD: SSE, AVX (AVX, AVX2, AVX-512)
  • [ ] Optimize CRC16 on aarch64
    • NEON

Cuda-Chen · Jun 29 '25

Sounds good. Maybe a smaller LUT is the most promising idea for small payloads?

We already run tests in CI and build with GCC/Clang and on various platforms.

zuiderkwast · Jun 29 '25

Hi @zuiderkwast ,

Maybe a smaller LUT is the most promising idea for small payloads?

To my current understanding, a LUT is the most promising idea for fixed-length, small payloads.

We already run tests in CI and build with GCC/Clang and on various platforms.

Thanks for the reminder, I will strike out this item.

Cuda-Chen · Jun 30 '25