valkey icon indicating copy to clipboard operation
valkey copied to clipboard

Optimize CRC16 using multi-byte LUT

Open Cuda-Chen opened this issue 1 month ago • 3 comments

Optimize CRC16 using multi-byte LUT.

Cuda-Chen avatar Oct 31 '25 12:10 Cuda-Chen

Hi @zuiderkwast ,

I would like to share some benchmark results of this PR. If we are going to merge this PR, there are at least two works waiting for finishing:

  • jump-table for reduction
  • code tidy (e.g., using macro)

Environment

  • CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
  • OS: Linux 5.15.0-157-generic
  • For all "reduction", it reduces from the target length to 1-byte (e.g., 8-byte: 8 -> 4 -> 2 -> 1)

Benchmark Results

baseline (commit f54818cc60597e9fe5dc03a52fd39ab944cd4932 on unstable branch

# -t get,set
SET: 292902.97 requests per second, p50=1.431 msec                    
GET: 350557.41 requests per second, p50=1.199 msec

# -t get
GET: 454277.00 requests per second, p50=0.975 msec

2-byte reduction

# -t set,get
SET: 321502.06 requests per second, p50=1.351 msec                    
GET: 387236.69 requests per second, p50=1.151 msec

# -t get
GET: 453309.16 requests per second, p50=0.959 msec

4-byte reduction

# -t set,get
SET: 324191.12 requests per second, p50=1.335 msec                    
GET: 376364.31 requests per second, p50=1.135 msec

# -t get
GET: 470167.88 requests per second, p50=0.943 msec

8-byte reduction

# -t set,get
SET: 319550.06 requests per second, p50=1.375 msec                    
GET: 384349.31 requests per second, p50=1.143 msec

# -t get
GET: 469329.34 requests per second, p50=0.959 msec

16-byte reduction

# -t set,get
SET: 300210.12 requests per second, p50=1.431 msec                    
GET: 391880.25 requests per second, p50=1.143 msec 

# -t get
GET: 443066.03 requests per second, p50=0.943 msec

Cuda-Chen avatar Oct 31 '25 12:10 Cuda-Chen

Interesting! But with a larger LUT, when it is loaded into L1 cache, other stuff are evicted that might be needed later, so it may depend on use case if it's actually faster in all cases.

How much memory would the 2-byte reduction variant use? The current draft is for the 16-byte reduction, right?

zuiderkwast avatar Nov 06 '25 17:11 zuiderkwast

Hi @zuiderkwast ,

How much memory would the 2-byte reduction variant use?

It will use 1 KB memory for LUT (256 entries * 2 bytes of each entry * 2 times for 2-byte lookup at the same time = 1024 bytes = 1 KB).

The current draft is for the 16-byte reduction, right?

Yes, the current draft uses 16-byte reduction.

Cuda-Chen avatar Nov 07 '25 03:11 Cuda-Chen