Allocate pinned buffer for vectorized code

Open bitfaster opened this issue 1 year ago • 2 comments

The current AVX2 vectorized code doesn't have much of an advantage on .NET 8 and .NET 9. We can gain some speed by allocating the table on the pinned object heap (introduced in .NET 5) and eliminating the fixed statement. With a fixed address, we can also use a trick to align the data to 32 bytes, which is best for AVX instructions.
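A minimal sketch of the allocation idea, assuming the table is a `long[]`. The `PinnedAlignedTable` type and its members are hypothetical names for illustration, not the library's actual code, and the project would need `AllowUnsafeBlocks` enabled:

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical helper for illustration only.
public sealed unsafe class PinnedAlignedTable
{
    private const int Alignment = 32; // bytes, the width of one 256-bit AVX2 vector

    private readonly long[] buffer;   // reference that keeps the pinned array alive
    private readonly long* aligned;   // first 32-byte aligned element inside buffer

    public PinnedAlignedTable(int length)
    {
        // Over-allocate by one alignment unit so an aligned start always fits.
        int pad = Alignment / sizeof(long);
        buffer = GC.AllocateArray<long>(length + pad, pinned: true);

        // The array lives on the pinned object heap, so its address never changes.
        long address = (long)Unsafe.AsPointer(ref buffer[0]);

        // Round the address up to the next multiple of 32.
        aligned = (long*)((address + (Alignment - 1)) & ~(long)(Alignment - 1));
        Length = length;
    }

    public int Length { get; }

    public long* Pointer => aligned;

    public bool IsAligned => ((long)aligned % Alignment) == 0;
}
```

Because the pointer is computed once in the constructor, the hot path can index through `Pointer` without a fixed statement, and aligned loads such as `Avx.LoadAlignedVector256` become possible at offsets that are multiples of four `long` elements.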

Using fixed produces a pinned local variable in IL; the runtime overhead comes from the JITted code. Explanation here.
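To make the contrast concrete, here is a small sketch of the two access patterns. `FixedVersusPinned` and its methods are hypothetical names, and the four-element sum only stands in for the real sketch logic:

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical comparison for illustration only.
public static unsafe class FixedVersusPinned
{
    // Pins on every call: the fixed statement becomes a pinned local in IL.
    public static long SumWithFixed(long[] table, int start)
    {
        fixed (long* p = table)
        {
            return p[start] + p[start + 1] + p[start + 2] + p[start + 3];
        }
    }

    // No pinning here: the caller obtained the pointer once from a pinned array.
    public static long SumWithPointer(long* table, int start)
    {
        return table[start] + table[start + 1] + table[start + 2] + table[start + 3];
    }

    // Runs both paths over the same data and returns the sum if they agree, else -1.
    public static long Demo()
    {
        long[] table = GC.AllocateArray<long>(16, pinned: true);
        for (int i = 0; i < table.Length; i++)
        {
            table[i] = i;
        }

        // Safe to cache because the array was allocated on the pinned object heap.
        long* pointer = (long*)Unsafe.AsPointer(ref table[0]);

        long viaFixed = SumWithFixed(table, 4);
        long viaPointer = SumWithPointer(pointer, 4);

        GC.KeepAlive(table); // the array must outlive every use of the raw pointer
        return viaFixed == viaPointer ? viaFixed : -1;
    }
}
```

Both methods read the same elements; the only difference is whether the method itself has to pin the array before touching it.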

This has a greater impact on the increment code path, and when the table is smaller: with fewer L1 cache misses, the cost of fixed is a relatively larger share of the total.

Inc baseline

[Column chart: BitFaster Caching Benchmarks Lfu SketchIncrement]

Tabular .NET8
| Method | Size | Mean | Error | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|---|
| IncFlat | 32768 | 12.04 ns | 0.127 ns | 0.113 ns | 1.00 | - |
| IncBlockAvx | 32768 | 12.15 ns | 0.110 ns | 0.092 ns | 1.01 | - |
| IncFlat | 524288 | 22.14 ns | 0.414 ns | 0.323 ns | 1.00 | - |
| IncBlockAvx | 524288 | 17.32 ns | 0.250 ns | 0.222 ns | 0.78 | - |
| IncFlat | 8388608 | 67.38 ns | 1.336 ns | 1.184 ns | 1.00 | - |
| IncBlockAvx | 8388608 | 62.36 ns | 0.408 ns | 0.362 ns | 0.93 | - |
| IncFlat | 134217728 | 88.89 ns | 1.592 ns | 1.834 ns | 1.00 | - |
| IncBlockAvx | 134217728 | 75.63 ns | 0.415 ns | 0.388 ns | 0.85 | - |

Inc pinned + 32 byte align

[Column chart: BitFaster Caching Benchmarks Lfu SketchIncrement] (chart title is wrong, but this is the inc test)

Tabular .NET8
| Method | Size | Mean | Error | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|---|
| IncFlat | 32768 | 12.07 ns | 0.109 ns | 0.097 ns | 1.00 | - |
| IncBlockAvx | 32768 | 11.49 ns | 0.043 ns | 0.041 ns | 0.95 | - |
| IncFlat | 524288 | 21.93 ns | 0.391 ns | 0.573 ns | 1.00 | - |
| IncBlockAvx | 524288 | 20.58 ns | 0.331 ns | 0.310 ns | 0.93 | - |
| IncFlat | 8388608 | 66.81 ns | 0.571 ns | 0.477 ns | 1.00 | - |
| IncBlockAvx | 8388608 | 48.79 ns | 0.421 ns | 0.394 ns | 0.73 | - |
| IncFlat | 134217728 | 88.23 ns | 0.950 ns | 0.793 ns | 1.00 | - |
| IncBlockAvx | 134217728 | 60.81 ns | 0.559 ns | 0.523 ns | 0.69 | - |

Freq Baseline

[Column chart: BitFaster Caching Benchmarks Lfu SketchFrequency]

Tabular .NET8
| Method | Size | Mean | Error | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|---|
| FrequencyFlat | 32768 | 19.50 ns | 0.075 ns | 0.058 ns | 1.00 | - |
| FrequencyBlockAvx | 32768 | 15.13 ns | 0.150 ns | 0.133 ns | 0.78 | - |
| FrequencyFlat | 524288 | 32.67 ns | 0.650 ns | 0.773 ns | 1.00 | - |
| FrequencyBlockAvx | 524288 | 23.40 ns | 0.398 ns | 0.458 ns | 0.72 | - |
| FrequencyFlat | 8388608 | 118.10 ns | 0.665 ns | 0.590 ns | 1.00 | - |
| FrequencyBlockAvx | 8388608 | 64.05 ns | 0.471 ns | 0.393 ns | 0.54 | - |
| FrequencyFlat | 134217728 | 148.67 ns | 1.288 ns | 1.075 ns | 1.00 | - |
| FrequencyBlockAvx | 134217728 | 77.84 ns | 1.510 ns | 1.484 ns | 0.53 | - |

Freq pinned + 32 byte align

[Column chart: BitFaster Caching Benchmarks Lfu SketchFrequency]

Tabular .NET8
| Method | Size | Mean | Error | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|---|
| FrequencyFlat | 32768 | 20.50 ns | 0.151 ns | 0.134 ns | 1.00 | - |
| FrequencyBlockAvx | 32768 | 13.72 ns | 0.093 ns | 0.087 ns | 0.67 | - |
| FrequencyFlat | 524288 | 31.98 ns | 0.609 ns | 1.460 ns | 1.00 | - |
| FrequencyBlockAvx | 524288 | 22.36 ns | 0.439 ns | 0.431 ns | 0.66 | - |
| FrequencyFlat | 8388608 | 119.80 ns | 2.053 ns | 1.921 ns | 1.00 | - |
| FrequencyBlockAvx | 8388608 | 61.45 ns | 0.596 ns | 0.558 ns | 0.51 | - |
| FrequencyFlat | 134217728 | 148.59 ns | 1.448 ns | 1.284 ns | 1.00 | - |
| FrequencyBlockAvx | 134217728 | 72.96 ns | 0.530 ns | 0.496 ns | 0.49 | - |

bitfaster, May 29 '24 00:05