BitFaster.Caching
Allocate pinned buffer for vectorized code
The current AVX2 vectorized code has little advantage over the flat code on .NET8 and .NET9. We can gain some speed by allocating the table on the pinned object heap introduced in .NET5, which eliminates the fixed statement. With a fixed address, we can also align the start of the buffer to 32 bytes, which is best for AVX instructions.
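As a minimal sketch of the idea above (not the code in this PR), the buffer can be allocated with `GC.AllocateArray<T>(length, pinned: true)`, over-allocated by 32 bytes, and the raw address rounded up to the next 32 byte boundary. The names `PinnedAlignedBuffer` and `Allocate` are hypothetical, and the project must allow unsafe code.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical helper for illustration; not part of BitFaster.Caching.
public static unsafe class PinnedAlignedBuffer
{
    private const int Alignment = 32; // bytes, the width of one AVX2 vector

    // Returns a 32 byte aligned pointer to `length` usable longs.
    // `owner` must stay referenced for as long as the pointer is in use,
    // otherwise the array can be collected.
    public static long* Allocate(int length, out long[] owner)
    {
        // Over-allocate by 32 bytes (4 longs) so there is room to round up.
        int pad = Alignment / sizeof(long);

        // Arrays on the pinned object heap are never moved by the GC,
        // so the address taken below stays valid without a fixed statement.
        owner = GC.AllocateArray<long>(length + pad, pinned: true);

        ulong addr = (ulong)Unsafe.AsPointer(ref owner[0]);

        // Round up to the next multiple of 32.
        ulong aligned = (addr + (Alignment - 1)) & ~(ulong)(Alignment - 1);

        return (long*)aligned;
    }
}
```

The caller keeps `owner` in a field next to the pointer. Array data is at least 8 byte aligned, so rounding up skips at most 24 bytes and the aligned region always fits inside the padded array.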
Using fixed produces a pinned local variable in IL; the runtime overhead comes from the JITted code. Explanation here.
This has a greater impact on the increment code path, and when the table is smaller: with fewer L1 cache misses, the cost of fixed is a larger share of the total time.
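To illustrate the difference between the two access patterns, here is a hedged sketch that contrasts pinning on every call with caching the address once at construction. The `Counters` class and its members are hypothetical and much simpler than the real sketch code; the project must allow unsafe code.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical type for illustration; not part of BitFaster.Caching.
public unsafe class Counters
{
    private readonly long[] table;     // keeps the pinned array alive
    private readonly long* tableAddr;  // address cached once, valid for the array's lifetime

    public Counters(int length)
    {
        // Allocated on the pinned object heap, so the GC never moves it.
        table = GC.AllocateArray<long>(length, pinned: true);
        tableAddr = (long*)Unsafe.AsPointer(ref table[0]);
    }

    // Pins the array on every call via a pinned local.
    public void IncrementWithFixed(int index)
    {
        fixed (long* p = table)
        {
            p[index]++;
        }
    }

    // Uses the cached address; no pinned local is needed.
    public void IncrementWithPinnedPointer(int index)
    {
        tableAddr[index]++;
    }

    public long this[int index] => table[index];
}
```

Both methods update the same memory. The second skips the per-call pinning work, which is the overhead this change is intended to remove.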
Inc baseline
Tabular .NET8
| Method | Size | Mean | Error | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|---|
| IncFlat | 32768 | 12.04 ns | 0.127 ns | 0.113 ns | 1.00 | - |
| IncBlockAvx | 32768 | 12.15 ns | 0.110 ns | 0.092 ns | 1.01 | - |
| IncFlat | 524288 | 22.14 ns | 0.414 ns | 0.323 ns | 1.00 | - |
| IncBlockAvx | 524288 | 17.32 ns | 0.250 ns | 0.222 ns | 0.78 | - |
| IncFlat | 8388608 | 67.38 ns | 1.336 ns | 1.184 ns | 1.00 | - |
| IncBlockAvx | 8388608 | 62.36 ns | 0.408 ns | 0.362 ns | 0.93 | - |
| IncFlat | 134217728 | 88.89 ns | 1.592 ns | 1.834 ns | 1.00 | - |
| IncBlockAvx | 134217728 | 75.63 ns | 0.415 ns | 0.388 ns | 0.85 | - |
Inc pinned + 32 byte align
(chart title is wrong, but this is the inc test)
Tabular .NET8
| Method | Size | Mean | Error | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|---|
| IncFlat | 32768 | 12.07 ns | 0.109 ns | 0.097 ns | 1.00 | - |
| IncBlockAvx | 32768 | 11.49 ns | 0.043 ns | 0.041 ns | 0.95 | - |
| IncFlat | 524288 | 21.93 ns | 0.391 ns | 0.573 ns | 1.00 | - |
| IncBlockAvx | 524288 | 20.58 ns | 0.331 ns | 0.310 ns | 0.93 | - |
| IncFlat | 8388608 | 66.81 ns | 0.571 ns | 0.477 ns | 1.00 | - |
| IncBlockAvx | 8388608 | 48.79 ns | 0.421 ns | 0.394 ns | 0.73 | - |
| IncFlat | 134217728 | 88.23 ns | 0.950 ns | 0.793 ns | 1.00 | - |
| IncBlockAvx | 134217728 | 60.81 ns | 0.559 ns | 0.523 ns | 0.69 | - |
Freq baseline
Tabular .NET8
| Method | Size | Mean | Error | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|---|
| FrequencyFlat | 32768 | 19.50 ns | 0.075 ns | 0.058 ns | 1.00 | - |
| FrequencyBlockAvx | 32768 | 15.13 ns | 0.150 ns | 0.133 ns | 0.78 | - |
| FrequencyFlat | 524288 | 32.67 ns | 0.650 ns | 0.773 ns | 1.00 | - |
| FrequencyBlockAvx | 524288 | 23.40 ns | 0.398 ns | 0.458 ns | 0.72 | - |
| FrequencyFlat | 8388608 | 118.10 ns | 0.665 ns | 0.590 ns | 1.00 | - |
| FrequencyBlockAvx | 8388608 | 64.05 ns | 0.471 ns | 0.393 ns | 0.54 | - |
| FrequencyFlat | 134217728 | 148.67 ns | 1.288 ns | 1.075 ns | 1.00 | - |
| FrequencyBlockAvx | 134217728 | 77.84 ns | 1.510 ns | 1.484 ns | 0.53 | - |
Freq pinned + 32 byte align
Tabular .NET8
| Method | Size | Mean | Error | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|---|
| FrequencyFlat | 32768 | 20.50 ns | 0.151 ns | 0.134 ns | 1.00 | - |
| FrequencyBlockAvx | 32768 | 13.72 ns | 0.093 ns | 0.087 ns | 0.67 | - |
| FrequencyFlat | 524288 | 31.98 ns | 0.609 ns | 1.460 ns | 1.00 | - |
| FrequencyBlockAvx | 524288 | 22.36 ns | 0.439 ns | 0.431 ns | 0.66 | - |
| FrequencyFlat | 8388608 | 119.80 ns | 2.053 ns | 1.921 ns | 1.00 | - |
| FrequencyBlockAvx | 8388608 | 61.45 ns | 0.596 ns | 0.558 ns | 0.51 | - |
| FrequencyFlat | 134217728 | 148.59 ns | 1.448 ns | 1.284 ns | 1.00 | - |
| FrequencyBlockAvx | 134217728 | 72.96 ns | 0.530 ns | 0.496 ns | 0.49 | - |