QB4W AVX2 GEMM Kernels
This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting x86 via the AVX2 instruction family.
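For readers less familiar with the qb4w format, here is a minimal scalar sketch of the math one output element goes through. The details below (per-row int8 activation scale, unsigned weight nibbles with a zero point of 8, one f32 scale per block, no activation zero-point correction) are assumptions for illustration, not the packed layout or exact arithmetic the AVX2 microkernels use:

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative scalar reference for one output element of a blockwise
// 4-bit (qb4w) GEMM. Assumptions (mine, not taken from this PR): qd8
// activations with a per-row scale, unsigned 4-bit weights with zero
// point 8, one f32 scale per block, no activation zero-point correction.
static float qb4w_gemm_ref_element(
    size_t kc,                 // reduction dimension, assumed a multiple of bl
    size_t bl,                 // block size along k
    const int8_t* a,           // kc dynamically quantized (qd8) activations
    float a_scale,             // dequantization scale for this activation row
    const uint8_t* w,          // kc 4-bit weights, one nibble stored per byte
    const float* block_scales, // kc / bl per-block weight scales
    float bias)
{
  float acc = 0.0f;
  for (size_t block = 0; block < kc / bl; block++) {
    int32_t block_acc = 0;
    for (size_t k = 0; k < bl; k++) {
      // Unpack the nibble around its zero point and accumulate in int32.
      const int32_t wv = (int32_t) w[block * bl + k] - 8;
      block_acc += (int32_t) a[block * bl + k] * wv;
    }
    // The per-block scale is applied once per block; this extra
    // multiply-add every bl elements is the block-loop overhead
    // relative to qc4w.
    acc += (float) block_acc * block_scales[block];
  }
  return acc * a_scale + bias;
}
```

With bl equal to kc this collapses to a single scale per output channel, which is why the qc4w numbers below serve as a fair baseline.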
Note: This PR includes one commit from https://github.com/google/XNNPACK/pull/6557 (test generation update for qb4w). I'm putting this PR up before that one merges so that review can start now.
Tests and benchmarks were run on Intel Ice Lake; I also did some informal benchmarking on Zen 3, which I can include if desired. The benchmark data includes qc4w results for comparison: a blockwise kernel with block_size equal to kc is functionally equivalent to qc4w, so qc4w is a reasonable baseline. I expect qb4w with bl=256 to be slightly slower than qc4w due to the small increase in weight memory (~4.125 bits/weight versus ~4 bits/weight for qc4w) and the overhead of the block loop.
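For reference, the ~4.125 figure is consistent with assuming one 32-bit scale per block of 256 weights: 4 + 32/256 = 4.125 bits/weight (the exact number depends on the scale format in the packed weights).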
Average of OPS by microkernel tile size (1x8c8 through 4x8c8):

| n | k | bl | datatype | 1x8c8 | 2x8c8 | 3x8c8 | 4x8c8 |
|---|---|---|---|---|---|---|---|
| 16 | 1024 | 32 | qd8_f32_qb4w | 30.81 | 33.59 | 35.82 | 25.56 |
| 16 | 1024 | 256 | qd8_f32_qb4w | 37.39 | 44.77 | 46.17 | 31.61 |
| 16 | 1024 | n/a | qd8_f32_qc4w | 38.99 | 45.29 | 48.60 | 45.71 |
| 128 | 1024 | 32 | qd8_f32_qb4w | 30.29 | 33.55 | 34.90 | 25.77 |
| 128 | 1024 | 256 | qd8_f32_qb4w | 37.74 | 44.63 | 46.17 | 32.17 |
| 128 | 1024 | n/a | qd8_f32_qc4w | 38.91 | 44.62 | 49.21 | 44.86 |
| 4096 | 1024 | 32 | qd8_f32_qb4w | 29.71 | 32.69 | 34.89 | 25.58 |
| 4096 | 1024 | 256 | qd8_f32_qb4w | 37.24 | 45.28 | 45.46 | 32.03 |
| 4096 | 1024 | n/a | qd8_f32_qc4w | 38.83 | 46.76 | 47.19 | 44.68 |
| 11008 | 4096 | 32 | qd8_f32_qb4w | 27.13 | 32.23 | 33.80 | 25.78 |
| 11008 | 4096 | 256 | qd8_f32_qb4w | 37.29 | 44.41 | 46.50 | 32.43 |
| 11008 | 4096 | n/a | qd8_f32_qc4w | 36.02 | 45.84 | 48.99 | 44.29 |
| 32000 | 4096 | 32 | qd8_f32_qb4w | 19.77 | 26.41 | 30.11 | 25.09 |
| 32000 | 4096 | 256 | qd8_f32_qb4w | 30.58 | 41.53 | 44.61 | 32.19 |
| 32000 | 4096 | n/a | qd8_f32_qc4w | 29.53 | 40.96 | 45.74 | 44.25 |