
QB4W AVX2 GEMM Kernels

GregoryComer opened this pull request 8 months ago · 2 comments

This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting x86 via the AVX2 instruction family.

Note: This PR includes one commit from https://github.com/google/XNNPACK/pull/6557 (Test generation update for qb4w). I'm putting this PR up for review before that PR merges so that we can start the review process.

Tests and benchmarks were run on Intel Ice Lake. I also did some informal benchmarking on Zen 3, which I can include if desired. The benchmark data includes qc4w results for comparison. Note that blockwise kernels with block_size equal to kc are functionally equivalent to qc4w, so qc4w provides a reasonable performance baseline. I expect qb4w with bl=256 to be slightly less performant than qc4w due to the small increase in memory footprint (~4.125 bits/weight vs ~4 bits/weight for qc4w), as well as the overhead of the block loop.
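For reference, the bits-per-weight figures above follow from amortizing one per-block scale over the block. This is a minimal sketch of that arithmetic; the 32-bit scale width is an assumption inferred from the ~4.125 bits/weight figure quoted for bl=256, not a statement about XNNPACK's actual storage format.

```python
def qb4w_bits_per_weight(block_size: int, scale_bits: int = 32) -> float:
    """Effective storage cost of blockwise 4-bit weights.

    Each weight costs 4 bits, plus one per-block scale (assumed
    scale_bits wide here) shared by block_size weights.
    """
    return 4 + scale_bits / block_size

# bl=256 amortizes the scale further than bl=32, so it sits closer
# to the ~4 bits/weight of channelwise qc4w.
print(qb4w_bits_per_weight(256))  # → 4.125
print(qb4w_bits_per_weight(32))   # → 5.0
```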

Average OPS by tile size:

| n     | k    | bl  | datatype     | 1x8c8 | 2x8c8 | 3x8c8 | 4x8c8 |
|-------|------|-----|--------------|-------|-------|-------|-------|
| 16    | 1024 | 32  | qd8_f32_qb4w | 30.81 | 33.59 | 35.82 | 25.56 |
| 16    | 1024 | 256 | qd8_f32_qb4w | 37.39 | 44.77 | 46.17 | 31.61 |
| 16    | 1024 | N/A | qd8_f32_qc4w | 38.99 | 45.29 | 48.60 | 45.71 |
| 128   | 1024 | 32  | qd8_f32_qb4w | 30.29 | 33.55 | 34.90 | 25.77 |
| 128   | 1024 | 256 | qd8_f32_qb4w | 37.74 | 44.63 | 46.17 | 32.17 |
| 128   | 1024 | N/A | qd8_f32_qc4w | 38.91 | 44.62 | 49.21 | 44.86 |
| 4096  | 1024 | 32  | qd8_f32_qb4w | 29.71 | 32.69 | 34.89 | 25.58 |
| 4096  | 1024 | 256 | qd8_f32_qb4w | 37.24 | 45.28 | 45.46 | 32.03 |
| 4096  | 1024 | N/A | qd8_f32_qc4w | 38.83 | 46.76 | 47.19 | 44.68 |
| 11008 | 4096 | 32  | qd8_f32_qb4w | 27.13 | 32.23 | 33.80 | 25.78 |
| 11008 | 4096 | 256 | qd8_f32_qb4w | 37.29 | 44.41 | 46.50 | 32.43 |
| 11008 | 4096 | N/A | qd8_f32_qc4w | 36.02 | 45.84 | 48.99 | 44.29 |
| 32000 | 4096 | 32  | qd8_f32_qb4w | 19.77 | 26.41 | 30.11 | 25.09 |
| 32000 | 4096 | 256 | qd8_f32_qb4w | 30.58 | 41.53 | 44.61 | 32.19 |
| 32000 | 4096 | N/A | qd8_f32_qc4w | 29.53 | 40.96 | 45.74 | 44.25 |
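The block_size == kc equivalence mentioned above can be illustrated with a plain NumPy reference: when the whole reduction dimension is one block, the per-block scale degenerates to a per-channel scale, which is exactly the qc4w case. This is a hedged sketch for clarity only — it uses float activations rather than the qd8 inputs the real microkernels consume, and `qb4w_gemm_ref` is a hypothetical helper name, not an XNNPACK API.

```python
import numpy as np

def qb4w_gemm_ref(a, w_q4, scales, bl):
    """Reference blockwise 4-bit GEMM (dequantize-then-multiply).

    a      : (m, k) float activations (simplification; real kernels use qd8)
    w_q4   : (k, n) signed 4-bit weight values in [-8, 7]
    scales : (k // bl, n) one scale per (block, output channel)
    bl     : block size along the reduction dimension k
    """
    m, k = a.shape
    out = np.zeros((m, w_q4.shape[1]), dtype=np.float32)
    for b in range(k // bl):
        blk = slice(b * bl, (b + 1) * bl)
        # Dequantize this block of weights, then accumulate its partial product.
        out += a[:, blk] @ (w_q4[blk].astype(np.float32) * scales[b])
    return out
```

With bl == k there is a single block, so the loop body runs once and the result matches a channelwise-dequantized matmul, which is why qc4w is a fair upper-bound comparison for qb4w throughput.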

GregoryComer · Jun 24 '24 21:06