
QB4W AVX2 GEMM Kernels

GregoryComer opened this pull request 8 months ago · 2 comments

This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting x86 via the AVX2 instruction family.

Note: This PR includes one commit from https://github.com/google/XNNPACK/pull/6557 (Test generation update for qb4w). I'm putting this PR up for review before that PR merges so that we can start the review process.

Tests and benchmarks were run on Intel Ice Lake. I also did some informal benchmarking on Zen 3, which I can include if desired. The benchmark data includes qc4w results for comparison. Note that blockwise kernels with block_size equal to kc are functionally equivalent to qc4w, so qc4w provides a reasonable performance baseline. I expect qb4w with bl=256 to be slightly less performant than qc4w due to the small increase in memory footprint (~4.125 bits/weight vs ~4 bits/weight for qc4w), as well as the overhead of the block loop.
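For reference, the bits-per-weight figures above follow from amortizing one per-block scale over the block. This is a minimal sketch of that arithmetic; the 32-bit scale width is an assumption inferred from the ~4.125 bits/weight figure quoted for bl=256, not a statement about XNNPACK's actual storage format.

```python
def qb4w_bits_per_weight(block_size: int, scale_bits: int = 32) -> float:
    """Effective storage cost of blockwise 4-bit weights.

    Each weight costs 4 bits, plus one per-block scale (assumed
    scale_bits wide here) shared by block_size weights.
    """
    return 4 + scale_bits / block_size

# bl=256 amortizes the scale further than bl=32, so it sits closer
# to the ~4 bits/weight of channelwise qc4w.
print(qb4w_bits_per_weight(256))  # → 4.125
print(qb4w_bits_per_weight(32))   # → 5.0
```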

Average OPS by tile size:

| n     | k    | bl  | datatype     | 1x8c8 | 2x8c8 | 3x8c8 | 4x8c8 |
|-------|------|-----|--------------|-------|-------|-------|-------|
| 16    | 1024 | 32  | qd8_f32_qb4w | 30.81 | 33.59 | 35.82 | 25.56 |
| 16    | 1024 | 256 | qd8_f32_qb4w | 37.39 | 44.77 | 46.17 | 31.61 |
| 16    | 1024 | N/A | qd8_f32_qc4w | 38.99 | 45.29 | 48.60 | 45.71 |
| 128   | 1024 | 32  | qd8_f32_qb4w | 30.29 | 33.55 | 34.90 | 25.77 |
| 128   | 1024 | 256 | qd8_f32_qb4w | 37.74 | 44.63 | 46.17 | 32.17 |
| 128   | 1024 | N/A | qd8_f32_qc4w | 38.91 | 44.62 | 49.21 | 44.86 |
| 4096  | 1024 | 32  | qd8_f32_qb4w | 29.71 | 32.69 | 34.89 | 25.58 |
| 4096  | 1024 | 256 | qd8_f32_qb4w | 37.24 | 45.28 | 45.46 | 32.03 |
| 4096  | 1024 | N/A | qd8_f32_qc4w | 38.83 | 46.76 | 47.19 | 44.68 |
| 11008 | 4096 | 32  | qd8_f32_qb4w | 27.13 | 32.23 | 33.80 | 25.78 |
| 11008 | 4096 | 256 | qd8_f32_qb4w | 37.29 | 44.41 | 46.50 | 32.43 |
| 11008 | 4096 | N/A | qd8_f32_qc4w | 36.02 | 45.84 | 48.99 | 44.29 |
| 32000 | 4096 | 32  | qd8_f32_qb4w | 19.77 | 26.41 | 30.11 | 25.09 |
| 32000 | 4096 | 256 | qd8_f32_qb4w | 30.58 | 41.53 | 44.61 | 32.19 |
| 32000 | 4096 | N/A | qd8_f32_qc4w | 29.53 | 40.96 | 45.74 | 44.25 |
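The block_size == kc equivalence mentioned above can be illustrated with a plain NumPy reference: when the whole reduction dimension is one block, the per-block scale degenerates to a per-channel scale, which is exactly the qc4w case. This is a hedged sketch for clarity only — it uses float activations rather than the qd8 inputs the real microkernels consume, and `qb4w_gemm_ref` is a hypothetical helper name, not an XNNPACK API.

```python
import numpy as np

def qb4w_gemm_ref(a, w_q4, scales, bl):
    """Reference blockwise 4-bit GEMM (dequantize-then-multiply).

    a      : (m, k) float activations (simplification; real kernels use qd8)
    w_q4   : (k, n) signed 4-bit weight values in [-8, 7]
    scales : (k // bl, n) one scale per (block, output channel)
    bl     : block size along the reduction dimension k
    """
    m, k = a.shape
    out = np.zeros((m, w_q4.shape[1]), dtype=np.float32)
    for b in range(k // bl):
        blk = slice(b * bl, (b + 1) * bl)
        # Dequantize this block of weights, then accumulate its partial product.
        out += a[:, blk] @ (w_q4[blk].astype(np.float32) * scales[b])
    return out
```

With bl == k there is a single block, so the loop body runs once and the result matches a channelwise-dequantized matmul, which is why qc4w is a fair upper-bound comparison for qb4w throughput.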

GregoryComer · Jun 24 '24 21:06