QB4W SIMD Packing for neondot
Summary
We add scalar, neondot, and neoni8mm packing kernels for 4-bit blockwise quantized (QB4) GEMM weights. These kernels target only the 16c4 and 16c8 GEMMs. They provide a significant uplift in packing performance and, consequently, in model load time. We productionize only the neondot and neoni8mm kernels, although scalar kernels were also written. We add tests and benchmarks for the blockwise packing routines. A key note about these kernels: they initialize the bias and scales in one shot. Unlike the existing reference kernels, which use the extra-bytes field to leave holes in the packed weights that are later filled by microparams-init functions, the new ukernels pack the weights together with the bias and scales.
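To make the one-shot behaviour concrete, here is a heavily simplified, hypothetical sketch of packing a single NR-wide tile. The real XNNPACK qb4 layout, interleaving, and names differ; this only illustrates that bias and scales are written during packing instead of by a later microparams-init pass.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical one-shot tile packer (illustrative only, not the XNNPACK layout). */
static void pack_tile_one_shot(
    size_t nr, size_t kc, size_t block_size,
    const uint8_t* weights,   /* kc/2 bytes per column: two 4-bit values per byte */
    const int32_t* bias,      /* nr entries */
    const uint16_t* scales,   /* kc/block_size entries per column (bf16 bits) */
    uint8_t* packed)          /* destination tile */
{
  /* 1. Bias is written up front rather than back-filled by an init function. */
  memcpy(packed, bias, nr * sizeof(int32_t));
  packed += nr * sizeof(int32_t);

  /* 2. Nibble-packed weights (the real kernels also interleave for c4/c8 here). */
  memcpy(packed, weights, nr * (kc / 2));
  packed += nr * (kc / 2);

  /* 3. Per-block scales follow in the same buffer, so no holes are left. */
  memcpy(packed, scales, nr * (kc / block_size) * sizeof(uint16_t));
}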
Impact
All performance metrics were gathered on a Samsung S24 device. Build modes and flags come from the provided scripts/build-android-arm64.sh.
Average of 10 Runs:
qb4_packw_x16c4_goi__reference/llm/B:1/M:128/N:4096/K:1024/real_time_mean 4905074 ns 4887502 ns 10 bytes=1.77032G/s cpufreq=2.3808G elements=855.096M/s
qb4_packw_x16c8_goi__reference/llm/B:1/M:128/N:4096/K:1024/real_time_mean 4320939 ns 4304560 ns 10 bytes=2.00964G/s cpufreq=2.3808G elements=970.693M/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1070324 ns 1066204 ns 10 bytes=8.11299G/s cpufreq=2.3808G elements=3.91873G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1223852 ns 1219007 ns 10 bytes=7.09524G/s cpufreq=2.3808G elements=3.42713G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 342984 ns 341661 ns 10 bytes=25.3176G/s cpufreq=2.3808G elements=12.2289G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__neoni8mm_llm/B:1/M:128/N:4096/K:1024/real_time_mean 288537 ns 287416 ns 10 bytes=30.0951G/s cpufreq=2.3808G elements=14.5365G/s
In general we see a ~4x improvement from reference --> scalar and a ~15x improvement from reference --> SIMD. Binary size impact of these changes:
0.2% 11.5Ki 1.3% 11.4Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar
0.1% 7.46Ki 0.9% 7.36Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar
0.1% 5.42Ki 0.5% 4.09Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot
0.1% 3.90Ki 0.4% 3.80Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__neoni8mm
The scalar kernels contribute significantly to binary size due to their heavy loop unrolling; the neondot and neoni8mm kernels fare better.
Additional Experiment
In the commit titled Reduce Binary Size Impact, we experiment with removing the separate tail-handling block for when NC % NR != 0 (sketched at the end of this section):
if XNN_UNLIKELY(n != 0) {
Instead, we fold this case into the main loop body:
for (; n > 0; n -= 16) {
and just clamp the weights and scales when n < 16. Surprisingly, this didn't have a significant performance impact:
qb4_packw_x16c4_goi__reference/llm/B:1/M:128/N:4096/K:1024/real_time_mean 4911195 ns 4893393 ns 10 bytes=1.76811G/s cpufreq=2.3808G elements=854.029M/s
qb4_packw_x16c8_goi__reference/llm/B:1/M:128/N:4096/K:1024/real_time_mean 4324329 ns 4308086 ns 10 bytes=2.00806G/s cpufreq=2.3808G elements=969.932M/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1067370 ns 1063305 ns 10 bytes=8.13545G/s cpufreq=2.3808G elements=3.92957G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1185798 ns 1181115 ns 10 bytes=7.32294G/s cpufreq=2.3808G elements=3.53712G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 342087 ns 340746 ns 10 bytes=25.3839G/s cpufreq=2.3808G elements=12.2609G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__neoni8mm_llm/B:1/M:128/N:4096/K:1024/real_time_mean 281989 ns 280882 ns 10 bytes=30.7939G/s cpufreq=2.3808G elements=14.874G/s
But it reduced binary size by roughly 50%:
0.1% 6.19Ki 0.7% 6.09Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar
0.1% 4.06Ki 0.5% 3.96Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar
0.0% 2.51Ki 0.3% 2.42Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot
0.0% 2.38Ki 0.3% 2.28Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__neoni8mm
Since neondot and neoni8mm are the only productionized kernels, the binary size impact seems tolerable here. This branch contains the changes introduced to reduce binary size.
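For reference, a rough sketch of the tail-handling change described above, in illustrative pseudo-C (plain float columns instead of nibble-packed weights, biases, and scales; not the actual kernel code):

#include <stddef.h>

/* Single loop whose body clamps column indices, so partial tiles (n < 16)
 * reuse the same code path instead of a separate XNN_UNLIKELY(n != 0)
 * remainder block. */
static void pack_columns_clamped(size_t nc, const float* src, float* dst) {
  for (ptrdiff_t n = (ptrdiff_t) nc; n > 0; n -= 16) {
    for (size_t i = 0; i < 16; i++) {
      /* Clamp so a partial tile re-reads the last valid column; full tiles
       * take the i < n branch every time. */
      const size_t col = (i < (size_t) n) ? i : (size_t) n - 1;
      dst[i] = src[col];
    }
    src += (n < 16) ? (size_t) n : (size_t) 16;
    dst += 16;  /* packed output is always padded to a full 16-wide tile */
  }
}

Dropping the dedicated remainder block is what roughly halves the generated code, while the extra per-column clamp is cheap enough not to show up in the benchmarks above.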
Further Considerations
I believe the microkernel generation templates could be further generalized to also cover the QC4 case, but due to the complexity of QB4's special handling, I've chosen to leave that for later.
I'd also like to follow up by multithreading these kernels; my experiments show this can yield significant further improvements in load time.
Edits
Updated the branch for the neondot kernel to use vdot instead of vpadd; we saw the following improvement:
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 307581 ns 306382 ns 10 bytes=28.2317G/s cpufreq=2.3808G elements=13.6364G/s
And
0.3% 2.34Ki 0.3% 2.34Ki _xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot
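For context, a minimal sketch of what the vdot change does, as I read it (not the exact kernel code): the per-column weight sums computed during packing can use a single SDOT against a vector of ones rather than a widen-then-pairwise-add chain. Assumes an AArch64 toolchain with +dotprod enabled.

#include <arm_neon.h>

/* Pairwise-add flavour: widen 16 x s8 to 8 x s16, then accumulate into 4 x s32. */
static inline int32x4_t ksum_vpadd(int32x4_t acc, int8x16_t w) {
  const int16x8_t sum16 = vpaddlq_s8(w);
  return vpadalq_s16(acc, sum16);
}

/* Dot-product flavour: one SDOT against a vector of ones does the same 4-way
 * reduction per lane in a single instruction. */
static inline int32x4_t ksum_vdot(int32x4_t acc, int8x16_t w) {
  const int8x16_t vones = vmovq_n_s8(1);
  return vdotq_s32(acc, w, vones);
}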
@alankelly @fbarchard for reviews
Force push changes:
- Resolved some nits in the ukernels
- Removed OOM-ing benchmarks
- Used vdot instead of vpadd in neondot
@alankelly @dsharlet @fbarchard
New Changes:
- Remove $ABC from packw templates
- Move the neoni8mm kernel to neondot: change mmlaq to vdot and do the vpadd outside the loop
- Run clang-format
@fbarchard for another review
Updates:
- Updated kernels to support both signed and unsigned weights (zero_point = 0 and zero_point = 8); see the sketch after this list
- Added a comment about initializing the bias in the packing kernel
- Renamed templates to c4-neondot.in and c8-neondot.in
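A minimal sketch of the signed/unsigned handling, as I understand it (illustrative, not the kernel code): for zero_point = 8 the unsigned nibbles are XOR-ed with 0x88, which subtracts 8 from each 4-bit value modulo 16, so reinterpreting them as signed yields value - 8; for zero_point = 0 the mask is 0x00 and the bytes pass through unchanged.

#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* One byte holds two unsigned 4-bit weights: low nibble 0x0, high nibble 0xF. */
  const uint8_t packed = 0xF0;
  /* zero_point = 8: flip bit 3 of each nibble, i.e. (v - 8) mod 16 per nibble. */
  const uint8_t flipped = packed ^ 0x88;                     /* 0x78 */
  const int8_t lo = (int8_t) (uint8_t) (flipped << 4) >> 4;  /* sign-extend low nibble */
  const int8_t hi = (int8_t) flipped >> 4;                   /* sign-extend high nibble */
  printf("lo=%d hi=%d\n", lo, hi);  /* prints lo=-8 hi=7, i.e. 0-8 and 15-8 */
  return 0;
}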
Performance Updates: Updating to support both signed and unsigned seems to have a large effect on the scalar kernels. Performance regresses ~10%, and for some reason the scalar c4 kernel's binary size almost triples, from 3.96Ki to ~10Ki (the c8 scalar actually shrinks). I looked into it for some time but couldn't figure out why; since the scalar kernels aren't productionized here, I didn't block on it. The SIMD kernels don't change much:
Binary Size
0.1% 10.4Ki 1.2% 10.3Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar
0.1% 5.71Ki 0.6% 5.61Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar
0.1% 3.63Ki 0.3% 2.29Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot
0.0% 2.42Ki 0.3% 2.32Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__neondot
Perf
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1206842 ns 1202209 ns 10 bytes=7.19524G/s cpufreq=2.3808G elements=3.47544G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1387612 ns 1382595 ns 10 bytes=6.25789G/s cpufreq=2.3808G elements=3.02268G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 291198 ns 290106 ns 10 bytes=29.82G/s cpufreq=2.3808G elements=14.4036G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 280933 ns 279850 ns 10 bytes=30.9096G/s cpufreq=2.3808G elements=14.9299G/s
cc. @fbarchard
@fbarchard @dsharletg @alankelly
please take a look if you folks get the chance
Rebasing
Resolved remaining build failures.
This is broken for us with many errors of the form:
src/qb4-packw/gen/qb4-packw-x16c4-gemm-goi-scalar.c:23:38: error: cast from 'const unsigned char *' to 'unsigned int *' drops const qualifier [-Werror,-Wcast-qual]
23 | const uint32_t s_v0 = (((uint32_t*)weights)[0] ^ vkernel_zero_point);
let me fix that, thanks for the catch.
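For reference, the fix is presumably just preserving the const qualifier in the generated cast, along these lines (names mirror the error message above; illustrative, not the actual generated code):

#include <stdint.h>

/* Const-correct version of the cast that -Wcast-qual complained about. */
static uint32_t load_xor(const uint8_t* weights, uint32_t vkernel_zero_point) {
  const uint32_t s_v0 = ((const uint32_t*) weights)[0] ^ vkernel_zero_point;
  return s_v0;
}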
There are build warnings on the scalar kernel?
third_party/XNNPACK/src/qb4-packw/gen/qb4-packw-x16c4-gemm-goi-scalar.c:23:38: error: cast from 'const unsigned char *' to 'unsigned int *' drops const qualifier [-Werror,-Wcast-qual]
23 | const uint32_t s_v0 = (((uint32_t*)weights)[0] ^ vkernel_zero_point);
third_party/XNNPACK/src/qb4-packw/gen/qb4-packw-x16c4-gemm-goi-scalar.c:24:32: error: cast from 'const unsigned int *' to 'signed char *' drops const qualifier [-Werror,-Wcast-qual]
24 | const int8_t v0 = (((int8_t*)&s_v0))[0] << 4;
This is still not building:
XNNPACK/src/qb4-packw/gen/qb4-packw-x16c8-gemm-goi-aarch64-neondot.c:71:19: error: unused variable 'veor_mask' [-Werror,-Wunused-variable]
71 | const int8x16_t veor_mask = vmovq_n_s8(UINT8_C(0x88));
| ^~~~~~~~~
XNNPACK/src/qb4-packw/gen/qb4-packw-x16c8-gemm-goi-aarch64-neondot.c:72:19: error: unused variable 'neg_zp' [-Werror,-Wunused-variable]
72 | const int32x4_t neg_zp = vmovq_n_s32(-64);
| ^~~~~~
XNNPACK/src/qb4-packw/gen/qb4-packw-x16c8-gemm-goi-aarch64-neondot.c:74:18: error: unused variable 'vzeros' [-Werror,-Wunused-variable]
74 | const int8x8_t vzeros = vmov_n_s8(0);