QB4W SIMD Packing for neondot
Summary
We add scalar, neondot, and neoni8mm packing kernels for 4-bit blockwise quantized (QB4) GEMM weights. These kernels target only the 16c4 and 16c8 GEMMs. They provide a significant uplift in packing performance and, consequently, in model load time. We productionize only the neondot and neoni8mm kernels, although scalar kernels were also written. We add tests and benchmarks for the blockwise packing routines. A key note about these kernels: they initialize the bias and scales in one shot. Unlike the existing reference kernels, which use the extra-bytes field to leave holes in the packed weights that are later filled by microparams-init functions, the new ukernels pack the weights together with the bias and scales.
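To make the one-shot behaviour concrete, here is a heavily simplified, hypothetical sketch of packing a single NR-wide tile. The real XNNPACK qb4 layout, interleaving, and names differ; this only illustrates that bias and scales are written during packing instead of by a later microparams-init pass.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical one-shot tile packer (illustrative only, not the XNNPACK layout). */
static void pack_tile_one_shot(
    size_t nr, size_t kc, size_t block_size,
    const uint8_t* weights,   /* kc/2 bytes per column: two 4-bit values per byte */
    const int32_t* bias,      /* nr entries */
    const uint16_t* scales,   /* kc/block_size entries per column (bf16 bits) */
    uint8_t* packed)          /* destination tile */
{
  /* 1. Bias is written up front rather than back-filled by an init function. */
  memcpy(packed, bias, nr * sizeof(int32_t));
  packed += nr * sizeof(int32_t);

  /* 2. Nibble-packed weights (the real kernels also interleave for c4/c8 here). */
  memcpy(packed, weights, nr * (kc / 2));
  packed += nr * (kc / 2);

  /* 3. Per-block scales follow in the same buffer, so no holes are left. */
  memcpy(packed, scales, nr * (kc / block_size) * sizeof(uint16_t));
}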
Impact
All performance metrics were gathered on a Samsung S24 device. Build modes and flags come from the provided scripts/build-android-arm64.sh.
Average of 10 Runs:
qb4_packw_x16c4_goi__reference/llm/B:1/M:128/N:4096/K:1024/real_time_mean 4905074 ns 4887502 ns 10 bytes=1.77032G/s cpufreq=2.3808G elements=855.096M/s
qb4_packw_x16c8_goi__reference/llm/B:1/M:128/N:4096/K:1024/real_time_mean 4320939 ns 4304560 ns 10 bytes=2.00964G/s cpufreq=2.3808G elements=970.693M/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1070324 ns 1066204 ns 10 bytes=8.11299G/s cpufreq=2.3808G elements=3.91873G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1223852 ns 1219007 ns 10 bytes=7.09524G/s cpufreq=2.3808G elements=3.42713G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 342984 ns 341661 ns 10 bytes=25.3176G/s cpufreq=2.3808G elements=12.2289G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__neoni8mm_llm/B:1/M:128/N:4096/K:1024/real_time_mean 288537 ns 287416 ns 10 bytes=30.0951G/s cpufreq=2.3808G elements=14.5365G/s
In general we see a ~4x improvement from reference --> scalar and a ~15x improvement from reference --> SIMD. Binary size impact of these changes:
0.2% 11.5Ki 1.3% 11.4Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar
0.1% 7.46Ki 0.9% 7.36Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar
0.1% 5.42Ki 0.5% 4.09Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot
0.1% 3.90Ki 0.4% 3.80Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__neoni8mm
The scalar kernels contribute significantly to binary size due to their heavy loop unrolling; the neondot and neoni8mm kernels fare better.
Additional Experiment
In the commit titled Reduce Binary Size Impact, we experiment with removing the separate tail-handling block for when NC % NR != 0 (sketched at the end of this section):
if XNN_UNLIKELY(n != 0) {
Instead, we fold this case into the main loop body:
for (; n > 0; n -= 16) {
and just clamp the weights and scales when n < 16. Surprisingly, this didn't have a significant performance impact:
qb4_packw_x16c4_goi__reference/llm/B:1/M:128/N:4096/K:1024/real_time_mean 4911195 ns 4893393 ns 10 bytes=1.76811G/s cpufreq=2.3808G elements=854.029M/s
qb4_packw_x16c8_goi__reference/llm/B:1/M:128/N:4096/K:1024/real_time_mean 4324329 ns 4308086 ns 10 bytes=2.00806G/s cpufreq=2.3808G elements=969.932M/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1067370 ns 1063305 ns 10 bytes=8.13545G/s cpufreq=2.3808G elements=3.92957G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1185798 ns 1181115 ns 10 bytes=7.32294G/s cpufreq=2.3808G elements=3.53712G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 342087 ns 340746 ns 10 bytes=25.3839G/s cpufreq=2.3808G elements=12.2609G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__neoni8mm_llm/B:1/M:128/N:4096/K:1024/real_time_mean 281989 ns 280882 ns 10 bytes=30.7939G/s cpufreq=2.3808G elements=14.874G/s
But it reduced binary size by roughly 50%:
0.1% 6.19Ki 0.7% 6.09Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar
0.1% 4.06Ki 0.5% 3.96Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar
0.0% 2.51Ki 0.3% 2.42Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot
0.0% 2.38Ki 0.3% 2.28Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__neoni8mm
Since neondot and neoni8mm are the only productionized kernels, the binary size impact seems tolerable here. This branch contains the changes introduced to reduce binary size.
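For reference, a rough sketch of the tail-handling change described above, in illustrative pseudo-C (plain float columns instead of nibble-packed weights, biases, and scales; not the actual kernel code):

#include <stddef.h>

/* Single loop whose body clamps column indices, so partial tiles (n < 16)
 * reuse the same code path instead of a separate XNN_UNLIKELY(n != 0)
 * remainder block. */
static void pack_columns_clamped(size_t nc, const float* src, float* dst) {
  for (ptrdiff_t n = (ptrdiff_t) nc; n > 0; n -= 16) {
    for (size_t i = 0; i < 16; i++) {
      /* Clamp so a partial tile re-reads the last valid column; full tiles
       * take the i < n branch every time. */
      const size_t col = (i < (size_t) n) ? i : (size_t) n - 1;
      dst[i] = src[col];
    }
    src += (n < 16) ? (size_t) n : (size_t) 16;
    dst += 16;  /* packed output is always padded to a full 16-wide tile */
  }
}

Dropping the dedicated remainder block is what roughly halves the generated code, while the extra per-column clamp is cheap enough not to show up in the benchmarks above.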
Further Considerations
I believe the microkernel generation templates could be further generalized to also cover the QC4 case, but due to the complexity of QB4's special handling, I've chosen to leave that for later.
I'd also like to follow up by multithreading these kernels; my experiments show this can yield significant further improvements in load time.
Edits
Updated the branch for the neondot kernel to use vdot instead of vpadd; we saw the following improvement:
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 307581 ns 306382 ns 10 bytes=28.2317G/s cpufreq=2.3808G elements=13.6364G/s
And
0.3% 2.34Ki 0.3% 2.34Ki _xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot
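For context, a minimal sketch of what the vdot change does, as I read it (not the exact kernel code): the per-column weight sums computed during packing can use a single SDOT against a vector of ones rather than a widen-then-pairwise-add chain. Assumes an AArch64 toolchain with +dotprod enabled.

#include <arm_neon.h>

/* Pairwise-add flavour: widen 16 x s8 to 8 x s16, then accumulate into 4 x s32. */
static inline int32x4_t ksum_vpadd(int32x4_t acc, int8x16_t w) {
  const int16x8_t sum16 = vpaddlq_s8(w);
  return vpadalq_s16(acc, sum16);
}

/* Dot-product flavour: one SDOT against a vector of ones does the same 4-way
 * reduction per lane in a single instruction. */
static inline int32x4_t ksum_vdot(int32x4_t acc, int8x16_t w) {
  const int8x16_t vones = vmovq_n_s8(1);
  return vdotq_s32(acc, w, vones);
}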
@alankelly @fbarchard for reviews
Force push changes:
- Resolved some nits in the ukernels
- Removed OOM-ing benchmarks
- Used vdot instead of vpadd in neondot
@alankelly @dsharlet @fbarchard
New Changes:
- Remove $ABC from packw templates
- Move the neoni8mm kernel to neondot: change mmlaq to vdot and do the vpadd outside the loop
- Run clang-format
@fbarchard for another review
Updates:
- Updated kernels to support both signed and unsigned weights (zero_point = 0 and zero_point = 8); see the sketch after this list
- Added a comment about initializing the bias in the packing kernel
- Renamed templates to c4-neondot.in and c8-neondot.in
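A minimal sketch of the signed/unsigned handling, as I understand it (illustrative, not the kernel code): for zero_point = 8 the unsigned nibbles are XOR-ed with 0x88, which subtracts 8 from each 4-bit value modulo 16, so reinterpreting them as signed yields value - 8; for zero_point = 0 the mask is 0x00 and the bytes pass through unchanged.

#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* One byte holds two unsigned 4-bit weights: low nibble 0x0, high nibble 0xF. */
  const uint8_t packed = 0xF0;
  /* zero_point = 8: flip bit 3 of each nibble, i.e. (v - 8) mod 16 per nibble. */
  const uint8_t flipped = packed ^ 0x88;                     /* 0x78 */
  const int8_t lo = (int8_t) (uint8_t) (flipped << 4) >> 4;  /* sign-extend low nibble */
  const int8_t hi = (int8_t) flipped >> 4;                   /* sign-extend high nibble */
  printf("lo=%d hi=%d\n", lo, hi);  /* prints lo=-8 hi=7, i.e. 0-8 and 15-8 */
  return 0;
}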
Performance Updates: Updating to support both signed and unsigned seems to have a large effect on the scalar kernels. Performance regresses ~10%, and for some reason the scalar c4 kernel's binary size almost triples, from 3.96Ki to ~10Ki (the c8 scalar actually shrinks). I looked into it for some time but couldn't figure out why; since the scalar kernels aren't productionized here, I didn't block on it. The SIMD kernels don't change much:
Binary Size
0.1% 10.4Ki 1.2% 10.3Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar
0.1% 5.71Ki 0.6% 5.61Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar
0.1% 3.63Ki 0.3% 2.29Ki xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot
0.0% 2.42Ki 0.3% 2.32Ki xnn_qb4_packw_gemm_goi_ukernel_x16c8__neondot
Perf
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1206842 ns 1202209 ns 10 bytes=7.19524G/s cpufreq=2.3808G elements=3.47544G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__scalar_llm/B:1/M:128/N:4096/K:1024/real_time_mean 1387612 ns 1382595 ns 10 bytes=6.25789G/s cpufreq=2.3808G elements=3.02268G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c4__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 291198 ns 290106 ns 10 bytes=29.82G/s cpufreq=2.3808G elements=14.4036G/s
qb4_packw/xnn_qb4_packw_gemm_goi_ukernel_x16c8__neondot_llm/B:1/M:128/N:4096/K:1024/real_time_mean 280933 ns 279850 ns 10 bytes=30.9096G/s cpufreq=2.3808G elements=14.9299G/s
cc. @fbarchard
@fbarchard @dsharletg @alankelly
please take a look if you folks get the chance
Rebasing
Resolved remaining build failures.
This is broken for us with many errors of the form:
src/qb4-packw/gen/qb4-packw-x16c4-gemm-goi-scalar.c:23:38: error: cast from 'const unsigned char *' to 'unsigned int *' drops const qualifier [-Werror,-Wcast-qual]
23 | const uint32_t s_v0 = (((uint32_t*)weights)[0] ^ vkernel_zero_point);
let me fix that, thanks for the catch.
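For reference, the fix is presumably just preserving the const qualifier in the generated cast, along these lines (names mirror the error message above; illustrative, not the actual generated code):

#include <stdint.h>

/* Const-correct version of the cast that -Wcast-qual complained about. */
static uint32_t load_xor(const uint8_t* weights, uint32_t vkernel_zero_point) {
  const uint32_t s_v0 = ((const uint32_t*) weights)[0] ^ vkernel_zero_point;
  return s_v0;
}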
There are build warnings on the scalar kernel?
third_party/XNNPACK/src/qb4-packw/gen/qb4-packw-x16c4-gemm-goi-scalar.c:23:38: error: cast from 'const unsigned char *' to 'unsigned int *' drops const qualifier [-Werror,-Wcast-qual]
23 | const uint32_t s_v0 = (((uint32_t*)weights)[0] ^ vkernel_zero_point);
third_party/XNNPACK/src/qb4-packw/gen/qb4-packw-x16c4-gemm-goi-scalar.c:24:32: error: cast from 'const unsigned int *' to 'signed char *' drops const qualifier [-Werror,-Wcast-qual]
24 | const int8_t v0 = (((int8_t*)&s_v0))[0] << 4;
This is still not building:
XNNPACK/src/qb4-packw/gen/qb4-packw-x16c8-gemm-goi-aarch64-neondot.c:71:19: error: unused variable 'veor_mask' [-Werror,-Wunused-variable]
71 | const int8x16_t veor_mask = vmovq_n_s8(UINT8_C(0x88));
| ^~~~~~~~~
XNNPACK/src/qb4-packw/gen/qb4-packw-x16c8-gemm-goi-aarch64-neondot.c:72:19: error: unused variable 'neg_zp' [-Werror,-Wunused-variable]
72 | const int32x4_t neg_zp = vmovq_n_s32(-64);
| ^~~~~~
XNNPACK/src/qb4-packw/gen/qb4-packw-x16c8-gemm-goi-aarch64-neondot.c:74:18: error: unused variable 'vzeros' [-Werror,-Wunused-variable]
74 | const int8x8_t vzeros = vmov_n_s8(0);