XNNPACK
Support RVV x32-packw
Goal
Enable x32-packw to speed up dynamic fully connected layers in LLM models.
Background
A GEMM u-kernel uses the input and packed_weight (weight and bias) to compute the output values. Our GEMM implementations use LMUL & VLEN to determine the NR size. This PR provides RVV x32-packw implementations to speed up packing.
XNNPACK originally provided xnn_pack_f32_gemm_goi_w & xnn_pack_f32_gemm_gio_w to preprocess static weights offline. However, language models usually use GEMM with dynamic weights.
To speed up the packing process, XNNPACK provides x32-packw u-kernels.
x32-packw aims to pack the weights (col-major or OI) & bias into packed_weight buffers.
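For concreteness, here is a rough scalar sketch of the packing described above. It is a hypothetical reference routine, not the actual XNNPACK u-kernel or its exact buffer layout: each block of NR output channels stores its NR bias values followed by the weights laid out so that, for each k, the NR weights are contiguous.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar reference for the packing described above (OI weights):
// for each block of NR output channels, write NR biases, then for every k the
// NR weights w[n][k] contiguously; out-of-range channels are zero-padded.
static void pack_x32_goi_ref(size_t nc, size_t kc, size_t nr,
                             const uint32_t* weights,  // nc x kc, row-major (OI)
                             const uint32_t* bias,     // nc entries, may be NULL
                             uint32_t* packed) {
  for (size_t n0 = 0; n0 < nc; n0 += nr) {
    const size_t nb = (nc - n0) < nr ? (nc - n0) : nr;
    for (size_t i = 0; i < nr; i++) {
      *packed++ = (bias != NULL && i < nb) ? bias[n0 + i] : 0;
    }
    for (size_t k = 0; k < kc; k++) {
      for (size_t i = 0; i < nr; i++) {
        *packed++ = (i < nb) ? weights[(n0 + i) * kc + k] : 0;
      }
    }
  }
}
```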
Parameters
There are two parameters, NR & KBlock, for x32-packw.
NR is determined by VLEN & LMUL. If VLEN=512 & LMUL=4, NR = 64.
KBlock determines the maximum number of rows to transpose in a single iteration.
The image above is an example of NR=8 & KBlock=2.
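For illustration, on RVV the maximum NR for 32-bit elements at a given LMUL can be queried at run time with the vsetvlmax intrinsic; a minimal sketch assuming LMUL=4:

```c
#include <riscv_vector.h>
#include <stddef.h>

// NR for 32-bit elements at LMUL=4 is VLEN * 4 / 32.
// With VLEN=512 this returns 64, as in the VLEN=512 & LMUL=4 example.
size_t nr_e32_m4(void) {
  return __riscv_vsetvlmax_e32m4();
}
```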
X32-packw naming
RVV naming: x${LMUL}v_u${KBLOCK}
Other architectures' naming: x${NR}_u${KBLOCK}
Hi @fbarchard @alankelly, this PR adds support for RVV x32-packw. If you have free time, please help review.
Could you use a strided load to read each vector with a single instruction?
Using a strided segment load can give better performance.
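For reference, a minimal sketch (illustrative names, not code from this PR) of reading the NR weights for a single k out of OI-layout weights with one strided load, as suggested above; the byte stride between consecutive output channels is kc * sizeof(uint32_t):

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Load w[n0 .. n0 + vl - 1][k] from OI (nc x kc) weights with one strided load.
// Consecutive lanes come from consecutive output channels, kc * 4 bytes apart.
static inline vuint32m4_t load_weight_column(const uint32_t* weights,
                                              size_t n0, size_t k,
                                              size_t kc, size_t vl) {
  const uint32_t* base = weights + n0 * kc + k;
  return __riscv_vlse32_v_u32m4(base, (ptrdiff_t) (kc * sizeof(uint32_t)), vl);
}
```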
packw-x2v means a 4x2v gemm kernel would use this packing.
Yes, I am on the same page.
The correct solution is probably to branch the code, but the main loop will do a multiple of 4 bytes, so only the remainder code needs to handle a KC of 1 to 3.
Could you point me to where I can find it?
For the x8-packw that calls x32-packw, I made a hack PR (https://github.com/google/XNNPACK/pull/6356) where you can see the idea. But instead of calling a common function, I think it will need a custom x8-packw that does 4 bytes at a time in the main loop but handles the KC remainder.
Also, most of the 8-bit packing functions need a per-channel sum. In packing.c it sums up the weights for each NR:

```c
if (kc_idx < kc) {
  const int8_t kv = k[(nr_block_start + nr_block_offset) * kc + kc_idx];
  ksum += (uint32_t) kv;
  ((int8_t*) packed_weights)[kr_block_offset] = kv;
}
unaligned_indexed_store_u32(packed_b, nr_block_offset,
    unaligned_indexed_load_u32(packed_b, nr_block_offset) - ksum * izp);
```

and then adjusts the bias by the sum times the input zero point, which is a parameter.
The current x8-packw is for f32-qc8w-gemm, which doesn't need the sum, and I've only done scalar.
Can you please rebase, and I will land it first thing on Monday?
Hi @fbarchard, I got the idea.
> But instead of calling a common function, I think it will need a custom x8-packw that does 4 bytes at a time in the main loop but handles the KC remainder.
Adding code to handle the tail part (the KC remainder) could be a better idea. Also, we need to be careful about unaligned memory access issues on some architectures.
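As a rough, hypothetical sketch of that idea (not code from this PR): the main loop moves 4 bytes per step, a scalar tail covers a KC remainder of 1 to 3 bytes, and memcpy keeps the 32-bit accesses safe on targets without fast unaligned loads/stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical copy of one channel's kc weight bytes: 4 bytes per main-loop
// step, then a scalar tail for the KC remainder. memcpy avoids relying on
// unaligned 32-bit loads/stores, which some architectures penalize or fault on.
static void copy_channel_bytes(int8_t* dst, const int8_t* src, size_t kc) {
  size_t k = 0;
  for (; k + 4 <= kc; k += 4) {
    uint32_t block;
    memcpy(&block, src + k, sizeof(block));  // unaligned-safe 32-bit read
    memcpy(dst + k, &block, sizeof(block));  // unaligned-safe 32-bit write
  }
  for (; k < kc; k++) {
    dst[k] = src[k];  // KC remainder: 1 to 3 bytes
  }
}
```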
@alankelly I've rebased it.
Thanks, this will land today.