XNNPACK
Support RVV x32-packw
Goal
Enable x32-packw to speed up dynamic fully connected layers in LLM models.
Background
A GEMM u-kernel uses the input and packed_weight (weight and bias) to compute the output values. Our GEMM implementations use LMUL & VLEN to determine the NR size. This PR provides RVV x32-packw implementations to speed up packing.
XNNPACK originally provided xnn_pack_f32_gemm_goi_w & xnn_pack_f32_gemm_gio_w to preprocess static weights offline. However, language models usually use GEMM with dynamic weights.
To speed up the packing process, XNNPACK provides x32-packw u-kernels.
x32-packw aims to pack the weights (col-major or OI) & bias into packed_weight buffers.
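For concreteness, here is a rough scalar sketch of the packing described above. It is a hypothetical reference routine, not the actual XNNPACK u-kernel or its exact buffer layout: each block of NR output channels stores its NR bias values followed by the weights laid out so that, for each k, the NR weights are contiguous.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar reference for the packing described above (OI weights):
// for each block of NR output channels, write NR biases, then for every k the
// NR weights w[n][k] contiguously; out-of-range channels are zero-padded.
static void pack_x32_goi_ref(size_t nc, size_t kc, size_t nr,
                             const uint32_t* weights,  // nc x kc, row-major (OI)
                             const uint32_t* bias,     // nc entries, may be NULL
                             uint32_t* packed) {
  for (size_t n0 = 0; n0 < nc; n0 += nr) {
    const size_t nb = (nc - n0) < nr ? (nc - n0) : nr;
    for (size_t i = 0; i < nr; i++) {
      *packed++ = (bias != NULL && i < nb) ? bias[n0 + i] : 0;
    }
    for (size_t k = 0; k < kc; k++) {
      for (size_t i = 0; i < nr; i++) {
        *packed++ = (i < nb) ? weights[(n0 + i) * kc + k] : 0;
      }
    }
  }
}
```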
Parameters
There are two parameters, NR & KBlock, for x32-packw.
NR is determined by VLEN & LMUL. If VLEN=512 & LMUL=4, NR = 64.
KBlock determines the maximum number of rows to transpose in a single iteration.
The image above is an example of NR=8 & KBlock=2.
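For illustration, on RVV the maximum NR for 32-bit elements at a given LMUL can be queried at run time with the vsetvlmax intrinsic; a minimal sketch assuming LMUL=4:

```c
#include <riscv_vector.h>
#include <stddef.h>

// NR for 32-bit elements at LMUL=4 is VLEN * 4 / 32.
// With VLEN=512 this returns 64, as in the VLEN=512 & LMUL=4 example.
size_t nr_e32_m4(void) {
  return __riscv_vsetvlmax_e32m4();
}
```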
X32-packw naming
RVV naming: x${LMUL}v_u${KBLOCK}
Other architectures' naming: x${NR}_u${KBLOCK}
Hi @fbarchard @alankelly, this PR adds support for RVV x32-packw. If you have free time, please help review.
Could you use a strided load to read each vector with a single instruction?
Using a strided segment load can give better performance.
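For reference, a minimal sketch (illustrative names, not code from this PR) of reading the NR weights for a single k out of OI-layout weights with one strided load, as suggested above; the byte stride between consecutive output channels is kc * sizeof(uint32_t):

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Load w[n0 .. n0 + vl - 1][k] from OI (nc x kc) weights with one strided load.
// Consecutive lanes come from consecutive output channels, kc * 4 bytes apart.
static inline vuint32m4_t load_weight_column(const uint32_t* weights,
                                              size_t n0, size_t k,
                                              size_t kc, size_t vl) {
  const uint32_t* base = weights + n0 * kc + k;
  return __riscv_vlse32_v_u32m4(base, (ptrdiff_t) (kc * sizeof(uint32_t)), vl);
}
```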
packw-x2v means a 4x2v gemm kernel would use this packing.
Yes, I am on the same page.
The correct solution is probably to branch the code, but the main loop will do a multiple of 4 bytes, so only the remainder code needs to handle a KC of 1 to 3.
Could you point me to where I can find it?
For the x8-packw that calls x32-packw, I made a hack PR (https://github.com/google/XNNPACK/pull/6356) where you can see the idea. But instead of calling a common function, I think it will need a custom x8-packw that does 4 bytes at a time in the main loop but handles the KC remainder.
Also, most of the 8-bit packing functions need a per-channel sum. In packing.c it sums up the weights for each NR:

```c
if (kc_idx < kc) {
  const int8_t kv = k[(nr_block_start + nr_block_offset) * kc + kc_idx];
  ksum += (uint32_t) kv;
  ((int8_t*) packed_weights)[kr_block_offset] = kv;
}
unaligned_indexed_store_u32(packed_b, nr_block_offset,
    unaligned_indexed_load_u32(packed_b, nr_block_offset) - ksum * izp);
```

and then adjusts the bias by the sum times the input zero point, which is a parameter.
The current x8-packw is for f32-qc8w-gemm, which doesn't need the sum, and I've only done scalar.
Can you please rebase, and I will land it first thing on Monday?
Hi @fbarchard, I got the idea.
> But instead of calling a common function, I think it will need a custom x8-packw that does 4 bytes at a time in the main loop but handles the KC remainder.
Adding code to handle the tail part (the KC remainder) could be a better idea. Also, we need to be careful about unaligned memory access issues on some architectures.
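As a rough, hypothetical sketch of that idea (not code from this PR): the main loop moves 4 bytes per step, a scalar tail covers a KC remainder of 1 to 3 bytes, and memcpy keeps the 32-bit accesses safe on targets without fast unaligned loads/stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical copy of one channel's kc weight bytes: 4 bytes per main-loop
// step, then a scalar tail for the KC remainder. memcpy avoids relying on
// unaligned 32-bit loads/stores, which some architectures penalize or fault on.
static void copy_channel_bytes(int8_t* dst, const int8_t* src, size_t kc) {
  size_t k = 0;
  for (; k + 4 <= kc; k += 4) {
    uint32_t block;
    memcpy(&block, src + k, sizeof(block));  // unaligned-safe 32-bit read
    memcpy(dst + k, &block, sizeof(block));  // unaligned-safe 32-bit write
  }
  for (; k < kc; k++) {
    dst[k] = src[k];  // KC remainder: 1 to 3 bytes
  }
}
```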
@alankelly I've rebased it.
Thanks, this will land today.