[RFC][NOT_FOR_COMMIT] 4-bit Weight Blockwise Quantization for qd8-f[16,32]
Introduction
This proposal explores the implementation of blockwise quantization of 4-bit weights for qd8-f16 and qd8-f32 in XNNPACK, a technique that is particularly beneficial for large language models (LLMs) and Transformers. I believe this was initially introduced in a research paper and has since been implemented in slightly different forms in other frameworks. For the XNNPACK backend in PyTorch, we propose to investigate this technique for LLMs running on CPUs.
At a high level, this can be seen as an extension of per-channel weight quantization (qc4w): instead of a single scale per output channel, each output channel has multiple scales, one per block of input channels, determined by the block size.
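To make the scale indexing concrete, here is a minimal reference sketch in C. The function name, the row-major weight layout, and the scale layout are hypothetical and only illustrate how a per-block scale replaces the single per-channel scale of qc4w.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical reference dequantization of one weight element (names and
// layouts are assumptions). Per-channel qc4w would use scales[oc]; blockwise
// qb4w instead picks the scale for the block that input channel `ic` falls in.
static float dequantize_qb4w(
    const uint8_t* quantized,   // 4-bit values stored one per byte for clarity
    const float* scales,        // [output_channels][num_blocks], row-major
    size_t input_channels,
    size_t block_size,
    size_t oc, size_t ic,
    int32_t zero_point)
{
  const size_t num_blocks = (input_channels + block_size - 1) / block_size;
  const float scale = scales[oc * num_blocks + ic / block_size];
  const int32_t w = (int32_t) quantized[oc * input_channels + ic];
  return scale * (float) (w - zero_point);
}
```

Note that with block_size equal to the number of input channels there is a single block per output channel and the scheme degenerates to qc4w, which is one way to sanity-check it.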
Objective
The primary aim of this proposal is to gather early feedback on this proof-of-concept PR for blockwise quantization of the fully-connected operator. This quantization is static (weight-only) and is intended to be used with the f16 and f32 dynamically quantized fully-connected operators with 4-bit weights.
This proposal is not intended for a detailed review or merging at this stage. There are many design decisions to make for this and for similar quantization schemes likely to come up in the future. This RFC PR shares the technical design and serves as a baseline for an effective design review and discussion.
I am hoping for feedback on packing and precalculations, kernel design, API design (operator and subgraph), and naming conventions. I also hope to discuss a route to merging a version we all agree upon, so that XNNPACK stays aligned with our timeline.
Approach
WEIGHT PACKING
We propose a new weight packing routine that packs the scale for each block in between the 4-bit weights. Additionally, we introduce a packed-weight update routine to fill in the per-block scales after the weights have been packed. A sketch of the idea follows.
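The sketch below illustrates one possible packed layout; it is not the actual XNNPACK packing routine. The nibble order, the scale placement, and folding the scale write into the same pass (rather than the proposed separate fill-in pass) are simplifications for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Illustrative packing of one output channel (all names are assumptions):
// for each block of `block_size` input channels, emit the nibble-packed
// 4-bit weights followed by the per-block scale. In the proposal the scale
// slots are filled in by a second routine after the weights are packed.
static void pack_qb4w_channel(
    const uint8_t* weights,     // input_channels 4-bit values, one per byte
    const float* block_scales,  // num_blocks scales for this output channel
    size_t input_channels,
    size_t block_size,
    uint8_t* packed)            // output buffer sized for weights + scales
{
  const size_t num_blocks = (input_channels + block_size - 1) / block_size;
  for (size_t b = 0; b < num_blocks; b++) {
    // Pack two 4-bit weights per byte for this block, zero-padding the tail.
    for (size_t i = 0; i < block_size; i += 2) {
      const size_t ic = b * block_size + i;
      const uint8_t lo = ic < input_channels ? weights[ic] : 0;
      const uint8_t hi = ic + 1 < input_channels ? weights[ic + 1] : 0;
      *packed++ = (uint8_t) (lo | (hi << 4));
    }
    // Scale slot for this block, adjacent to its weights.
    memcpy(packed, &block_scales[b], sizeof(float));
    packed += sizeof(float);
  }
}
```

One likely benefit of interleaving the scales with the weights is that everything a kernel needs for one block sits in a single contiguous stream.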
KERNEL
A minimally viable 1x2 scalar kernel is added for experimentation. It also illustrates the high-level algorithm an optimized kernel might implement (sketched below). We anticipate further development and refinement of the next set of kernels based on feedback and suggestions, without major changes to the packing or the higher-level APIs.
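The scalar sketch below shows the per-block accumulation structure such a kernel might follow for a single output element. The names, the one-value-per-byte weight layout, and the omission of the activation (qd8) scale, bias, and clamping are simplifying assumptions, not the actual microkernel.

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of the per-block accumulation for one (row, column) output:
// integer dot product within a block, then the per-block weight scale is
// applied before accumulating into float.
static float qb4w_dot_1x1(
    const int8_t* a,            // qd8 dynamically quantized activations
    const uint8_t* w,           // 4-bit weights, one per byte for clarity
    const float* block_scales,  // one scale per block of input channels
    size_t input_channels,
    size_t block_size,
    int32_t weight_zero_point)
{
  float acc = 0.0f;
  const size_t num_blocks = (input_channels + block_size - 1) / block_size;
  for (size_t b = 0; b < num_blocks; b++) {
    int32_t block_acc = 0;  // integer accumulation within the block
    const size_t start = b * block_size;
    const size_t end = start + block_size < input_channels
        ? start + block_size : input_channels;
    for (size_t ic = start; ic < end; ic++) {
      block_acc += (int32_t) a[ic] * ((int32_t) w[ic] - weight_zero_point);
    }
    acc += (float) block_acc * block_scales[b];
  }
  // A real kernel would apply the activation scale, bias, and min/max
  // clamping after this loop.
  return acc;
}
```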
OPERATOR
We propose adding a new operator with a qb4w suffix, where b stands for blockwise. We also suggest introducing xnn_datatype_qbint4 as a new data type and xnn_operator_type_fully_connected_nc_f32_qb4w as a new operator type. The block size will be passed down through xnn_create_fully_connected_nc_qd8_f32_qb4w.
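To make the operator-level proposal concrete, one possible shape of the new create function is sketched below, modeled loosely on the existing qc4w create call. The parameter names, types, and ordering are assumptions to anchor the discussion; the only firm points from the proposal are the function name and that block_size is passed down here.

```c
#include <stddef.h>
#include <stdint.h>
#include "xnnpack.h"  // for xnn_status and xnn_operator_t

// Hypothetical signature sketch, not the final API.
enum xnn_status xnn_create_fully_connected_nc_qd8_f32_qb4w(
    size_t input_channels,
    size_t output_channels,
    size_t input_stride,
    size_t output_stride,
    size_t block_size,          // input channels sharing one scale
    uint8_t kernel_zero_point,
    const float* kernel_scale,  // per-block scales, [output_channels][num_blocks]
    const void* kernel,         // 4-bit weights
    const float* bias,
    float output_min,
    float output_max,
    uint32_t flags,
    xnn_operator_t* fully_connected_op_out);
```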
SUBGRAPH
A new tensor API xnn_define_blockwise_quantized_tensor_value with block_size and 2D scale parameters is introduced. The fully-connected subgraph operator will be updated to call the corresponding qb4w operator APIs.
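By analogy with the existing xnn_define_quantized_tensor_value, the new tensor-definition API might look roughly as follows. This is a sketch to seed discussion; the parameter names, the scale element type, and the ordering are assumptions, with block_size and the 2D per-block scales being the only points taken from the proposal.

```c
#include <stddef.h>
#include <stdint.h>
#include "xnnpack.h"  // for xnn_status, xnn_subgraph_t, xnn_datatype

// Hypothetical signature sketch, not the final API.
enum xnn_status xnn_define_blockwise_quantized_tensor_value(
    xnn_subgraph_t subgraph,
    enum xnn_datatype datatype,  // e.g. the proposed xnn_datatype_qbint4
    int32_t zero_point,
    const float* scale,          // 2D per-block scales, [output_channels][num_blocks]
    size_t num_dims,
    size_t block_size,           // input channels sharing one scale
    const size_t* dims,
    const void* data,
    uint32_t external_id,
    uint32_t flags,
    uint32_t* id_out);
```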
Limitations and Future Work
This proof of concept has known limitations, such as the lack of support for transposed weights and vectorized kernels. However, we believe that by taking a depth-first approach and getting a working end-to-end PR out for RFC, we can iterate much faster than by backtracking and updating frequently.
Added a single 1x16c4 NEON dotprod kernel for experimentation. Benchmarks for qb4w have not been added yet, nor has any time been spent on performance.
pushed v7
pushed v8 - Add qd8-f16-qb4w support for early experimentation.
Closing this RFC PR as we ramp up on developing qb4w as a feature. We will be upstreaming it in the form of multiple smaller PRs.