quanto
Investigate: densely pack the scale+shift tensors into the weight tensors for highly quantized tensors
Context: https://x.com/marcaruel/status/1818265542442066307
Code ref: https://github.com/huggingface/optimum-quanto/blob/main/optimum/quanto/tensor/qbits/qbits.py#L146
optimum-quanto's QBitsTensor stores scale and shift in separate tensors, with one value per group of group_size weights. It works, but there are two issues:
- The alpha and bias (scale and shift) values are not stored next to the blocks they describe. This hurts memory locality and likely costs performance, since the core loop does more random accesses than necessary.
- There is only a single level of scale factors, instead of super blocks (e.g. one scale factor every 8 weights, then one super-scale every 8 blocks) to increase dynamic-range precision. I believe this would improve the effective precision at very low cost: roughly 1/64 (~1.6%) more data and a negligible speed penalty if the inner loop is well coded. See the sketch after this list.
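To make the super-block idea concrete, here is a minimal NumPy sketch of two-level scaling under hypothetical sizes (8 weights per block, 8 blocks per super block) and a symmetric 4-bit scheme. It is not optimum-quanto's or llama.cpp's actual format, just an illustration of the trade-off: one extra super-scale per 64 weights, with the per-block scales re-quantized to 8 bits against it.

```python
import numpy as np

BLOCK = 8             # weights per block (hypothetical)
BLOCKS_PER_SUPER = 8  # blocks per super block -> 64 weights

def quantize_super_block(w):
    """Quantize 64 float weights to symmetric 4-bit with two-level scales."""
    blocks = w.reshape(BLOCKS_PER_SUPER, BLOCK)
    # One float scale per block of 8 weights (4-bit symmetric range [-8, 7]).
    sub_scales = np.abs(blocks).max(axis=1) / 7.0
    # One super-scale per 8 blocks; the sub-scales are re-quantized to 8 bits
    # against it, so the extra data is ~1 value per 64 weights (1/64).
    max_sub = sub_scales.max()
    super_scale = max_sub / 255.0 if max_sub > 0 else 1.0
    q_sub_scales = np.clip(np.round(sub_scales / super_scale), 0, 255).astype(np.uint8)
    effective = q_sub_scales.astype(np.float32) * np.float32(super_scale)
    effective = np.where(effective == 0, 1.0, effective)  # guard all-zero blocks
    q = np.clip(np.round(blocks / effective[:, None]), -8, 7).astype(np.int8)
    return np.float32(super_scale), q_sub_scales, q

def dequantize_super_block(super_scale, q_sub_scales, q):
    effective = q_sub_scales.astype(np.float32) * super_scale
    return (q.astype(np.float32) * effective[:, None]).reshape(-1)

w = np.random.randn(BLOCK * BLOCKS_PER_SUPER).astype(np.float32)
s, ss, q = quantize_super_block(w)
print("max reconstruction error:", np.abs(w - dequantize_super_block(s, ss, q)).max())
```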
For precedent, see the super-block quantization unit that llama.cpp has been using for more than a year, which delivers high performance with good cache locality. https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-common.h#L280
IMO, the goal should be to reproduce the same layout in Python.
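As a starting point, the dense layout could be expressed as a NumPy structured dtype, so that the scales, shifts and the 4-bit weights they describe live in one contiguous record. The field names and sizes below are hypothetical, loosely modeled on the llama.cpp block structs rather than the exact ggml layout.

```python
import numpy as np

WEIGHTS_PER_SUPER = 64  # hypothetical: 8 blocks of 8 weights

# One record per super block: all metadata sits next to the weights it
# describes, so a kernel streaming through the buffer pulls scales, shifts
# and quantized values from the same cache lines.
packed_super_block = np.dtype([
    ("d",      np.float16),              # super-scale for the block scales
    ("dmin",   np.float16),              # super-scale for the block shifts
    ("scales", np.uint8, 8),             # quantized scale, one per block of 8 weights
    ("shifts", np.uint8, 8),             # quantized shift, one per block of 8 weights
    ("qs",     np.uint8, WEIGHTS_PER_SUPER // 2),  # 4-bit weights, two per byte
])

# 20 bytes of metadata + 32 bytes of packed weights per 64 weights.
print(packed_super_block.itemsize, "bytes per", WEIGHTS_PER_SUPER, "weights")

# A weight row of 1024 elements is then a contiguous array of these records.
row = np.zeros(1024 // WEIGHTS_PER_SUPER, dtype=packed_super_block)
```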