
Investigate: densely pack the scale+shift tensors into the weight tensors for highly quantized tensors

Open maruel opened this issue 6 months ago • 0 comments

Context: https://x.com/marcaruel/status/1818265542442066307

Code ref: https://github.com/huggingface/optimum-quanto/blob/main/optimum/quanto/tensor/qbits/qbits.py#L146

optimum-quanto's QBitsTensor stores scale and shift in separate tensors, one value per block of group_size weights. This works, but there are two issues:

  • The alpha and bias (scale and shift) values are not stored next to their corresponding weight blocks. This hurts memory locality and likely costs performance, since the core loop does more random accesses than necessary (see the sketch after this list).
  • There is only one level of scale factors, instead of super blocks (e.g. one scale factor every 8 weights, then another every 8 blocks) to increase the dynamic range precision. I believe this would improve the effective precision at very low cost: around 1/64 (1.6%) more data and an imperceptible performance impact if the inner loop is well coded.
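
For illustration, here is a minimal sketch of what a locality-friendly layout could look like, using NumPy and assumed parameters (a hypothetical group size of 32, float16 scale/shift, 4-bit weights packed two per byte); none of these names or sizes come from optimum-quanto:

```python
# Hypothetical interleaved layout: the scale, shift and packed 4-bit weights of
# one group live in a single contiguous record instead of three separate tensors.
import numpy as np

GROUP_SIZE = 32  # assumed group size, not optimum-quanto's actual default

block_dtype = np.dtype([
    ("scale", np.float16),
    ("shift", np.float16),
    ("qweights", np.uint8, GROUP_SIZE // 2),  # 4-bit values, two per byte
])

# A weight tensor with N groups becomes one contiguous array of N records, so the
# inner loop reads scale, shift and weights from the same cache lines.
n_groups = 1024
packed = np.zeros(n_groups, dtype=block_dtype)
assert packed.dtype.itemsize == 2 + 2 + GROUP_SIZE // 2  # 20 bytes per group
```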

For precedent, see the super block quantization unit that llama.cpp has been using for more than a year, which gives high performance with good cache locality. https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-common.h#L280
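
As a rough Python rendering of the two-level idea (not the actual llama.cpp layout, which uses larger super blocks and more compact sub-block scales), here is a sketch with assumed sizes of 8 sub-blocks of 8 weights each, matching the example above:

```python
# Hypothetical two-level (super block) layout: one float16 scale/shift pair per
# super block, plus small integer multipliers per sub-block.
import numpy as np

SUB_BLOCK = 8                      # weights per sub-block (assumed)
N_SUB = 8                          # sub-blocks per super block (assumed)
SUPER_BLOCK = SUB_BLOCK * N_SUB    # 64 weights per super block

super_block_dtype = np.dtype([
    ("d", np.float16),                   # super-block scale
    ("dmin", np.float16),                # super-block shift
    ("sub_scales", np.uint8, N_SUB),     # per-sub-block scale multipliers
    ("sub_mins", np.uint8, N_SUB),       # per-sub-block shift multipliers
    ("qs", np.uint8, SUPER_BLOCK // 2),  # 4-bit quants, two per byte
])

def dequantize_super(blocks: np.ndarray) -> np.ndarray:
    # Effective scale of sub-block j is d * sub_scales[j];
    # effective shift is dmin * sub_mins[j].
    lo = (blocks["qs"] & 0x0F).astype(np.float32)
    hi = (blocks["qs"] >> 4).astype(np.float32)
    # Low/high nibble ordering is arbitrary here; pick whatever the kernel expects.
    q = np.stack([lo, hi], axis=-1).reshape(len(blocks), N_SUB, SUB_BLOCK)
    scale = blocks["d"][:, None].astype(np.float32) * blocks["sub_scales"]
    shift = blocks["dmin"][:, None].astype(np.float32) * blocks["sub_mins"]
    return q * scale[:, :, None] - shift[:, :, None]
```

The key property is that d, dmin, the sub-block scales, and the quants of one super block all sit in one contiguous record, so the inner loop never has to chase a separate scale tensor.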

IMO, you should aim to reproduce the same thing in Python.

maruel avatar Aug 01 '24 15:08 maruel