mobicham
Thanks for the amazing work @efrantar! Regarding the zero-point, it is actually very important to have it, especially at low bits. In fact, the zero-point is _more_ important than the...
Yeah, we are actively looking for ways to speed it up, but it might take a while, because re-using available kernels is not fully compatible with HQQ's logic.
We have accelerated 4-bit inference now (see https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend). 1-bit acceleration is the holy grail and is a work in progress.
You can use the BitBlas backend with 2-bit; you can follow this example: https://github.com/mobiusml/hqq/blob/master/examples/backends/bitblas_int4_demo.py
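A minimal sketch of that flow, assuming a recent transformers/hqq install (the model id is a placeholder and exact `HqqConfig` kwargs such as `axis` may differ between versions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

# 2-bit weights, group-size 64; the fused backends expect axis=1 quantization
quant_config = HqqConfig(nbits=2, group_size=64, axis=1)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)

# One-time conversion of the quantized layers to the BitBlas kernel format
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="bitblas")
```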
> Static Cache was moved to be a standalone object in #30476. You have to init StaticCache outside the model and pass in every forward call, similar to following:

Thanks,...
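For reference, that pattern looks roughly like this (a sketch assuming a recent transformers release; the `StaticCache` constructor has changed between versions, and the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)

# Init the cache outside the model ...
past_key_values = StaticCache(
    config=model.config,
    max_batch_size=1,
    max_cache_len=1024,
    device=model.device,
    dtype=model.dtype,
)

# ... and pass it in every forward call
with torch.no_grad():
    out = model(**inputs, past_key_values=past_key_values, use_cache=True)
```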
I ran lm-eval with `Llama3-8B-Instruct`, quantizing the `lm_head` and the embeddings with HQQ, 4-bit, group-size=64. It's working fine (or even better :D ). Here's how to quantize the embedding...
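The original snippet is cut off above; as a rough sketch of the `lm_head` side, which is a regular `nn.Linear` and can be wrapped directly with `HQQLinear` (kwargs shown are assumptions based on the hqq API and may vary by version):

```python
import torch
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

# 4-bit, group-size 64, matching the lm-eval setup above
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# lm_head is a plain nn.Linear, so it can be wrapped directly
model.lm_head = HQQLinear(
    model.lm_head,
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device="cuda",
    del_orig=True,
)
```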
@Orion-Zheng currently, that's not the case for `lm_head`, because the fused gemv dequantizes the weights on-the-fly; it is the case for the Embedding layer, though. However, it's possible to have...
@minhthuc2502 we use the int4mm kernel from torchao: https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py
The conversion is only done once. Different int4 kernels require different input formats, which is why we do it via patching: that way we can support many backends, not just the...
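To make the patching idea concrete, here is a toy illustration (not HQQ's actual code): the weight is converted once into the backend's preferred layout, and the layer's `forward` is swapped out to call that backend's matmul.

```python
import torch
import torch.nn as nn

def patch_linear(layer: nn.Linear, convert_fn, matmul_fn):
    # One-time conversion of the weight into the backend-specific format
    layer.W_backend = convert_fn(layer.weight.data)

    def forward(x):
        # Every subsequent call goes through the backend's kernel
        y = matmul_fn(x, layer.W_backend)
        return y + layer.bias if layer.bias is not None else y

    layer.forward = forward
    return layer

# Example: a dummy "backend" that just transposes once and uses a plain matmul
layer = patch_linear(nn.Linear(16, 32),
                     convert_fn=lambda W: W.t().contiguous(),
                     matmul_fn=torch.matmul)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 32])
```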
@minhthuc2502 you have to quantize with HQQ. The link you shared is just doing RTN quantization, which will give poor quality, especially at lower bits.
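For reference, a minimal sketch of HQQ quantization using the hqq library's own engine (the model id is a placeholder and API details may vary with the hqq version):

```python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = HQQModelForCausalLM.from_pretrained(model_id)

# HQQ optimizes the zero-point/scale instead of plain round-to-nearest (RTN)
quant_config = BaseQuantizeConfig(nbits=2, group_size=64)
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16, device="cuda")
```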