mobicham

113 comments of mobicham

Thanks for the amazing work @efrantar! Regarding the zero-point, it is actually very important to have it, especially at low bits. In fact, the zero-point is _more_ important than the...
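To make the role of the zero-point concrete, here is a toy sketch of per-group asymmetric quantization (this is not HQQ's implementation, which additionally optimizes the scale/zero-point; it only illustrates how the zero-point shifts the quantization grid when there are very few levels):

```python
import torch

def asymmetric_quantize(w: torch.Tensor, nbits: int = 2, group_size: int = 64):
    """Toy per-group asymmetric quantization: q = round(w / scale + zero)."""
    w = w.reshape(-1, group_size)
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    q_max = 2**nbits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / q_max
    zero  = torch.round(-w_min / scale)               # zero-point: shifts the grid per group
    q     = torch.clamp(torch.round(w / scale + zero), 0, q_max)
    w_deq = (q - zero) * scale                        # dequantize back to float
    return q, scale, zero, w_deq

w = torch.randn(4096, 64)
q, scale, zero, w_deq = asymmetric_quantize(w, nbits=2, group_size=64)
print((w.reshape(-1, 64) - w_deq).abs().mean())      # reconstruction error with a zero-point
```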

Yeah, we are actively looking for ways to speed it up, but it might take a while, because re-using available kernels is not fully compatible with HQQ's logic.

We have accelerated inference for 4-bit now (see https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend). 1-bit acceleration is the holy grail and is work in progress.

You can use the BitBLAS backend with 2-bit quantization; you can follow this example: https://github.com/mobiusml/hqq/blob/master/examples/backends/bitblas_int4_demo.py
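For reference, a rough sketch of what that setup can look like, assuming the `HqqConfig` integration in transformers and `prepare_for_inference` from `hqq.utils.patching` (argument names and the exact API may differ between versions; the linked demo is authoritative):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig
from hqq.utils.patching import prepare_for_inference  # assumed import path, see the linked demo

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder

# 2-bit HQQ quantization config (group_size is illustrative)
quant_config = HqqConfig(nbits=2, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)

# Patch the quantized layers so they run through the BitBLAS low-bit kernels
prepare_for_inference(model, backend="bitblas")
```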

> Static Cache was moved to be a standalone object in #30476. You have to init StaticCache outside the model and pass in every forward call, similar to following:

Thanks,...
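A minimal sketch of the pattern described in the quote, assuming a transformers version that exposes `StaticCache` (model id and cache sizes are placeholders; the constructor arguments have changed across releases):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The cache is created outside the model ...
past_key_values = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=512, device="cuda", dtype=torch.float16
)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

# ... and passed in explicitly on every forward call
with torch.no_grad():
    out = model(**inputs, past_key_values=past_key_values, use_cache=True)
```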

I ran lm-eval with `Llama3-8B-Instruct`, quantizing the lm-head and the embeddings with HQQ (4-bit, group-size=64). It's working fine (or even better :D ).

![llama3_results](https://github.com/huggingface/transformers/assets/37179323/94496af2-070d-48c5-a66b-cd3a7c4d5781)

Here's how to quantize the embedding...
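As a rough illustration of the kind of setup involved, here is a sketch of wrapping an lm-head-style `nn.Linear` with HQQ, assuming the `HQQLinear` / `BaseQuantizeConfig` API (layer sizes are illustrative; the truncated snippet above is the authoritative recipe, and the embedding layer is not covered here):

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# 4-bit, group-size 64, matching the settings mentioned above
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Stand-in for an lm_head: a plain nn.Linear projecting hidden states to the vocab
lm_head = nn.Linear(4096, 32000, bias=False).half().cuda()

# Wrap it with HQQ; the quantized layer is a drop-in replacement
lm_head_hqq = HQQLinear(lm_head, quant_config=quant_config, compute_dtype=torch.float16, device="cuda")

x = torch.randn(1, 8, 4096, dtype=torch.float16, device="cuda")
y = lm_head_hqq(x)  # same call signature as the original nn.Linear
```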

@Orion-Zheng currently, that's not the case for `lm_head`, because the fused gemv dequantizes the weights on-the-fly. It is, however, the case for the Embedding layer. That said, it's possible to have...

@minhthuc2502 we use the int4mm kernel from torchao: https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py

The conversion is only done once. Different int4 kernels require different input formats, which is why we do it via patching: that way we can support many backends, not just the...
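A generic sketch of that patching pattern (illustrative only, not HQQ's actual code; `pack_for_backend` and `backend_gemv` are hypothetical stand-ins for a kernel's packing and matmul routines):

```python
import torch
import torch.nn as nn

def pack_for_backend(w_int: torch.Tensor) -> torch.Tensor:
    """Hypothetical: reorder/pack the quantized weight into the layout a given kernel expects."""
    return w_int.contiguous()  # a real backend would bit-pack / tile here

def backend_gemv(x: torch.Tensor, w_packed: torch.Tensor, scale, zero) -> torch.Tensor:
    """Hypothetical: fused kernel that dequantizes on-the-fly; emulated here with plain torch."""
    w = (w_packed.float() - zero) * scale
    return x @ w.t().to(x.dtype)

def patch_linear(layer: nn.Module, w_int, scale, zero):
    """One-time conversion + forward patch: the packed weight is stored once,
    and the layer's forward is rerouted to the backend kernel. Supporting another
    backend only means swapping pack_for_backend / backend_gemv."""
    layer.w_packed = pack_for_backend(w_int)
    layer.scale, layer.zero = scale, zero
    layer.forward = lambda x: backend_gemv(x, layer.w_packed, layer.scale, layer.zero)
    return layer
```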

@minhthuc2502 you have to quantize with HQQ. The link you shared is just doing RTN (round-to-nearest) quantization, which will give bad quality, especially at lower bits.