hqq
How to accelerate the inference speed of 1bit+lora model
It's very slow: a 34B model with 1-bit + LoRA runs at about 1 token/s.
Yeah, we are actively looking for ways to speed it up, but it might take a while: re-using the available kernels is not fully compatible with HQQ's logic.
🥺
We have now accelerated inference for 4-bit: https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend 1-bit acceleration is the holy grail and is a work in progress.
In the meantime, you can use the BitBLAS backend with 2-bit. You can follow this example: https://github.com/mobiusml/hqq/blob/master/examples/backends/bitblas_int4_demo.py
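For reference, a minimal sketch of the 2-bit + BitBLAS path, based on the API shown in the HQQ README rather than the linked demo itself (the model id is a placeholder, and exact names such as `AutoHQQHFModel`, `BaseQuantizeConfig`, and `prepare_for_inference` should be checked against the version of `hqq` you have installed):

```python
# Sketch: 2-bit HQQ quantization with the BitBLAS backend.
# Assumes the hqq and bitblas packages are installed and a CUDA GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig
from hqq.utils.patching import prepare_for_inference

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; use your own model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 2-bit weights; group_size must be one the BitBLAS kernels support
quant_config = BaseQuantizeConfig(nbits=2, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")

# Replace the quantized linear layers with fused BitBLAS kernels
prepare_for_inference(model, backend="bitblas")
```

Note that the first BitBLAS run compiles and caches kernels, so the initial forward pass can be slow; subsequent runs should show the speed-up.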