
How to accelerate the inference speed of a 1-bit + LoRA model

Minami-su opened this issue 1 year ago • 3 comments

It's very slow: a 34B model with 1-bit quantization + LoRA runs at about 1 token/s.

Minami-su avatar Apr 04 '24 05:04 Minami-su

Yeah, we are actively looking for ways to speed it up, but it might take a while, because re-using available kernels is not fully compatible with HQQ's logic.

mobicham avatar Apr 04 '24 07:04 mobicham

🥺

Minami-su avatar May 02 '24 10:05 Minami-su

We have accelerated 4-bit inference now; see the backend section of the README: https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend
1-bit acceleration is the holy grail and is still a work in progress.

mobicham avatar May 02 '24 10:05 mobicham
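For reference, a minimal sketch of the workflow the linked backend section describes: quantize to 4-bit, then switch to an accelerated backend. The model id, group size, and backend names below are assumptions based on the README around this time and may not match the current hqq API exactly.

```python
# Hedged sketch (not the exact README snippet): 4-bit HQQ quantization plus an
# accelerated backend. Names and arguments may differ between hqq versions.
import torch
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

model = HQQModelForCausalLM.from_pretrained(model_id)
# axis=1 grouping is assumed to be required by the optimized backends
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16, device="cuda")

# Faster dequantization path via torch.compile
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

# Alternatively (assumption: available in 2024+ releases), patch the quantized
# layers to use fused low-bit matmul kernels:
# from hqq.utils.patching import prepare_for_inference
# prepare_for_inference(model, backend="torchao_int4")
```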

You can use the bitblas backend with 2-bit; follow this example: https://github.com/mobiusml/hqq/blob/master/examples/backends/bitblas_int4_demo.py

mobicham avatar Aug 28 '24 10:08 mobicham
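A minimal sketch of what the linked demo does, adapted to 2-bit as suggested above. The `AutoHQQHFModel` / `prepare_for_inference` helper names, the placeholder model id, and the 2-bit settings are assumptions based on the example and may differ across hqq versions.

```python
# Hedged sketch loosely following bitblas_int4_demo.py, adapted to 2-bit.
# Helper names and arguments are assumptions and may differ across versions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig
from hqq.utils.patching import prepare_for_inference

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 2-bit settings; axis=1 grouping is assumed to be what the bitblas kernels expect
quant_config = BaseQuantizeConfig(nbits=2, group_size=64, axis=1)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")

# Swap the quantized layers' matmul for bitblas kernels
prepare_for_inference(model, backend="bitblas")
```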