mobicham

113 comments of mobicham

Thanks for the amazing work @efrantar! Regarding the zero-point, it is actually very important to have it, especially at low bits. In fact, the zero-point is _more_ important than the...
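To make the role of the zero-point concrete, here is a toy sketch of per-group asymmetric quantization (this is not HQQ's implementation, which additionally optimizes the scale/zero-point; it only illustrates how the zero-point shifts the quantization grid when there are very few levels):

```python
import torch

def asymmetric_quantize(w: torch.Tensor, nbits: int = 2, group_size: int = 64):
    """Toy per-group asymmetric quantization: q = round(w / scale + zero)."""
    w = w.reshape(-1, group_size)
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    q_max = 2**nbits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / q_max
    zero  = torch.round(-w_min / scale)               # zero-point: shifts the grid per group
    q     = torch.clamp(torch.round(w / scale + zero), 0, q_max)
    w_deq = (q - zero) * scale                        # dequantize back to float
    return q, scale, zero, w_deq

w = torch.randn(4096, 64)
q, scale, zero, w_deq = asymmetric_quantize(w, nbits=2, group_size=64)
print((w.reshape(-1, 64) - w_deq).abs().mean())      # reconstruction error with a zero-point
```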

Yeah, we are actively looking for ways to speed it up, but it might take a while, because re-using available kernels is not fully compatible with HQQ's logic.

We have accelerated inference for 4-bit now (see https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend). 1-bit acceleration is the holy grail and is work in progress.

You can use the BitBLAS backend with 2-bit quantization; you can follow this example: https://github.com/mobiusml/hqq/blob/master/examples/backends/bitblas_int4_demo.py
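For reference, a rough sketch of what that setup can look like, assuming the `HqqConfig` integration in transformers and `prepare_for_inference` from `hqq.utils.patching` (argument names and the exact API may differ between versions; the linked demo is authoritative):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig
from hqq.utils.patching import prepare_for_inference  # assumed import path, see the linked demo

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder

# 2-bit HQQ quantization config (group_size is illustrative)
quant_config = HqqConfig(nbits=2, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)

# Patch the quantized layers so they run through the BitBLAS low-bit kernels
prepare_for_inference(model, backend="bitblas")
```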

> Static Cache was moved to be a standalone object in #30476. You have to init StaticCache outside the model and pass in every forward call, similar to following:

Thanks,...
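A minimal sketch of the pattern described in the quote, assuming a transformers version that exposes `StaticCache` (model id and cache sizes are placeholders; the constructor arguments have changed across releases):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The cache is created outside the model ...
past_key_values = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=512, device="cuda", dtype=torch.float16
)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

# ... and passed in explicitly on every forward call
with torch.no_grad():
    out = model(**inputs, past_key_values=past_key_values, use_cache=True)
```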

I ran lm-eval with `Llama3-8B-Instruct`, quantizing the lm-head and the embeddings with HQQ (4-bit, group-size=64). It's working fine (or even better :D ).

![llama3_results](https://github.com/huggingface/transformers/assets/37179323/94496af2-070d-48c5-a66b-cd3a7c4d5781)

Here's how to quantize the embedding...
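As a rough illustration of the kind of setup involved, here is a sketch of wrapping an lm-head-style `nn.Linear` with HQQ, assuming the `HQQLinear` / `BaseQuantizeConfig` API (layer sizes are illustrative; the truncated snippet above is the authoritative recipe, and the embedding layer is not covered here):

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# 4-bit, group-size 64, matching the settings mentioned above
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Stand-in for an lm_head: a plain nn.Linear projecting hidden states to the vocab
lm_head = nn.Linear(4096, 32000, bias=False).half().cuda()

# Wrap it with HQQ; the quantized layer is a drop-in replacement
lm_head_hqq = HQQLinear(lm_head, quant_config=quant_config, compute_dtype=torch.float16, device="cuda")

x = torch.randn(1, 8, 4096, dtype=torch.float16, device="cuda")
y = lm_head_hqq(x)  # same call signature as the original nn.Linear
```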

@Orion-Zheng currently, that's not the case for `lm_head`, because the fused gemv dequantizes the weights on-the-fly. It is, however, the case for the Embedding layer. That said, it's possible to have...

@minhthuc2502 we use the int4mm kernel from torchao: https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py

The conversion is only done once. Different int4 kernels require different input formats, which is why we do it via patching: that way we can support many backends, not just the...
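A generic sketch of that patching pattern (illustrative only, not HQQ's actual code; `pack_for_backend` and `backend_gemv` are hypothetical stand-ins for a kernel's packing and matmul routines):

```python
import torch
import torch.nn as nn

def pack_for_backend(w_int: torch.Tensor) -> torch.Tensor:
    """Hypothetical: reorder/pack the quantized weight into the layout a given kernel expects."""
    return w_int.contiguous()  # a real backend would bit-pack / tile here

def backend_gemv(x: torch.Tensor, w_packed: torch.Tensor, scale, zero) -> torch.Tensor:
    """Hypothetical: fused kernel that dequantizes on-the-fly; emulated here with plain torch."""
    w = (w_packed.float() - zero) * scale
    return x @ w.t().to(x.dtype)

def patch_linear(layer: nn.Module, w_int, scale, zero):
    """One-time conversion + forward patch: the packed weight is stored once,
    and the layer's forward is rerouted to the backend kernel. Supporting another
    backend only means swapping pack_for_backend / backend_gemv."""
    layer.w_packed = pack_for_backend(w_int)
    layer.scale, layer.zero = scale, zero
    layer.forward = lambda x: backend_gemv(x, layer.w_packed, layer.scale, layer.zero)
    return layer
```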

@minhthuc2502 you have to quantize with HQQ. The link you shared is just doing RTN (round-to-nearest) quantization, which will give bad quality, especially at lower bits.