hqq
Official implementation of Half-Quadratic Quantization (HQQ)
So far, HQQ multi-GPU support is only available through the `quantize_model` call.
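For reference, a minimal sketch of what that entry point looks like; the list-of-devices argument follows my reading of the multi-GPU note above, and the model id is just a placeholder, so verify both against the README:

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

# Load the full-precision model first (placeholder model id).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Multi-GPU quantization: pass a list of devices (assumed form of the
# multi-GPU support). Other entry points, e.g. `from_quantized`, do not
# take a device list yet.
AutoHQQHFModel.quantize_model(
    model,
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device=["cuda:0", "cuda:1"],
)
```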
Why is it so slow? A 34B model with 1-bit + LoRA generates only about 1 token/s.
I installed BitBLAS with `pip install bitblas`. However, running an example shows the following warning: "Warning: failed to import the BitBlas backend. Check if BitBlas is correctly installed if you...
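A quick way to narrow this down is to import BitBLAS directly, outside HQQ: if the bare import fails (a CUDA/toolchain mismatch is a common cause), the warning comes from the BitBLAS install rather than from HQQ. The `prepare_for_inference` call in the comment is my understanding of where the backend gets selected; check the README for the exact API.

```python
# Check that BitBLAS imports on its own before blaming HQQ.
try:
    import bitblas  # noqa: F401
    print("BitBLAS imported OK")
except Exception as e:
    print("BitBLAS import failed:", e)

# Assumed HQQ usage (verify against the README): the backend is chosen
# when patching the quantized model for inference.
# from hqq.utils.patching import prepare_for_inference
# prepare_for_inference(model, backend="bitblas")
```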
I'm trying to quantize 405B, but then I'm unable to upload it to HF, since it's ~200GB and HF LFS has a 50GB limit per file. Is there a...
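One workaround sketch, assuming the checkpoint serializes through `save_pretrained`: shard it well below the LFS cap before uploading. `max_shard_size` is a standard `save_pretrained` argument, but whether an HQQ-quantized 405B model round-trips through this path is an assumption (HQQ also ships its own save/load helpers), and the repo id below is hypothetical.

```python
from huggingface_hub import HfApi

# Split the checkpoint into files well under the 50GB LFS limit.
# `model` here is the already-quantized model from earlier steps.
model.save_pretrained("llama-405b-hqq", max_shard_size="20GB")

# Upload the sharded folder to the Hub.
api = HfApi()
api.upload_folder(
    folder_path="llama-405b-hqq",
    repo_id="your-username/llama-405b-hqq",  # hypothetical repo id
    repo_type="model",
)
```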
Thanks for your work. May I ask when you expect to implement the `.cpu()` method of `HQQLinear`? Or could you briefly describe how to implement it? I can implement...
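For anyone experimenting in the meantime, a rough sketch of what such a method might do, assuming `HQQLinear` keeps its packed quantized weight in `W_q` and its quantization state (scale, zero point, shapes) in a `meta` dict; those attribute names follow my reading of the source and should be double-checked.

```python
import torch

def hqq_linear_to_cpu(layer):
    # Move the packed quantized weight (attribute name assumed).
    layer.W_q = torch.nn.Parameter(layer.W_q.data.cpu(), requires_grad=False)
    # Move every tensor stored in the quantization metadata.
    for key, value in layer.meta.items():
        if isinstance(value, torch.Tensor):
            layer.meta[key] = value.cpu()
    # Move the bias, if present.
    if layer.bias is not None:
        layer.bias = torch.nn.Parameter(layer.bias.data.cpu(), requires_grad=False)
    layer.device = "cpu"  # assumed bookkeeping attribute
    return layer
```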
Because `quant_config` is gone when you load a model using `from_quantized`, I tried to re-add the `quant_config` here, so that when we call `prepare_for_inference` on a loaded quantized model, it will not...
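A hedged sketch of the kind of workaround being described: walk the loaded model and re-attach a `quant_config` to every `HQQLinear` that lost it. The `quant_config` attribute name is an assumption about HQQ internals, not a confirmed API.

```python
from hqq.core.quantize import HQQLinear

def reattach_quant_config(model, quant_config):
    # Re-add the config that `from_quantized` dropped, so a later
    # `prepare_for_inference` call can find it (attribute name assumed).
    for module in model.modules():
        if isinstance(module, HQQLinear) and getattr(module, "quant_config", None) is None:
            module.quant_config = quant_config
```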
Can activation quantization also be introduced in HQQ? If not, is there any process/method that can further quantize the activations after using HQQ to quantize the weights?
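HQQ targets weights; for context, here is a generic illustration (not an HQQ feature) of dynamic per-tensor activation quantization, which simulates rounding an activation to an `nbits` integer grid on the fly and dequantizing it before the matmul.

```python
import torch

def quant_dequant_activation(x: torch.Tensor, nbits: int = 8) -> torch.Tensor:
    qmax = 2 ** (nbits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax               # dynamic per-tensor scale
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)  # integer grid
    return x_q * scale                                          # dequantized activation
```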
Code:
```python
import torch
from whisperplus.pipelines.whisper import SpeechToTextPipeline
from transformers import HqqConfig

audio_path = "test.mp3"
q4_config = {'nbits': 4, 'group_size': 64, 'quant_zero': False, 'quant_scale': False}
q3_config = {'nbits': 3, 'group_size': 32, 'quant_zero': False, 'quant_scale': False}
quant_config = HqqConfig(dynamic_config={...
```
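Since the snippet cuts off at `dynamic_config`, a hedged sketch of how it is typically filled in: a dict mapping layer-name tags to per-layer quantization dicts, so attention and MLP projections can use different bit-widths. The tag names below are illustrative for a Llama-style model, not taken from the truncated issue.

```python
from transformers import HqqConfig

q4_config = {'nbits': 4, 'group_size': 64, 'quant_zero': False, 'quant_scale': False}
q3_config = {'nbits': 3, 'group_size': 32, 'quant_zero': False, 'quant_scale': False}

# Illustrative layer tags: 4-bit attention, 3-bit MLP.
quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,
    'mlp.gate_proj': q3_config,
    'mlp.up_proj': q3_config,
    'mlp.down_proj': q3_config,
})
```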
I got the model from https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib, installed HQQ according to the instructions, and tried running the sample given on the HF page. After downloading the model, execution fails with a CUDA error:
```
Traceback...
```
Is there an easy way to convert GGUF to HQQ and vice versa? Any comparisons? https://github.com/leafspark/AutoGGUF