hqq
Official implementation of Half-Quadratic Quantization (HQQ)
So far, HQQ multi-GPU support is only available through the `quantize_model` call.
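For reference, a minimal sketch of what that entry point looks like; the list-of-devices argument follows my reading of the multi-GPU note above, and the model id is just a placeholder, so verify both against the README:

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

# Load the full-precision model first (placeholder model id).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Multi-GPU quantization: pass a list of devices (assumed form of the
# multi-GPU support). Other entry points, e.g. `from_quantized`, do not
# take a device list yet.
AutoHQQHFModel.quantize_model(
    model,
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device=["cuda:0", "cuda:1"],
)
```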
Why is it so slow? A 34B model with 1-bit + LoRA generates only about 1 token/s.
I installed BitBLAS with `pip install bitblas`. However, running an example shows the following warning: "Warning: failed to import the BitBlas backend. Check if BitBlas is correctly installed if you...
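A quick way to narrow this down is to import BitBLAS directly, outside HQQ: if the bare import fails (a CUDA/toolchain mismatch is a common cause), the warning comes from the BitBLAS install rather than from HQQ. The `prepare_for_inference` call in the comment is my understanding of where the backend gets selected; check the README for the exact API.

```python
# Check that BitBLAS imports on its own before blaming HQQ.
try:
    import bitblas  # noqa: F401
    print("BitBLAS imported OK")
except Exception as e:
    print("BitBLAS import failed:", e)

# Assumed HQQ usage (verify against the README): the backend is chosen
# when patching the quantized model for inference.
# from hqq.utils.patching import prepare_for_inference
# prepare_for_inference(model, backend="bitblas")
```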
I'm trying to quantize 405B, but then I'm unable to upload it to HF, since it's ~200GB and HF LFS has a 50GB limit per file. Is there a...
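One workaround sketch, assuming the checkpoint serializes through `save_pretrained`: shard it well below the LFS cap before uploading. `max_shard_size` is a standard `save_pretrained` argument, but whether an HQQ-quantized 405B model round-trips through this path is an assumption (HQQ also ships its own save/load helpers), and the repo id below is hypothetical.

```python
from huggingface_hub import HfApi

# Split the checkpoint into files well under the 50GB LFS limit.
# `model` here is the already-quantized model from earlier steps.
model.save_pretrained("llama-405b-hqq", max_shard_size="20GB")

# Upload the sharded folder to the Hub.
api = HfApi()
api.upload_folder(
    folder_path="llama-405b-hqq",
    repo_id="your-username/llama-405b-hqq",  # hypothetical repo id
    repo_type="model",
)
```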
Thanks for your work. May I ask when you expect to implement the `.cpu()` method of `HQQLinear`? Or could you briefly describe how to implement it? I can implement...
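For anyone experimenting in the meantime, a rough sketch of what such a method might do, assuming `HQQLinear` keeps its packed quantized weight in `W_q` and its quantization state (scale, zero point, shapes) in a `meta` dict; those attribute names follow my reading of the source and should be double-checked.

```python
import torch

def hqq_linear_to_cpu(layer):
    # Move the packed quantized weight (attribute name assumed).
    layer.W_q = torch.nn.Parameter(layer.W_q.data.cpu(), requires_grad=False)
    # Move every tensor stored in the quantization metadata.
    for key, value in layer.meta.items():
        if isinstance(value, torch.Tensor):
            layer.meta[key] = value.cpu()
    # Move the bias, if present.
    if layer.bias is not None:
        layer.bias = torch.nn.Parameter(layer.bias.data.cpu(), requires_grad=False)
    layer.device = "cpu"  # assumed bookkeeping attribute
    return layer
```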
Because `quant_config` is gone when you load a model using `from_quantized`, I tried to re-add the `quant_config` here, so that when we call `prepare_for_inference` on a loaded quantized model, it will not...
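A hedged sketch of the kind of workaround being described: walk the loaded model and re-attach a `quant_config` to every `HQQLinear` that lost it. The `quant_config` attribute name is an assumption about HQQ internals, not a confirmed API.

```python
from hqq.core.quantize import HQQLinear

def reattach_quant_config(model, quant_config):
    # Re-add the config that `from_quantized` dropped, so a later
    # `prepare_for_inference` call can find it (attribute name assumed).
    for module in model.modules():
        if isinstance(module, HQQLinear) and getattr(module, "quant_config", None) is None:
            module.quant_config = quant_config
```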
Can activation quantization also be introduced in HQQ? If not, is there any process/method that can further quantize the activations after using HQQ to quantize the weights?
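HQQ targets weights; for context, here is a generic illustration (not an HQQ feature) of dynamic per-tensor activation quantization, which simulates rounding an activation to an `nbits` integer grid on the fly and dequantizing it before the matmul.

```python
import torch

def quant_dequant_activation(x: torch.Tensor, nbits: int = 8) -> torch.Tensor:
    qmax = 2 ** (nbits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax               # dynamic per-tensor scale
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)  # integer grid
    return x_q * scale                                          # dequantized activation
```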
Code:
```python
import torch
from whisperplus.pipelines.whisper import SpeechToTextPipeline
from transformers import HqqConfig

audio_path = "test.mp3"
q4_config = {'nbits': 4, 'group_size': 64, 'quant_zero': False, 'quant_scale': False}
q3_config = {'nbits': 3, 'group_size': 32, 'quant_zero': False, 'quant_scale': False}
quant_config = HqqConfig(dynamic_config={...
```
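Since the snippet cuts off at `dynamic_config`, a hedged sketch of how it is typically filled in: a dict mapping layer-name tags to per-layer quantization dicts, so attention and MLP projections can use different bit-widths. The tag names below are illustrative for a Llama-style model, not taken from the truncated issue.

```python
from transformers import HqqConfig

q4_config = {'nbits': 4, 'group_size': 64, 'quant_zero': False, 'quant_scale': False}
q3_config = {'nbits': 3, 'group_size': 32, 'quant_zero': False, 'quant_scale': False}

# Illustrative layer tags: 4-bit attention, 3-bit MLP.
quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,
    'mlp.gate_proj': q3_config,
    'mlp.up_proj': q3_config,
    'mlp.down_proj': q3_config,
})
```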
I got the model from https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib, installed HQQ according to the instructions, and tried running the sample given on the HF page. After downloading the model, execution fails with a CUDA error:
```
Traceback...
```
Is there an easy way to convert GGUF to HQQ and vice versa? Any comparisons? https://github.com/leafspark/AutoGGUF