mobicham
@minhthuc2502 yes, it's not performing the optimization (https://github.com/mobiusml/hqq/blob/master/hqq/core/quantize.py#L115-L122); the actual hqq algo is here: https://github.com/mobiusml/hqq/blob/master/hqq/core/optimize.py#L194-L243. Basically, you get an initial estimate of the quantized weights/scale/zero (which is what you did), ...
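For reference, here's a minimal sketch of what that round-to-nearest initial estimate looks like (group-wise, asymmetric). The function name and shapes are just illustrative, not the hqq code; the actual implementation is in the links above, and the optimizer then refines the zero/scale iteratively:

```Python
import torch

# Illustrative initial estimate only, not the hqq code:
# group-wise asymmetric round-to-nearest quantization.
def initial_estimate(W, nbits=4, group_size=64):
    W_r   = W.reshape(-1, group_size)                             # one scale/zero per group
    w_min = W_r.min(dim=1, keepdim=True)[0]
    w_max = W_r.max(dim=1, keepdim=True)[0]
    scale = (2**nbits - 1) / (w_max - w_min)                      # map the group range to [0, 2^nbits - 1]
    zero  = -w_min * scale                                        # asymmetric zero-point
    W_q   = torch.round(W_r * scale + zero).clamp(0, 2**nbits - 1)
    return W_q, scale, zero                                       # hqq's optimizer then refines zero (and scale)
```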
Glad to hear it worked better! It should work fine with 4-bit and a group-size of 64 as suggested in the code above. Which model did you try the summarization...
I tried with Llama2-7B and it's working fine:

```Python
import torch, os

cache_path    = '.'
model_id      = "meta-llama/Llama-2-7b-chat-hf"
compute_dtype = torch.bfloat16  # int4 kernel only works with bfloat16
device        = 'cuda:0'
...
```
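For reference, here is roughly the same setup through the transformers HQQ integration (a sketch, assuming a recent transformers version with hqq installed; the 4-bit / group-size 64 settings match the suggestion above):

```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id     = "meta-llama/Llama-2-7b-chat-hf"
quant_config = HqqConfig(nbits=4, group_size=64)  # 4-bit, group-size 64

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # int4 kernel only works with bfloat16
    device_map="cuda:0",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```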
There's an ongoing pull request for sharded safetensors serialization: https://github.com/huggingface/transformers/pull/32379 Once it's merged, it will be possible to save hqq-quantized models directly via `model.save_pretrained` as sharded safetensors
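Once that PR lands, saving should just be the standard transformers call (sketch; the output path is illustrative):

```Python
# Should produce sharded .safetensors files once the PR above is merged
model.save_pretrained("llama2-7b-hqq-4bit", safe_serialization=True, max_shard_size="5GB")
```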
Closing this since we are very close to full transformers serialization support here: https://github.com/huggingface/transformers/pull/33141
Technically possible, but hqq is for asymmetric quantization, not symmetric, and the available kernels like BitBLAS only support int8 activations as far as I know, which can only be used...
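For context, the difference in a nutshell (illustrative round-to-nearest sketch, not the hqq code):

```Python
import torch

x = torch.randn(4, 64)

# Symmetric: scale only, zero-point fixed at 0, range [-2^(b-1), 2^(b-1) - 1]
scale_sym = 127. / x.abs().max()
q_sym = torch.round(x * scale_sym).clamp(-128, 127)

# Asymmetric (what hqq uses for the weights): scale + zero-point, range [0, 2^b - 1]
nbits = 4
scale_asym = (2**nbits - 1) / (x.max() - x.min())
zero = -x.min() * scale_asym
q_asym = torch.round(x * scale_asym + zero).clamp(0, 2**nbits - 1)
```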
To quantize the activations, you can simply do some dynamic quantization like:

```Python
# Quantize
axis = 1
x_scale = 127. / x.abs().amax(axis=axis, keepdim=True)
x_int8 = (x * x_scale).to(torch.int8)

# Dequantize
...
```
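Filling in the truncated dequantize step as a self-contained round-trip (the dequantize line is just the obvious inverse, an assumption on my part):

```Python
import torch

x = torch.randn(8, 128)  # example activations

# Quantize (dynamic per-row int8, axis=1 as above)
axis = 1
x_scale = 127. / x.abs().amax(axis=axis, keepdim=True)
x_int8  = (x * x_scale).to(torch.int8)

# Dequantize (assumed inverse: cast back and divide by the same scale)
x_dq = x_int8.to(x.dtype) / x_scale

print((x - x_dq).abs().max())  # quantization error
```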
I haven't seen that yet, it's a bit too extreme. W4A4 like QuaRot seems to work; lower than that, maybe W3A4 could work with QA. By the way, for A8W4...
Yes, it's the compute precision: if the inputs are float16, `compute_dtype` should be float16 as well; the same applies to float32 and bfloat16.
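For example (a sketch, assuming the hqq API roughly as shown in its README; `BaseQuantizeConfig`/`HQQLinear` and their arguments are taken from there):

```Python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 4096, bias=False).half().cuda()  # float16 inputs -> float16 compute
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# compute_dtype matches the input dtype (float16 here)
hqq_layer = HQQLinear(linear, quant_config=quant_config,
                      compute_dtype=torch.float16, device='cuda:0')
```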
Strange! Are you able to import bitblas in Python?
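A quick way to check (assuming the package exposes `__version__`):

```Python
import bitblas            # should not raise ImportError
print(bitblas.__version__)
```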