mobicham
@minhthuc2502 yes, it's not performing the optimization (https://github.com/mobiusml/hqq/blob/master/hqq/core/quantize.py#L115-L122); the actual hqq algo is here: https://github.com/mobiusml/hqq/blob/master/hqq/core/optimize.py#L194-L243. Basically, you get an initial estimate of the quantized weights/scale/zero (which is what you did), ...
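For reference, here's a minimal sketch of what that round-to-nearest initial estimate looks like (group-wise, asymmetric). The function name and shapes are just illustrative, not the hqq code; the actual implementation is in the links above, and the optimizer then refines the zero/scale iteratively:

```Python
import torch

# Illustrative initial estimate only, not the hqq code:
# group-wise asymmetric round-to-nearest quantization.
def initial_estimate(W, nbits=4, group_size=64):
    W_r   = W.reshape(-1, group_size)                             # one scale/zero per group
    w_min = W_r.min(dim=1, keepdim=True)[0]
    w_max = W_r.max(dim=1, keepdim=True)[0]
    scale = (2**nbits - 1) / (w_max - w_min)                      # map the group range to [0, 2^nbits - 1]
    zero  = -w_min * scale                                        # asymmetric zero-point
    W_q   = torch.round(W_r * scale + zero).clamp(0, 2**nbits - 1)
    return W_q, scale, zero                                       # hqq's optimizer then refines zero (and scale)
```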
Glad to hear it worked better! It should work fine with 4-bit and a group-size of 64 as suggested in the code above. Which model did you try the summarization...
I tried with Llama2-7B and it's working fine:

```Python
import torch, os

cache_path    = '.'
model_id      = "meta-llama/Llama-2-7b-chat-hf"
compute_dtype = torch.bfloat16  # int4 kernel only works with bfloat16
device        = 'cuda:0'
...
```
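For reference, here is roughly the same setup through the transformers HQQ integration (a sketch, assuming a recent transformers version with hqq installed; the 4-bit / group-size 64 settings match the suggestion above):

```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id     = "meta-llama/Llama-2-7b-chat-hf"
quant_config = HqqConfig(nbits=4, group_size=64)  # 4-bit, group-size 64

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # int4 kernel only works with bfloat16
    device_map="cuda:0",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```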
There's an ongoing pull request for sharded safetensors serialization: https://github.com/huggingface/transformers/pull/32379 Once it's merged, it will be possible to save hqq-quantized models directly via `model.save_pretrained` as sharded safetensors
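Once that PR lands, saving should just be the standard transformers call (sketch; the output path is illustrative):

```Python
# Should produce sharded .safetensors files once the PR above is merged
model.save_pretrained("llama2-7b-hqq-4bit", safe_serialization=True, max_shard_size="5GB")
```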
Closing this since we are very close to full transformers serialization support here: https://github.com/huggingface/transformers/pull/33141
Technically possible, but hqq is for asymmetric quantization, not symmetric, and the available kernels like BitBLAS only support int8 activations as far as I know, which can only be used...
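For context, the difference in a nutshell (illustrative round-to-nearest sketch, not the hqq code):

```Python
import torch

x = torch.randn(4, 64)

# Symmetric: scale only, zero-point fixed at 0, range [-2^(b-1), 2^(b-1) - 1]
scale_sym = 127. / x.abs().max()
q_sym = torch.round(x * scale_sym).clamp(-128, 127)

# Asymmetric (what hqq uses for the weights): scale + zero-point, range [0, 2^b - 1]
nbits = 4
scale_asym = (2**nbits - 1) / (x.max() - x.min())
zero = -x.min() * scale_asym
q_asym = torch.round(x * scale_asym + zero).clamp(0, 2**nbits - 1)
```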
To quantize the activations, you can simply do some dynamic quantization like:

```Python
# Quantize
axis = 1
x_scale = 127. / x.abs().amax(axis=axis, keepdim=True)
x_int8 = (x * x_scale).to(torch.int8)

# Dequantize
...
```
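Filling in the truncated dequantize step as a self-contained round-trip (the dequantize line is just the obvious inverse, an assumption on my part):

```Python
import torch

x = torch.randn(8, 128)  # example activations

# Quantize (dynamic per-row int8, axis=1 as above)
axis = 1
x_scale = 127. / x.abs().amax(axis=axis, keepdim=True)
x_int8  = (x * x_scale).to(torch.int8)

# Dequantize (assumed inverse: cast back and divide by the same scale)
x_dq = x_int8.to(x.dtype) / x_scale

print((x - x_dq).abs().max())  # quantization error
```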
I haven't seen that yet, it's a bit too extreme. W4A4 like QuaRot seems to work; lower than that, maybe W3A4 could work with QA. By the way, for A8W4...
Yes, it's the compute precision: if the inputs are float16, `compute_dtype` should be float16 as well; the same applies to float32 and bfloat16.
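For example (a sketch, assuming the hqq API roughly as shown in its README; `BaseQuantizeConfig`/`HQQLinear` and their arguments are taken from there):

```Python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 4096, bias=False).half().cuda()  # float16 inputs -> float16 compute
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# compute_dtype matches the input dtype (float16 here)
hqq_layer = HQQLinear(linear, quant_config=quant_config,
                      compute_dtype=torch.float16, device='cuda:0')
```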
Strange! Are you able to import bitblas in Python?
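A quick way to check (assuming the package exposes `__version__`):

```Python
import bitblas            # should not raise ImportError
print(bitblas.__version__)
```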