
Static Activation Quantization and Mixed-Precision Quantization Incompatibility

Open aptsunny opened this issue 9 months ago • 6 comments

When using static quantization for activations, we ran into an issue where mixed-precision quantization fails because of the order in which activation parameters are registered. This incompatibility breaks the expected behavior of mixed-precision quantization and leads to incorrect results.
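
To illustrate, the failure shows up with a setup roughly like the one below. This is only an illustrative sketch; the exact keys (static, mix_bits) and layer names are assumptions, not necessarily the current llmc schema:

```yaml
# Illustrative sketch only: static activation quantization combined with
# a per-layer mixed-precision override. Key names and layer names are
# assumptions, not the exact llmc schema.
quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_tensor
        static: True                 # static activation quantization
    mix_bits:                        # per-layer precision override (since removed)
        setting_0:
            layer_name: [down_proj]  # hypothetical layer selection
            weight:
                bit: 16
            act:
                bit: 16
```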


aptsunny • Apr 11 '25 03:04

@aptsunny , I assume you encountered the same issue I did a few months ago. Please see the bug report I opened in https://github.com/ModelTC/llmc/issues/163. I also posted my fix proposal in a forked repo (https://github.com/sasha-hailo/llmc/tree/main_hailo_share). Hope you find it helpful.

sasha-hailo • Apr 14 '25 21:04


I really appreciate your input on this issue. Your solution is spot-on, and I’ll work on getting it implemented.

aptsunny • Apr 16 '25 02:04

https://github.com/ModelTC/llmc/blob/b0bf39e96a0ce44f74ec9a42729c09f6cd6f893e/configs/quantization/methods/MixPrecision/rtn_w_a_static.yml#L37

gushiqiao • May 07 '25 08:05

@gushiqiao, thank you for the update. Is my understanding correct that this currently supports keeping selected layers in full precision, but not the originally intended granularity of assigning any quantization precision to any layer?

sasha-hailo • May 07 '25 08:05

Referenced snippet: llmc/configs/quantization/methods/MixPrecision/rtn_w_a_static.yml, line 37 in b0bf39e (ignored_layers:)

Hi, I noticed that the latest implementation has removed the mix_bits-related functionality. As a result, it seems difficult to run mixed-precision quantization experiments where different layers are quantized to 8-bit and 16-bit respectively. Could you please explain the reason for this change, and whether there is an alternative way to achieve such mixed-precision quantization now?

aptsunny • May 07 '25 08:05

This setting is deployment-friendly. The previous code structure was somewhat messy, so for now we've opted for simplified support of 8-bit and 16-bit mixed precision. In theory, all methods in LLMC should work with this setup, whether using static or dynamic quantization.
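
For example, a config roughly like the one below should cover the common 8/16-bit case. The layer names are placeholders and the exact key placement may differ; see the linked rtn_w_a_static.yml for the actual schema:

```yaml
# Sketch of the simplified mixed-precision setup: quantize weights and
# activations to 8-bit and keep the listed layers at higher precision.
# Layer names are placeholders; check rtn_w_a_static.yml for exact keys.
quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_tensor
        static: True
    ignored_layers: [lm_head, model.layers.0.mlp.down_proj]  # kept at full precision
```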

gushiqiao • May 07 '25 09:05