Static Activation Quantization and Mixed-Precision Quantization Incompatibility
When using static quantization for activations, mixed-precision quantization fails because of the order in which activation parameters are registered. This incompatibility disrupts the expected behavior of mixed-precision quantization, leading to incorrect results or other unintended behavior.
@aptsunny , I assume you encountered the same issue I did a few months ago. Please see the bug report I opened in https://github.com/ModelTC/llmc/issues/163. I also posted my fix proposal in a forked repo (https://github.com/sasha-hailo/llmc/tree/main_hailo_share). Hope you find it helpful.
I really appreciate your input on this issue. Your solution is spot-on, and I’ll work on getting it implemented.
https://github.com/ModelTC/llmc/blob/b0bf39e96a0ce44f74ec9a42729c09f6cd6f893e/configs/quantization/methods/MixPrecision/rtn_w_a_static.yml#L37
@gushiqiao , thank you for the update. Is my understanding correct that this currently supports keeping selected layers in full precision, but not the originally intended granularity of assigning an arbitrary quantization precision to any layer?
Hi, I noticed that the latest implementation has removed the mix_bits-related functionality. As a result, it seems difficult to run mixed-precision quantization experiments with some layers quantized to 8-bit and others to 16-bit. Could you please explain the reason for this change and whether there is an alternative way to achieve such mixed-precision quantization now?
This setting is deployment-friendly. The previous code structure was somewhat messy, so for now we've opted for simplified support of 8-bit and 16-bit mixed precision. In theory, all methods in LLMC should work with this setup, whether static or dynamic quantization is used.
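For reference, a minimal sketch of how this simplified 8-bit/16-bit mixed precision might be expressed in the config linked above. Only the `ignored_layers` key is taken from the linked `rtn_w_a_static.yml` (line 37); the layer names listed under it are hypothetical placeholders, and the exact value format may differ in the actual file.

```yaml
# Hypothetical sketch: layers listed under ignored_layers are assumed to be
# kept in full/higher precision (e.g. 16-bit), while all remaining layers are
# quantized according to the rest of rtn_w_a_static.yml (e.g. 8-bit).
ignored_layers:
  - model.layers.0.self_attn.q_proj   # placeholder layer name
  - model.layers.0.mlp.down_proj      # placeholder layer name
```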