BUG: Mixed-precision configuration not working with STATIC quantization
Dear LLMC team,
I've been trying to run mixed-precision PTQ using RTN.
I suspect there's a bug, as the non-default settings in `mix_bits` are ignored.
My understanding of the code:
- In the method `get_act_qparams()` of `rtn.py`, the values of `qmax`/`qmin`/`scales`/`zeros` are determined using the default quantizer bit precision.
- These values are registered as `buf_act_<xxx>` buffers for all modules/layers.
- At inference time, in the method `a_qdq()` of `rtn.py`, even though the `aquantizer` object of each layer is configured correctly, it blindly loads the registered quantization parameters `qmin`/`qmax`/`scales`/`zeros` from the buffer and uses them, instead of the values matching its actual configuration.
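The failure mode described above can be reproduced in isolation. Here's a standalone sketch in plain Python (the layer names, activation ranges, and the `qparams` helper are made up for illustration; this is not LLMC code) showing how parameters registered by a single default-bit quantizer end up wrong for a layer configured with a different bit width:

```python
# Standalone sketch of the suspected failure mode (NOT LLMC code;
# layer names, ranges, and the helper below are made up for illustration).

def qparams(xmin, xmax, bits):
    """Asymmetric quantization parameters for a given bit width."""
    qmax = 2 ** bits - 1
    scale = (xmax - xmin) / qmax
    zero = round(-xmin / scale)
    return scale, zero

# Per-layer mixed-precision config (hypothetical): layer1 should be 4-bit.
layer_bits = {"layer0": 8, "layer1": 4}

# Calibration stage: ONE default 8-bit quantizer registers ALL buffers.
default_bits = 8
buffers = {name: qparams(-6.0, 6.0, default_bits) for name in layer_bits}

# Inference stage: layer1 should use 4-bit params, but loads the buffer.
scale_used, _ = buffers["layer1"]
scale_correct, _ = qparams(-6.0, 6.0, layer_bits["layer1"])
print(scale_used != scale_correct)  # True: the registered scale ignores mix_bits
```

The correctly-configured per-layer quantizer is irrelevant here: whatever its bit width, the scale/zero it ends up using were frozen at calibration with the default precision.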
What do you think? Thanks in advance!
There's no `get_act_qparams()` in `rtn.py`. You can print the bit width of each linear layer to check the code.
PS: This functionality hasn't been updated in a long time. If you confirm there's a bug, please feel free to contact me anytime.
Hi @Harahan,
Thank you for your response.
It turns out that a lot of changes have been made since my issue report (in this commit).
The functionality I was referring to as `get_act_qparams()` now resides in `register_act_qparams()` in the file `base_blockwise_quantization.py`.
The bug, unfortunately, persists.
The "mechanism" is the same: `register_act_qparams()` uses a single quantizer object (`self.aquantizer`), configured with the default settings, to determine the quantization parameters of all layers. It computes the scale and zero-point settings (w.r.t. the incorrect bit width) and registers them via `buf_act_scales` / `buf_act_zeros`.
Note that the correct per-layer quantization configurations are loaded when the `deploy()` function executes, but they have no effect, because they rely on the incorrect scale and zero-point values determined in the previous stage!
To sum it up: I think the core issue causing the [suspected] bug is that the calibration stage, and `register_act_qparams()` in particular, are unaware of the configured mixed precision and work with the default quantization config.
This code probably works fine for dynamic quantization, but not in a static quantization scenario.
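For what it's worth, a fix could go in the direction of choosing the quantization parameters per layer during calibration. Below is a hedged, standalone sketch of that idea; all names are hypothetical (`register_act_qparams` here is a stand-in with an invented signature, not LLMC's actual API):

```python
# Hypothetical sketch of a mix_bits-aware calibration step (not LLMC's API).

class Quantizer:
    def __init__(self, bits):
        self.qmax = 2 ** bits - 1

    def get_qparams(self, xmin, xmax):
        scale = (xmax - xmin) / self.qmax
        zero = round(-xmin / scale)
        return scale, zero

def register_act_qparams(act_ranges, mix_bits, default_bits=8):
    """Register per-layer scales/zeros using each layer's OWN bit width."""
    buffers = {}
    for name, (xmin, xmax) in act_ranges.items():
        bits = mix_bits.get(name, default_bits)  # honor the mixed-precision config
        buffers[name] = Quantizer(bits).get_qparams(xmin, xmax)
    return buffers

# Example: down_proj is configured to stay at 16 bits, the rest default to 8.
buffers = register_act_qparams(
    act_ranges={"q_proj": (-6.0, 6.0), "down_proj": (-6.0, 6.0)},
    mix_bits={"down_proj": 16},
)
```

The key point is only the `mix_bits.get(...)` lookup: the registered buffers then already match what `deploy()` later configures per layer.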
I also suspect that the same issue can happen with other quantization methods.
Can you please look into it? Thanks in advance!
P.S.
An unrelated question:
I also noticed that the commit I mentioned above added some limited support for additional quantization granularities, via the functions `get_matmul_in_block()`, `get_softmax_in_block()`, and `get_act_fn_in_block()`.
Do you plan to extend this support to the more common LLM models like Qwen & Llama?
(This could be really cool)
It depends on whether we encounter such a need or whether it will be used in our research. So, not sure.
Did you succeed in reproducing the mix_bits problem I reported?
I believe the issue should be reopened as a bug...
I'm sorry, but we do not have enough time to do this. If you are sure there's a bug, post the log/evidence and reopen the issue.
LLMC_RTN_W8A8_MixedA16_Bug.txt LLMC_RTN_W8A8.txt
I'm pretty sure this is a bug. And I now suspect that the issue affects not only RTN, but nearly any method based on static quantization. Can you please reopen the issue? I don't think I have the permissions for this.
Please find attached two logs of LLMC with an RTN configuration: one without `mix_bits`, the other with `mix_bits`. If you compare the two files, you can see that:
- The outputs of both runs are identical (same PPL score), hinting that the `mix_bits` configuration had no effect.
- The `mix_bits` configuration of the deployed model is correct (see circa line 2458 in the log) ==> the bug is not at the deployment stage, but at the calibration stage (see my explanation in earlier messages).
I've reopened the issue. Since we currently don't have a requirement for static quantization, the bug may not be fixed for a long time. You'd best try other settings.
Hi,
I wanted to start using this library for a couple of things, but just to confirm: this bug affects situations where static quantization is applied layer-wise (with the intention of having different layers/components at different bit widths)?
Can you confirm that it does not apply when I want more or less the same bit width for all components of the model, or different bit widths only for weights vs. activations?
To the best of my understanding, if the quantization configuration is the same for all layers of the model, the bug does not apply.
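To illustrate why with a toy check (a standalone sketch, not LLMC code): when every layer uses the default bit width, the parameters computed by a single default-configured quantizer coincide with the correct per-layer ones, so the bug is invisible; a `mix_bits` override breaks that coincidence.

```python
# Toy check (not LLMC code): the bug only surfaces when some layer's
# configured bits differ from the default used at calibration.

def scale_for(bits, xmin=-6.0, xmax=6.0):
    return (xmax - xmin) / (2 ** bits - 1)

default_bits = 8
uniform_cfg = {"layer0": 8, "layer1": 8}  # same bits everywhere
mixed_cfg = {"layer0": 8, "layer1": 4}    # mix_bits override on layer1

uniform_ok = all(scale_for(default_bits) == scale_for(b)
                 for b in uniform_cfg.values())
mixed_ok = all(scale_for(default_bits) == scale_for(b)
               for b in mixed_cfg.values())
print(uniform_ok, mixed_ok)  # True False
```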
Hi @Harahan, LLMC folks, I wanted to let you know that I have fixed the bug in my side branch. In addition, I also added support for separately configurable quantization of activation outputs [currently, only for linear layers].
If you're interested in any of these, please let me know, and I'll be glad to share my code or open a PR. Note, though, that the changes are quite extensive and will require time and commitment from your side to review.
Hi, @sasha-hailo, I was using smoothquant's mixed bits quantization and ran into the same bug. How did you solve it? Please share your code.
@AaronMaYue , I apologize for the late response. If it's still relevant, please let me know and I'll clean up my code for sharing.
Yeah, it's still relevant, and the mixed-precision problem remains, too. Looking forward to your sharing.
@AaronMaYue , I organized and pushed my modifications to https://github.com/sasha-hailo/llmc/tree/main_hailo_share. Note: I never tested it with SmoothQuant. Hope it works for you.
@Harahan , @gushiqiao - would you like to consider using my code to fix the bug (as well as many additional issues)? I'll be glad to assist, if needed.
Hi @sasha-hailo, thank you very much for your work, your code works 😊. I tested it on both SmoothQuant and RTN, and the bug is fixed.
https://github.com/ModelTC/llmc/blob/b0bf39e96a0ce44f74ec9a42729c09f6cd6f893e/configs/quantization/methods/MixPrecision/rtn_w_a_static.yml#L37