
BUG: Mixed-precision configuration not working with STATIC quantization

Open sasha-hailo opened this issue 1 year ago • 17 comments

Dear LLMC team, I've been trying to run mixed-precision PTQ quantization using RTN. I suspect there's a bug, as the non-default settings in mix_bits are ignored.

My understanding of the code:

  • In method get_act_qparams() of rtn.py, the values of qmax / qmin / scales / zeros are determined using the default quantizer's bit precision.
  • These values are registered as buf_act_<xxx> buffers for all modules / layers.
  • At inference time, in method a_qdq() of rtn.py, although the aquantizer object of each layer is configured correctly, it blindly loads the registered quantization parameters (qmin / qmax / scales / zeros) from the buffers and uses them, instead of the values matching its actual configuration.
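If this reading is right, the suspected flow can be sketched roughly as follows (illustrative code with simplified names and made-up ranges, not the actual rtn.py implementation):

```python
class Quantizer:
    """Asymmetric integer quantizer (simplified for illustration)."""
    def __init__(self, bits):
        self.qmin, self.qmax = 0, 2 ** bits - 1

    def get_qparams(self, amin, amax):
        scale = (amax - amin) / (self.qmax - self.qmin)
        zero = self.qmin - round(amin / scale)
        return scale, zero

default_q = Quantizer(bits=8)               # the single default quantizer
layer_bits = {"layer0": 8, "layer1": 16}    # what mix_bits intends

# Calibration: qparams are computed with the DEFAULT quantizer for every
# layer, so layer1's buffered scale/zero reflect 8 bits rather than 16.
buf_act = {name: default_q.get_qparams(-1.0, 1.0) for name in layer_bits}

# a_qdq() then loads buf_act blindly, so both layers quantize with
# identical 8-bit parameters despite the mix_bits config.
```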

What do you think? Thanks in advance!

sasha-hailo avatar Oct 27 '24 22:10 sasha-hailo

There's no get_act_qparams() in rtn.py. You can print the bit-width of each linear to check the code.

PS: This function hasn't been updated in a long time. If you confirm there's a bug, please feel free to contact me anytime.
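A check along those lines might look like this (a sketch with stand-in classes; the actual module and attribute names in llmc may differ, e.g. whether the activation quantizer is exposed as `aquantizer` with a `bit` field):

```python
# Stand-ins for llmc's quantized modules; on a real model you would
# iterate model.named_modules() instead of a plain dict.
class FakeQuantizer:
    def __init__(self, bit):
        self.bit = bit

class FakeLinear:
    def __init__(self, bit):
        self.aquantizer = FakeQuantizer(bit)

model = {
    "model.layers.0.self_attn.q_proj": FakeLinear(8),
    "model.layers.1.self_attn.q_proj": FakeLinear(16),  # mix_bits override
}

# Print the configured activation bit-width of each quantized linear.
bits = {}
for name, module in model.items():
    if hasattr(module, "aquantizer"):
        bits[name] = module.aquantizer.bit
        print(f"{name}: a{bits[name]}")
```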

Harahan avatar Nov 01 '24 15:11 Harahan

Hi @Harahan, Thank you for your response. It turns out that a lot of changes have been made since my issue report (in this commit). The functionality I was referring to as get_act_qparams() now resides in register_act_qparams() in file base_blockwise_quantization.py.

The bug, unfortunately, persists.

The "mechanism" is the same: register_act_qparams() uses a single quantizer object (self.aquantizer), configured with the default settings, to determine the quantization parameters of all layers. It therefore computes the scale & zero-point values with respect to the wrong bit width, and registers them via buf_act_scales / buf_act_zeros.

Note that the correct per-layer quantization configurations are loaded when executing the deploy() function, but they have no effect, because they rely on the incorrect scale & zero-point values determined in the previous stage!

To sum up: I think the core issue behind the [suspected] bug is that the calibration stage and register_act_qparams() are unaware of the configured mixed precision and work with the default quantization config. This code probably works fine for dynamic quantization, but not in a static quantization scenario. I also suspect the same issue can affect other quantization methods.
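One possible fix direction, sketched here with hypothetical names and plain dicts (the real register_act_qparams() works on module buffers and a different signature): look up each layer's own bit width during calibration instead of reusing the default quantizer.

```python
def qparams(bits, amin, amax):
    """Static asymmetric qparams for a calibrated activation range."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (amax - amin) / (qmax - qmin)
    zero = qmin - round(amin / scale)
    return scale, zero

def register_act_qparams(layer_bits, act_ranges, default_bits=8):
    """Compute per-layer static qparams, honoring mix_bits overrides."""
    buf_act = {}
    for name, (amin, amax) in act_ranges.items():
        bits = layer_bits.get(name, default_bits)  # per-layer, not default
        buf_act[name] = qparams(bits, amin, amax)
    return buf_act

ranges = {"layer0": (-1.0, 1.0), "layer1": (-1.0, 1.0)}
buf = register_act_qparams({"layer1": 16}, ranges)
```

With a change of this shape, layer1's buffered scale would be computed for 16 bits, so the per-layer configuration loaded at deploy() time would finally see matching parameters.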

Can you please look into it? Thanks in advance!

sasha-hailo avatar Nov 04 '24 14:11 sasha-hailo

P.S. An unrelated question: I also noticed that the commit I mentioned above added some limited support for additional quantization granularity, via the functions get_matmul_in_block(), get_softmax_in_block(), get_act_fn_in_block(). Do you plan to extend this support to the more common LLM models like Qwen & Llama? (This could be really cool)

sasha-hailo avatar Nov 04 '24 14:11 sasha-hailo

It depends on whether we encounter such a need or whether it will be used in our research. So, not sure.

Harahan avatar Nov 04 '24 16:11 Harahan

Did you succeed in reproducing the mix_bits problem I reported? I believe the issue should be reopened as a bug...

sasha-hailo avatar Nov 05 '24 08:11 sasha-hailo

I'm sorry, but we do not have enough time to do this. If you are sure there's a bug, post the log/evidence and reopen the issue.

Harahan avatar Nov 05 '24 09:11 Harahan

LLMC_RTN_W8A8_MixedA16_Bug.txt LLMC_RTN_W8A8.txt

I'm pretty sure this is a bug. And I now suspect that the issue affects not only RTN, but nearly any method based on static quantization. Can you please reopen the issue? I don't think I have the permissions for this.

Please find attached two logs from LLMC runs with an RTN configuration: one without mix_bits, the other with it. If you compare the two files, you can see that

  • The outputs of both runs are identical (same PPL score), hinting that the mix_bits configuration had no effect.
  • The mix_bits configuration of the deployed model is correct (see circa line 2458 in the log) ==> the bug is not at the deployment stage, but at the calibration stage (see my explanation in earlier messages).
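For intuition on why an identical PPL is suspicious: with static per-tensor quantization, the step size depends directly on the bit width, so an A16 layer should quantize on a much finer grid than an A8 layer over the same calibrated range (illustrative arithmetic, not llmc code):

```python
# Step size (scale) for the same calibrated activation range at 8 vs 16 bits.
amin, amax = -4.0, 4.0                     # example calibrated range
scale_a8 = (amax - amin) / (2 ** 8 - 1)    # ~0.0314
scale_a16 = (amax - amin) / (2 ** 16 - 1)  # ~0.000122
print(scale_a8, scale_a16)
```

If mix_bits were taking effect, the A16 layers would use the much smaller scale and the end-to-end PPL would almost certainly shift.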

sasha-hailo avatar Nov 05 '24 16:11 sasha-hailo

I've reopened the issue. Since we currently don't have a requirement for static quantization, the bug may not be fixed for a long time. You'd best try other settings.

Harahan avatar Nov 07 '24 20:11 Harahan

Hi,

I wanted to start using this library for a couple of things, but just to confirm: this bug affects situations where static quantization is applied layer-wise (with the intention of having different layers/components at different bit-widths).

Can you confirm that it does not apply when I want more or less the same bit-width for all components of the model, or different bit-widths for activations vs. weights?

nelaturuharsha avatar Nov 29 '24 14:11 nelaturuharsha

To the best of my understanding, if the quantization configuration is the same for all layers of the model, the bug does not apply.

sasha-hailo avatar Dec 01 '24 16:12 sasha-hailo

Hi @Harahan, LLMC folks, I wanted to let you know that I have fixed the bug in my side branch. In addition, I also added support for separately configurable quantization of activation outputs [currently, only for linear layers].

If you're interested in any of these, please let me know, and I'll be glad to share my code or open a PR. Note, though, that the changes are quite extensive and will require time and commitment from your side to review.

sasha-hailo avatar Jan 12 '25 08:01 sasha-hailo

Hi @sasha-hailo, I was using SmoothQuant's mixed-bits quantization and ran into the same bug. How did you solve it? Please share your code.

AaronMaYue avatar Feb 12 '25 03:02 AaronMaYue

@AaronMaYue , I apologize for the late response. If it's still relevant, please let me know and I'll clean up my code for sharing.

sasha-hailo avatar Feb 26 '25 12:02 sasha-hailo

Yeah, it's still relevant. The mixed-precision problem remains, too. Looking forward to your sharing.

AaronMaYue avatar Feb 27 '25 03:02 AaronMaYue

@AaronMaYue , I organized and pushed my modifications to https://github.com/sasha-hailo/llmc/tree/main_hailo_share. Note: I never tested it with SmoothQuant. Hope it works for you.

@Harahan , @gushiqiao - would you like to consider using my code to fix the bug (as well as many additional issues)? I'll be glad to assist, if needed.

sasha-hailo avatar Feb 27 '25 15:02 sasha-hailo

Hi @sasha-hailo, thank you very much for your work, your code works 😊. I tested it on both SmoothQuant and RTN, and the bug is fixed.

AaronMaYue avatar Mar 03 '25 02:03 AaronMaYue

https://github.com/ModelTC/llmc/blob/b0bf39e96a0ce44f74ec9a42729c09f6cd6f893e/configs/quantization/methods/MixPrecision/rtn_w_a_static.yml#L37

gushiqiao avatar May 07 '25 08:05 gushiqiao