[Feature Request] Support for Explicit INT32 Bias Quantization to Improve QNN Equivalency
Description: I am currently working on validation between AIMET and the QNN SDK. I have observed that bias quantization is introducing output differences between the two.
Observations:
AIMET vs QNN: AIMET appears to keep the bias in Float32, whereas QNN quantizes it. This difference, while sometimes small, noticeably impacts deep or highly sensitive models.
QNN Behavior: For certain layers (e.g., GEMM, BN), QNN quantizes bias to INT8 even when INT32 is specified, creating a gap against AIMET's Float32 representation.
Reference: When running quantization through ORT (ONNX Runtime), the outputs match native QNN. This suggests that AIMET is the outlier in how it simulates bias quantization (the sketch below illustrates the kind of gap involved).
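To make the gap concrete, here is a minimal Python sketch of the common INT32 bias-quantization convention (bias_scale = input_scale * weight_scale, symmetric with zero offset). The helper names and scale values are illustrative assumptions, not AIMET or QNN internals; the point is only to show the rounding gap that a Float32 bias simulation never sees.

import numpy as np

def quantize_bias_int32(bias_fp32, input_scale, weight_scale):
    # Common convention: bias scale derived from input and weight scales,
    # symmetric quantization with zero offset (assumption, not QNN source code).
    bias_scale = input_scale * weight_scale
    q = np.clip(np.round(bias_fp32 / bias_scale), -2**31, 2**31 - 1)
    return q.astype(np.int64), bias_scale

def dequantize_bias(q, bias_scale):
    # What an int32-bias simulation would actually feed into the layer
    return q.astype(np.float64) * bias_scale

bias = np.array([0.012345, -0.000678, 0.499999])
q, s = quantize_bias_int32(bias, input_scale=0.02, weight_scale=0.001)
print(bias - dequantize_bias(q, s))  # rounding gap vs. keeping the bias in Float32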
Question/Request: Is there a specific reason why AIMET does not offer the option to quantize bias to INT32? Adding this option would greatly assist in aligning AIMET simulation with QNN/ORT outputs.
Environment:
AIMET Version: AIMET ONNX 2.18
QNN SDK Version: 2.37.1
Please let me know if you need any further information.
Thank you for your support, Saïd
Hi @e-said, this is a good point. AIMET doesn't simulate int32 quantization due to numerical instability, but it allows you to export int32 bias quantization encodings if you choose to.
In aimet < 2.21:
import onnx  # needed for onnx.save below

# Instantiate int32 bias encodings (underscore prefix: internal API in aimet < 2.21)
sim._concretize_int32_bias_quantizers()
# Export ONNX + encodings
sim.export(".", "model")
# Export ONNX with QuantizeLinear/DequantizeLinear nodes
onnx.save(sim.to_onnx_qdq(), "model_qdq.onnx")
From aimet 2.21 (scheduled for mid-to-late December), this has been incorporated into our public export APIs, so you can simply do:
# Export ONNX + encoding
sim.export(".", "model", export_int32_bias=True)
# Export ONNX with Quantize/DequantizeLinear
onnx.save(sim.to_onnx_qdq(export_int32_bias=True), "model_qdq.onnx")
The resulting ONNX model and its encodings file will then contain int32 bias encodings.
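As a rough illustration, a bias entry in the exported encodings might look like the following, shown here as a Python dict. The layer name, field names, and values are assumptions and can differ between encoding-format versions, so check your actual model.encodings output.

# Illustrative shape of an int32 bias entry in the exported encodings
# (hypothetical layer name "conv1.bias"; verify against your own export).
param_encodings = {
    "conv1.bias": [
        {
            "bitwidth": 32,
            "dtype": "int",
            "is_symmetric": "True",
            "offset": 0,
            "scale": 2e-05,  # typically input_scale * weight_scale
        }
    ]
}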
Hi @quic-kyunggeu,
Thanks a lot for the clear reply.
Regarding the numerical instability, I would like to clarify a few points to better understand the constraint and see whether we can find a middle ground:
Scope of Instability: Does this numerical instability primarily concern QAT (where the scales are still being learned and can move, causing the derived bias scale to jump), or does it also apply to PTQ?
Reasoning: In a PTQ/validation context where the input/weight scales are frozen, the bias scale should also be static. If the concern is integer overflow (due to tiny scales), could we perhaps catch and warn rather than disable it entirely (see the sketch after this list)?
Feature Request: Would it be possible to add an optional argument to QuantizationSimModel (e.g., quantize_bias=False by default) to explicitly enable INT32 bias quantization?
Goal: Enabling this would allow us to "simulate what we execute" and trust the AIMET accuracy numbers before moving to QNN, even if it carries a risk of numerical issues in some edge cases.
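To illustrate the "catch and warn" idea from the Reasoning point above, here is a minimal sketch (not AIMET code; the helper name and the bias_scale = input_scale * weight_scale derivation are assumptions) of how an overflow case could be detected and reported instead of disabling int32 bias simulation altogether:

import warnings
import numpy as np

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def simulate_int32_bias_or_warn(bias_fp32, input_scale, weight_scale):
    # Hypothetical helper: fake-quantize the bias to int32 when it fits,
    # otherwise warn and fall back to the Float32 bias.
    bias_scale = input_scale * weight_scale
    q = np.round(bias_fp32 / bias_scale)
    if np.any(q < INT32_MIN) or np.any(q > INT32_MAX):
        warnings.warn(
            "Bias value exceeds the INT32 range with scale "
            f"{float(np.min(bias_scale)):.3e}; keeping Float32 bias."
        )
        return bias_fp32
    return q.astype(np.int64) * bias_scale  # dequantized int32 bias used in simulation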
Thanks again for your support. Saïd
The scope of the instability covers both PTQ and QAT. Integer overflow is one such case, as you've pointed out.
Empirically and historically, AIMET has assumed that int32 bias quantization is lossless, and this assumption has held up for a long time. You aren't the first, and you won't be the last 🙂; quite a few users have raised a similar suspicion that int32 bias quantization was the root cause of sim-to-target deviation, but it turned out to be a non-issue in every case I know of. Although I agree with your design philosophy, it's hard for us to think of a real-world use case for the proposed feature.
That said, I can open an internal ticket for your feature request, but honestly I can't promise any timeline, as it will most likely stay in the backlog until we see a real-world example where int32 bias quantization causes meaningful sim-to-target deviation.