[Quant Tool] Prevent int32 quantized bias from clipping by adjusting the weight's scale
Description
Fixes a scenario in which a bias input quantized to int32 has a scale that is too small. A bias whose scale is smaller than a certain threshold overflows the int32 range when quantized, which significantly decreases accuracy.
Credit to @yihonglyu for finding this issue and the fix.
Motivation and Context
Consider a Convolution node with very small weights and a constant bias input of [5, -4.5].
The QDQ quantizer first computes the following quantization scale for input_0 and weight:
- input_0: scale = 0.5
- weight: scale = 7.843e-11 (really small)
The QDQ quantizer then computes the bias input's scale as follows:
bias_scale = input_0_scale * weight_0_scale = 0.5 * 7.843e-10 = 3.9215686274509805e-11
This bias_scale is too small. Before this PR, the QDQ quantizer would quantize the f32 bias with this bias_scale:
bias_quant = round(bias_f32 / bias_scale) = round([5.0/bias_scale, -4.5/bias_scale]) = [127500000000, -114750000000]
These quantized bias values exceed the range of int32, and so are clipped to [int32.min(), int32.max()], which is very inaccurate.
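To make the failure concrete, here is a minimal NumPy sketch of the pre-PR behavior (the variable names are illustrative, not the quantizer's actual API):

```python
import numpy as np

# Illustrative values from the example above.
input_0_scale = 0.5
weight_0_scale = 7.843e-11  # really small
bias_f32 = np.array([5.0, -4.5], dtype=np.float32)

# The QDQ quantizer derives the bias scale from the input and weight scales.
bias_scale = input_0_scale * weight_0_scale  # ~3.92e-11

# Quantizing with this scale produces values far outside int32's range ...
bias_quant = np.round(bias_f32 / bias_scale)  # ~[1.275e+11, -1.1475e+11]

# ... so they get clipped to [int32.min(), int32.max()] ...
int32_info = np.iinfo(np.int32)
bias_clipped = np.clip(bias_quant, int32_info.min, int32_info.max)

# ... and the dequantized bias is nowhere near the original [5.0, -4.5].
bias_dequant = bias_clipped * bias_scale  # ~[0.0842, -0.0842]
```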
New approach
This PR increases the weight_0_scale by the necessary amount to ensure that bias_scale (which equals weight_0_scale * input_0_scale) is appropriate for the int32 quantization type.
The smallest valid bias scale is given by the normal scale formula:
bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max - int32_min)
Then, we compute the candidate bias scale:
bias_scale_candidate = input_0_scale * weight_0_scale
If the candidate scale is smaller than the smallest valid scale, we increase the weight_0_scale by the necessary ratio:
```python
if bias_scale_candidate < bias_smallest_valid_scale:
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale
```
Then, we recompute the final bias scale:
bias_scale = input_0_scale * weight_0_scale
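Putting these steps together, here is a sketch of the adjustment in plain Python/NumPy (the function name and signature are illustrative; the real logic lives inside the QDQ quantizer):

```python
import numpy as np

def adjust_weight_scale_for_int32_bias(input_0_scale, weight_0_scale, bias_f32):
    """Grow weight_0_scale if needed so the quantized bias fits in int32.

    Sketch of the approach described above; returns (weight_0_scale, bias_scale).
    """
    int32_info = np.iinfo(np.int32)

    # Smallest bias scale that maps the f32 bias range onto the int32 range.
    bias_smallest_valid_scale = (float(bias_f32.max()) - float(bias_f32.min())) / (
        float(int32_info.max) - float(int32_info.min)
    )

    # Candidate scale derived from the input and weight scales.
    bias_scale_candidate = input_0_scale * weight_0_scale

    # If the candidate is too small, grow the weight scale by the needed ratio.
    if bias_scale_candidate < bias_smallest_valid_scale:
        ratio = bias_smallest_valid_scale / bias_scale_candidate
        weight_0_scale = ratio * weight_0_scale

    # Recompute the final bias scale from the (possibly adjusted) weight scale.
    bias_scale = input_0_scale * weight_0_scale
    return weight_0_scale, bias_scale

# With the example values: bias_smallest_valid_scale ~= 9.5 / 4294967295
# ~= 2.21e-9, so ratio ~= 56.4 and the quantized bias lands on the order of
# int32's range instead of ~59x beyond it.
w_scale, b_scale = adjust_weight_scale_for_int32_bias(0.5, 7.843e-11, np.array([5.0, -4.5]))
```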
Impact on accuracy
Here's the above model's quantized output compared to the f32 (ground-truth) output.
- Before PR:
  - f32 model output[0]: 5.0f
  - qdq model output[0]: 0.075
  - SNR: 0.1369 (higher is better)
- After PR:
  - f32 model output[0]: 5.0f
  - qdq model output[0]: 4.992
  - SNR: 55.656 (higher is better)
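For reference, these SNR values are consistent with the standard decibel definition of signal-to-noise ratio; the exact metric is an assumption here, since the PR does not spell it out:

```python
import numpy as np

def snr_db(reference, test):
    """SNR in dB between a reference signal and a test signal (assumed definition)."""
    noise = np.asarray(reference) - np.asarray(test)
    return 10.0 * np.log10(np.sum(np.square(reference)) / np.sum(np.square(noise)))

# For the single element shown above: snr_db([5.0], [0.075]) ~= 0.13 dB and
# snr_db([5.0], [4.992]) ~= 55.9 dB, close to the reported 0.1369 and 55.656
# (the full model outputs presumably contain more elements).
```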
This PR has been cherry-picked into the rel-1.20.1 branch in PR #22785.