[Quant Tool] Prevent int32 quantized bias from clipping by adjusting the weight's scale
Description
Fixes a scenario in which a bias input quantized to int32 has a scale that is too small. A bias whose scale is smaller than a certain threshold overflows the int32 range when quantized, which significantly decreases accuracy.
Credit to @yihonglyu for finding this issue and the fix.
Motivation and Context
Consider a Convolution node with very small weights and a constant bias input of [5, -4.5].
The QDQ quantizer first computes the following quantization scale for input_0 and weight:
- input_0: scale = 0.5
- weight: scale = 7.843e-11 (really small)
The QDQ quantizer then computes the bias input's scale as follows:
bias_scale = input_0_scale * weight_0_scale = 0.5 * 7.843e-10 = 3.9215686274509805e-11
This bias_scale is too small. Before this PR, the QDQ quantizer would quantize the f32 bias with this bias_scale:
bias_quant = round(bias_f32 / bias_scale) = round([5.0/bias_scale, -4.5/bias_scale]) = [127500000000, -114750000000]
These quantized bias values exceed the range of int32, and so are clipped to [int32.min(), int32.max()], which is very inaccurate.
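To make the failure concrete, here is a minimal NumPy sketch of the pre-PR behavior (the variable names are illustrative, not the quantizer's actual API):

```python
import numpy as np

# Illustrative values from the example above.
input_0_scale = 0.5
weight_0_scale = 7.843e-11  # really small
bias_f32 = np.array([5.0, -4.5], dtype=np.float32)

# The QDQ quantizer derives the bias scale from the input and weight scales.
bias_scale = input_0_scale * weight_0_scale  # ~3.92e-11

# Quantizing with this scale produces values far outside int32's range ...
bias_quant = np.round(bias_f32 / bias_scale)  # ~[1.275e+11, -1.1475e+11]

# ... so they get clipped to [int32.min(), int32.max()] ...
int32_info = np.iinfo(np.int32)
bias_clipped = np.clip(bias_quant, int32_info.min, int32_info.max)

# ... and the dequantized bias is nowhere near the original [5.0, -4.5].
bias_dequant = bias_clipped * bias_scale  # ~[0.0842, -0.0842]
```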
New approach
This PR increases the weight_0_scale by the necessary amount to ensure that bias_scale (which equals weight_0_scale * input_0_scale) is appropriate for the int32 quantization type.
The smallest valid bias scale is given by the normal scale formula:
bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max - int32_min)
Then, we compute the candidate bias scale:
bias_scale_candidate = input_0_scale * weight_0_scale
If the candidate scale is smaller than the smallest valid scale, we increase the weight_0_scale by the necessary ratio:
```python
if bias_scale_candidate < bias_smallest_valid_scale:
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale
```
Then, we recompute the final bias scale:
bias_scale = input_0_scale * weight_0_scale
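Putting these steps together, here is a sketch of the adjustment in plain Python/NumPy (the function name and signature are illustrative; the real logic lives inside the QDQ quantizer):

```python
import numpy as np

def adjust_weight_scale_for_int32_bias(input_0_scale, weight_0_scale, bias_f32):
    """Grow weight_0_scale if needed so the quantized bias fits in int32.

    Sketch of the approach described above; returns (weight_0_scale, bias_scale).
    """
    int32_info = np.iinfo(np.int32)

    # Smallest bias scale that maps the f32 bias range onto the int32 range.
    bias_smallest_valid_scale = (float(bias_f32.max()) - float(bias_f32.min())) / (
        float(int32_info.max) - float(int32_info.min)
    )

    # Candidate scale derived from the input and weight scales.
    bias_scale_candidate = input_0_scale * weight_0_scale

    # If the candidate is too small, grow the weight scale by the needed ratio.
    if bias_scale_candidate < bias_smallest_valid_scale:
        ratio = bias_smallest_valid_scale / bias_scale_candidate
        weight_0_scale = ratio * weight_0_scale

    # Recompute the final bias scale from the (possibly adjusted) weight scale.
    bias_scale = input_0_scale * weight_0_scale
    return weight_0_scale, bias_scale

# With the example values: bias_smallest_valid_scale ~= 9.5 / 4294967295
# ~= 2.21e-9, so ratio ~= 56.4 and the quantized bias lands on the order of
# int32's range instead of ~59x beyond it.
w_scale, b_scale = adjust_weight_scale_for_int32_bias(0.5, 7.843e-11, np.array([5.0, -4.5]))
```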
Impact on accuracy
Here's the above model's quantized output compared to the f32 (ground-truth) output.
- Before PR:
  - f32 model output[0]: 5.0f
  - qdq model output[0]: 0.075
  - SNR: 0.1369 (higher is better)
- After PR:
  - f32 model output[0]: 5.0f
  - qdq model output[0]: 4.992
  - SNR: 55.656 (higher is better)
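For reference, these SNR values are consistent with the standard decibel definition of signal-to-noise ratio; the exact metric is an assumption here, since the PR does not spell it out:

```python
import numpy as np

def snr_db(reference, test):
    """SNR in dB between a reference signal and a test signal (assumed definition)."""
    noise = np.asarray(reference) - np.asarray(test)
    return 10.0 * np.log10(np.sum(np.square(reference)) / np.sum(np.square(noise)))

# For the single element shown above: snr_db([5.0], [0.075]) ~= 0.13 dB and
# snr_db([5.0], [4.992]) ~= 55.9 dB, close to the reported 0.1369 and 55.656
# (the full model outputs presumably contain more elements).
```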
This PR has been cherry-picked into the rel-1.20.1 branch in PR #22785.