Increasing Discrepancy Between AIMET Simulation and QNN Models with Network Depth
Hello. I've observed a systematic increase in the difference between AIMET quantization simulation results and actual QNN model outputs as the network depth increases. This discrepancy could potentially impact the reliability of quantization predictions for deeper networks.
Experimental Setup
- Model: Simple residual network
- Each layer consists of: Conv2d(64,64,3) -> BatchNorm2d -> ReLU6
- Input shape: (1, 64, 128, 128)
- Quantization settings:
  - 8-bit quantization for both parameters and outputs
  - Training range learning with TF initialization scheme
  - Per-channel quantization configuration
Observed Behavior
The difference between AIMET simulation and QNN execution increases significantly with network depth:
| Number of Layers | MSE Difference | L1 Difference |
|---|---|---|
| 1 | 5.64e-05 | 0.0012 |
| 3 | 3.77e-04 | 0.0071 |
| 5 | 9.35e-04 | 0.0148 |
| 7 | 2.35e-03 | 0.0275 |
| 9 | 5.58e-03 | 0.0459 |
| 11 | 9.93e-03 | 0.0641 |
| 13 | 1.52e-02 | 0.0818 |
| 15 | 2.36e-02 | 0.1026 |
| 17 | 3.27e-02 | 0.1223 |
| 19 | 4.18e-02 | 0.1374 |
| 21 | 5.88e-02 | 0.1540 |
| 23 | 6.14e-02 | 0.1633 |
| 25 | 7.35e-02 | 0.1825 |
| 27 | 8.77e-02 | 0.1991 |
| 29 | 1.04e-01 | 0.2153 |
Questions
- Is this behavior expected?
- Are there known limitations or assumptions in AIMET simulation that might explain this divergence?
- Are there recommended practices for more accurate simulation of deeper networks?
Reproducible Code
Here's the complete code to reproduce this issue:
import torch
import os
from aimet_common.defs import QuantScheme
from aimet_common.quantsim_config.utils import get_path_for_per_channel_config
from aimet_torch.quantsim import QuantizationSimModel
from aimet_torch import model_preparer
from aimet_torch import batch_norm_fold
import qai_hub as hub
import shutil

torch.manual_seed(1517)


class SimpleModel(torch.nn.Module):
    def __init__(self, num_layers=10):
        super(SimpleModel, self).__init__()
        self.features = torch.nn.ModuleList([torch.nn.Sequential(
            torch.nn.Conv2d(64, 64, 3, padding="same"),
            torch.nn.BatchNorm2d(64),
            torch.nn.ReLU6(),
        ) for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.features:
            x = torch.add(x, layer(x))
        return x


input_shape = (1, 64, 128, 128)


def main(num_layers: int):
    # Step 1: Create and prepare model
    model = SimpleModel(num_layers)
    model = model_preparer.prepare_model(model)  # Prepare for quantization
    batch_norm_fold.fold_all_batch_norms(model, input_shapes=input_shape)  # Fold BN for better quantization

    # Step 2: Setup quantization simulation
    dummy_input = torch.randn(input_shape)  # Create random input tensor
    sim = QuantizationSimModel(
        model,
        dummy_input=dummy_input,
        quant_scheme=QuantScheme.training_range_learning_with_tf_init,  # Use TF initialization
        default_param_bw=8,  # 8-bit quantization for parameters
        default_output_bw=8,  # 8-bit quantization for activations
        config_file=get_path_for_per_channel_config()  # Use per-channel quantization
    )

    # Step 3: Calibrate the quantization parameters
    def pass_calibration_data(model: torch.nn.Module):
        model.eval()
        # Pass random data through model 10 times for calibration
        for _ in range(10):
            model(torch.randn(input_shape))

    sim.compute_encodings(pass_calibration_data)

    # Step 4: Export the quantized model
    model_dir = f"simple_{num_layers}_layer_model"
    file_prefix = f"simple_{num_layers}_layer_model"
    os.makedirs(model_dir, exist_ok=True)
    sim.export(
        model_dir,
        file_prefix,
        dummy_input=dummy_input
    )

    # Step 5: Prepare model for QNN compilation
    # Create .aimet directory and copy necessary files
    aimet_dir = f"{model_dir}.aimet"
    os.makedirs(aimet_dir, exist_ok=True)
    shutil.copy(f"{model_dir}/{file_prefix}.encodings", f"{aimet_dir}/{file_prefix}.encodings")
    shutil.copy(f"{model_dir}/{file_prefix}.onnx", f"{aimet_dir}/{file_prefix}.onnx")

    # Step 6: Compile model for target device
    compile_job = hub.submit_compile_job(
        name = f"simple_{num_layers}_layer_model",
        model = aimet_dir,
        device = hub.Device("Samsung Galaxy S24 Ultra"),  # Target device
        options = f"--target_runtime qnn_context_binary --compute_unit all",
    )
    compile_job.download_target_model(f"{model_dir}/{file_prefix}.bin")

    # Step 7: Run inference on target device
    inference_job = hub.submit_inference_job(
        model = compile_job.get_target_model(),
        device = hub.Device("Samsung Galaxy S24 Ultra"),
        inputs = {list(compile_job.target_shapes.keys())[0]: [dummy_input.detach().numpy()]},
    )
    # Get inference results
    data = inference_job.download_output_data()

    # Step 8: Compare AIMET simulation vs QNN results
    torch_output = sim.model(dummy_input)  # AIMET simulation output
    qnn_output = torch.from_numpy(list(data.values())[0][0])  # QNN actual output
    # Calculate differences using MSE and L1 metrics
    mse_diff = torch.nn.functional.mse_loss(torch_output, qnn_output)
    l1_diff = torch.nn.functional.l1_loss(torch_output, qnn_output)

    print(f"num_layers: {num_layers}, MSE diff: {mse_diff}, L1 diff: {l1_diff}")


if __name__ == "__main__":
    # Test models with different depths (1 to 29 layers, odd numbers only)
    for num_layers in range(1, 30, 2):
        try:
            main(num_layers)
        except Exception as e:
            print(f"Error: {e}")
Environment
- aimet-torch version: 2.3.0+cu121
- AI_HUB version: 0.26.0
- Device tested: Samsung Galaxy S24 Ultra
- Python version: 3.10
- PyTorch version: 2.4.0
I'm observing the same phenomenon with floating point (fp16) too. Is there a way to properly simulate QNN with PyTorch? If not, how can we do QAT?
Thanks @pei0033 for the detailed issue and repro script. We will run this internally and get back to you with initial feedback.
Could you please let me know if there’s any progress?
@pei0033 Sorry for the late response, and I really appreciate your excellently reproducible code 👍
Unfortunately, I haven't had a chance to look into this problem in depth yet. For now, without any concrete analysis, I'm leaving some generally known facts that may help you.
- Is this behavior expected?
In general, it is expected that the AIMET-to-QNN discrepancy grows as your model gets deeper. The magnitude of discrepancy also tends to be larger in dummy models with random weights than in real models with real weights. However, even with all that considered, the discrepancy in your example is worrying and requires a closer look.
- Are there known limitations or assumptions in AIMET simulation that might explain this divergence?
There are some known sources of AIMET-to-QNN divergence. For example, a tiny numerical difference between AIMET and QNN can lead to round(x / scale) being evaluated as round(0.5) in AIMET and round(0.4999999) in QNN. There are also differences between kernel implementations, such as softmax, whose output can differ significantly depending on the concrete implementation when executed in low-bit precision. However, again, none of these known issues explains your current situation very well.
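The rounding case can be made concrete with a small sketch in plain Python. It does not use AIMET or QNN code: the `quantize` helper, the scale of 0.25, and the 1e-7 perturbation are all assumptions chosen for illustration. It uses 1.5 rather than 0.5 as the boundary because Python's built-in `round` rounds exact halves to the nearest even integer.

```python
def quantize(x: float, scale: float) -> int:
    """Map a float onto an integer grid with round-to-nearest (illustrative helper)."""
    return round(x / scale)


scale = 0.25              # hypothetical quantization step
x_sim = 0.375             # x / scale is exactly 1.5
x_target = x_sim - 1e-7   # the same value after a tiny numerical difference

q_sim = quantize(x_sim, scale)        # lands on the upper grid point
q_target = quantize(x_target, scale)  # lands on the lower grid point

# After dequantization the two results differ by one full quantization step,
# even though the inputs differed by only 1e-7.
diff = abs(q_sim - q_target) * scale
print(q_sim, q_target, diff)
```

A single flip like this is an off-by-one error in the integer domain; later layers can carry it forward.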
- Are there recommended practices for more accurate simulation of deeper networks?
For now, I can generally recommend three rules.
- Assuming you are targeting HTP, adhere to the standard HTP config file by passing config_file="htp_v<version number>". As of today, "htp_v81" is the latest and greatest.
- Adhere to standard PyTorch APIs whenever possible. In other words, don't reinvent something that already exists in PyTorch.
  - Why? AIMET knows how to handle standard PyTorch APIs very well, but it knows little about custom modules defined by users.
- Always prefer modular APIs to functional APIs. For example, always prefer using torch.nn.Linear instead of torch.nn.functional.linear.
  - Why? AIMET is designed to provide a richer set of features targeting module-style APIs. Roughly speaking, AIMET will convert each nn.Module into a corresponding quantized module, for example nn.Conv2d to aimet_torch.nn.QuantizedConv2d. As a trade-off, AIMET is not very good at handling functional APIs.
Regarding the specific problem in your issue, I'll come back with a better answer some time this week.
@pei0033 Here's a concrete analysis
tl;dr
This is not a bug 😊 This issue looks like a combination of (1) a numerical illusion and (2) precision loss due to cumulative add.
1. Numerical illusion
At face value, it looks as if the AIMET-to-QNN mismatch has grown about 2000x from the 1-layer model to the 29-layer model. However, it is the output scale that really grew by almost 2000x, not the error between AIMET and QNN. In other words, we're comparing losses with different scales, as in this example:
>>> torch.nn.functional.l1_loss(torch.zeros(100), torch.ones(100))
tensor(1.)
>>> torch.nn.functional.l1_loss(torch.zeros(100) * 1000, torch.ones(100) * 1000)
tensor(1000.)
When the outputs are properly normalized by the output scale, the sim-to-target difference remains largely stable.
(Off-by-N indicates (aimet_output_int8 - qnn_output_int8).abs().max())
| # Layers | MSE Diff | MSE Diff (normalized) | L1 Diff | L1 Diff (normalized) | Off-by-N |
|---|---|---|---|---|---|
| 1 | 5.639e-05 | 0.026 | 0.001 | 0.026 | 2.0 |
| 9 | 0.005 | 0.528 | 0.043 | 0.447 | 4.0 |
| 19 | 0.042 | 0.954 | 0.139 | 0.661 | 6.0 |
| 29 | 0.107 | 0.934 | 0.218 | 0.644 | 7.0 |
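The effect of normalizing by the output scale can be sketched in plain Python. The integer outputs and the two scales below are made-up values, not measurements from this issue, and `l1` and `off_by_n` are illustrative helpers that mirror the metrics in the table above.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def off_by_n(a, b, scale):
    """Largest mismatch expressed in integer quantization steps."""
    return max(round(abs(p - q) / scale) for p, q in zip(a, b))


# Hypothetical integer outputs from simulation and device (not measured data).
sim_int = [10, 20, 30, 40]
dev_int = [10, 21, 30, 39]

results = {}
for scale in (0.04, 0.4):  # a small and a 10x larger output scale, both made up
    sim = [q * scale for q in sim_int]
    dev = [q * scale for q in dev_int]
    raw = l1(sim, dev)
    normalized = l1([v / scale for v in sim], [v / scale for v in dev])
    results[scale] = (raw, normalized, off_by_n(sim, dev, scale))
    print(f"scale={scale}: raw L1={raw:.3f}, normalized L1={normalized:.3f}, "
          f"off-by-N={results[scale][2]}")
```

The integer mismatch is identical in both cases, yet the raw L1 is ten times larger at the larger scale, while the normalized L1 and the off-by-N value do not move.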
To reproduce the results in the table above, apply this patch and rerun your script:
diff --git a/quic_issue_3978.py b/quic_issue_3978.py
index 792b905..079c976 100644
--- a/quic_issue_3978.py
+++ b/quic_issue_3978.py
@@ -6,6 +6,7 @@ from aimet_common.quantsim_config.utils import get_path_for_per_channel_config
 from aimet_torch.quantsim import QuantizationSimModel
 from aimet_torch import model_preparer
 from aimet_torch import batch_norm_fold
+from aimet_torch.nn.modules import custom
 import qai_hub as hub
 import shutil
@@ -19,10 +20,13 @@ class SimpleModel(torch.nn.Module):
             torch.nn.BatchNorm2d(64),
             torch.nn.ReLU6(),
         ) for _ in range(num_layers)])
+        self.adds = torch.nn.ModuleList([
+            custom.Add() for _ in range(num_layers)
+        ])
     def forward(self, x):
-        for layer in self.features:
-            x = torch.add(x, layer(x))
+        for layer, add in zip(self.features, self.adds):
+            x = add(x, layer(x))
         return x
@@ -31,7 +35,7 @@ input_shape = (1, 64, 128, 128)
 def main(num_layers: int):
     # Step 1: Create and prepare model
     model = SimpleModel(num_layers)
-    model = model_preparer.prepare_model(model) # Prepare for quantization
+    # model = model_preparer.prepare_model(model) # Prepare for quantization
     batch_norm_fold.fold_all_batch_norms(model, input_shapes=input_shape) # Fold BN for better quantization
     # Step 2: Setup quantization simulation
@@ -97,13 +101,25 @@ def main(num_layers: int):
     # Calculate differences using MSE and L1 metrics
     mse_diff = torch.nn.functional.mse_loss(torch_output, qnn_output)
     l1_diff = torch.nn.functional.l1_loss(torch_output, qnn_output)
-
-    print(f"num_layers: {num_layers}, MSE diff: {mse_diff}, L1 diff: {l1_diff}")
+    output_scale = sim.model.adds[-1].output_quantizers[0].get_scale()
+    mse_diff_normalized = torch.nn.functional.mse_loss(torch_output / output_scale, qnn_output / output_scale)
+    l1_diff_normalized = torch.nn.functional.l1_loss(torch_output / output_scale, qnn_output / output_scale)
+    off_by_N = (torch_output - qnn_output).abs().div(output_scale).round().max().item()
+
+    print(
+        ", ".join([
+            f"num_layers: {num_layers}",
+            f"MSE diff: {mse_diff}",
+            f"MSE diff (normalized): {mse_diff_normalized}",
+            f"L1 diff: {l1_diff}",
+            f"L1 diff (normalized): {l1_diff_normalized}",
+            f"off-by: {off_by_N}",
+        ])
+    )
 if __name__ == "__main__":
-    # Test models with different depths (1 to 29 layers, odd numbers only)
-    for num_layers in range(1, 30, 2):
+    for num_layers in [1, 9, 19, 29]:
         try:
             main(num_layers)
         except Exception as e:
2. Precision loss due to cumulative add
First of all, it is important to note that quantized add is a lossy operation. In pseudo-code, quantized add on Qualcomm NPU looks like this (tons of details omitted for brevity):
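def quantized_add(x, y):
    # Pseudo-code: pick one shared range that covers both inputs
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    # Requantize both inputs onto the shared range (this is the lossy step)
    x_rescaled = rescale(x, lo, hi)
    y_rescaled = rescale(y, lo, hi)
    return x_rescaled + y_rescaled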
That said, when run on HTP, rescaling x into x_rescaled or y into y_rescaled can suffer serious precision loss if the ranges of x and y differ too much.
However, this problem doesn't reveal itself at AIMET simulation time because AIMET doesn't use real quantized kernels but only simulates quantization with fake-quantization.
To make things worse, your toy model structure amplifies this problem due to cumulative add, since the output layer(x) is strictly limited to the range [0, 6], but the range of the accumulator x can grow indefinitely.
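The loss from rescaling onto a much wider shared range can be illustrated with a minimal sketch in plain Python. This is not the HTP kernel: `fake_quantize` is a hypothetical uniform 8-bit quantizer, and the accumulator range [0, 600] is an arbitrary wide range chosen for illustration, not a value measured from the model.

```python
def fake_quantize(x: float, lo: float, hi: float, bits: int = 8) -> float:
    """Snap x onto a uniform grid of 2**bits levels spanning [lo, hi]."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    q = round((x - lo) / scale)
    q = max(0, min(levels, q))  # clamp to the representable range
    return lo + q * scale


def distinct_levels(values, lo, hi):
    """Count how many different grid points the values land on."""
    return len({fake_quantize(v, lo, hi) for v in values})


# Branch output limited to [0, 6] by ReLU6, sampled every 0.1.
branch = [i * 0.1 for i in range(61)]

own_range = distinct_levels(branch, 0.0, 6.0)       # grid fitted to the branch itself
shared_range = distinct_levels(branch, 0.0, 600.0)  # grid stretched to a wide accumulator

print(own_range, shared_range)
```

With the grid fitted to [0, 6], every sample keeps its own level. Stretched to the wide range, the same samples collapse onto a handful of levels, which is the kind of loss a real quantized add can incur and fake-quantization of each tensor on its own range does not show.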
Usually this is not a big problem in real models because they tend to have several mathematical properties that are quantization-friendly. For example:
- In real models, activations tend to follow a normal distribution.
- In real models, activations do not tend to grow indefinitely.
- In real models, outputs are often normalized with operations like Softmax, effectively nullifying the accumulated precision errors.
- Real models tend to use validation/evaluation metrics that are robust against quantization noise, such as task loss, top-K accuracy, etc.
In fact, removing cumulative add from your toy model seems to remove almost all the sim-to-target mismatches.
| # Layers | MSE Diff | MSE Diff (normalized) | L1 Diff | L1 Diff (normalized) | Off-by-N |
|---|---|---|---|---|---|
| 1 | 9.699e-06 | 0.062 | 0.0007 | 0.062 | 1.0 |
| 9 | 3.893e-09 | 0.066 | 1.6069e-05 | 0.066 | 1.0 |
| 19 | 3.014e-09 | 0.053 | 1.2661e-05 | 0.053 | 1.0 |
| 29 | 2.819e-09 | 0.064 | 1.3528e-05 | 0.064 | 1.0 |
To reproduce the results in the table above, apply this patch on top of the patch already given in the previous section and rerun your script:
diff --git a/quic_issue_3978.py b/quic_issue_3978.py
index 079c976..73d8c92 100644
--- a/quic_issue_3978.py
+++ b/quic_issue_3978.py
@@ -26,7 +26,7 @@ class SimpleModel(torch.nn.Module):
     def forward(self, x):
         for layer, add in zip(self.features, self.adds):
-            x = add(x, layer(x))
+            x = layer(x)
         return x
@@ -40,6 +40,7 @@ def main(num_layers: int):
     # Step 2: Setup quantization simulation
     dummy_input = torch.randn(input_shape) # Create random input tensor
+
     sim = QuantizationSimModel(
         model,
         dummy_input=dummy_input,
@@ -101,7 +102,7 @@ def main(num_layers: int):
     # Calculate differences using MSE and L1 metrics
     mse_diff = torch.nn.functional.mse_loss(torch_output, qnn_output)
     l1_diff = torch.nn.functional.l1_loss(torch_output, qnn_output)
-    output_scale = sim.model.adds[-1].output_quantizers[0].get_scale()
+    output_scale = sim.model.features[-1][-1].output_quantizers[0].get_scale()
     mse_diff_normalized = torch.nn.functional.mse_loss(torch_output / output_scale, qnn_output / output_scale)
     l1_diff_normalized = torch.nn.functional.l1_loss(torch_output / output_scale, qnn_output / output_scale)
     off_by_N = (torch_output - qnn_output).abs().div(output_scale).round().max().item()
Please let me know if you have further questions.