aimet Increasing Discrepancy Between AIMET Simulation and QNN Models with Network Depth

Hello. I've observed a systematic increase in the difference between AIMET quantization simulation results and actual QNN model outputs as the network depth increases. This discrepancy could potentially impact the reliability of quantization predictions for deeper networks.

Experimental Setup

Model: Simple residual network
Each layer consists of: Conv2d(64,64,3) -> BatchNorm2d -> ReLU6
Input shape: (1, 64, 128, 128)
Quantization settings:
- 8-bit quantization for both parameters and outputs
- Training range learning with TF initialization scheme
- Per-channel quantization configuration

Observed Behavior

The difference between AIMET simulation and QNN execution increases significantly with network depth:

Number of Layers	MSE Difference	L1 Difference
1	5.64e-05	0.0012
3	3.77e-04	0.0071
5	9.35e-04	0.0148
7	2.35e-03	0.0275
9	5.58e-03	0.0459
11	9.93e-03	0.0641
13	1.52e-02	0.0818
15	2.36e-02	0.1026
17	3.27e-02	0.1223
19	4.18e-02	0.1374
21	5.88e-02	0.1540
23	6.14e-02	0.1633
25	7.35e-02	0.1825
27	8.77e-02	0.1991
29	1.04e-01	0.2153

Questions

Is this behavior expected?
Are there known limitations or assumptions in AIMET simulation that might explain this divergence?
Are there recommended practices for more accurate simulation of deeper networks?

Reproducible Code

Here's the complete code to reproduce this issue:

import torch
import os

from aimet_common.defs import QuantScheme
from aimet_common.quantsim_config.utils import get_path_for_per_channel_config
from aimet_torch.quantsim import QuantizationSimModel
from aimet_torch import model_preparer
from aimet_torch import batch_norm_fold
import qai_hub as hub

import shutil
torch.manual_seed(1517)

class SimpleModel(torch.nn.Module):
    def __init__(self, num_layers=10):
        super(SimpleModel, self).__init__()
        self.features = torch.nn.ModuleList([torch.nn.Sequential(
            torch.nn.Conv2d(64, 64, 3, padding="same"),
            torch.nn.BatchNorm2d(64),
            torch.nn.ReLU6(),
        ) for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.features:
            x = torch.add(x, layer(x))
        return x


input_shape = (1, 64, 128, 128)

def main(num_layers: int):
    # Step 1: Create and prepare model
    model = SimpleModel(num_layers)
    model = model_preparer.prepare_model(model)  # Prepare for quantization
    batch_norm_fold.fold_all_batch_norms(model, input_shapes=input_shape)  # Fold BN for better quantization

    # Step 2: Setup quantization simulation
    dummy_input = torch.randn(input_shape)  # Create random input tensor
    sim = QuantizationSimModel(
        model,
        dummy_input=dummy_input,
        quant_scheme=QuantScheme.training_range_learning_with_tf_init,  # Use TF initialization
        default_param_bw=8,      # 8-bit quantization for parameters
        default_output_bw=8,     # 8-bit quantization for activations
        config_file=get_path_for_per_channel_config()  # Use per-channel quantization
    )

    # Step 3: Calibrate the quantization parameters
    def pass_calibration_data(model: torch.nn.Module):
        model.eval()
        # Pass random data through model 10 times for calibration
        for _ in range(10):
            model(torch.randn(input_shape))

    sim.compute_encodings(pass_calibration_data)

    # Step 4: Export the quantized model
    model_dir = f"simple_{num_layers}_layer_model"
    file_prefix = f"simple_{num_layers}_layer_model"
    os.makedirs(model_dir, exist_ok=True)
    sim.export(
        model_dir,
        file_prefix,
        dummy_input=dummy_input
    )
    
    # Step 5: Prepare model for QNN compilation
    # Create .aimet directory and copy necessary files
    aimet_dir = f"{model_dir}.aimet"
    os.makedirs(aimet_dir, exist_ok=True)
    shutil.copy(f"{model_dir}/{file_prefix}.encodings", f"{aimet_dir}/{file_prefix}.encodings")
    shutil.copy(f"{model_dir}/{file_prefix}.onnx", f"{aimet_dir}/{file_prefix}.onnx")

    # Step 6: Compile model for target device
    compile_job = hub.submit_compile_job(
        name = f"simple_{num_layers}_layer_model",
        model = aimet_dir,
        device = hub.Device("Samsung Galaxy S24 Ultra"),  # Target device
        options = "--target_runtime qnn_context_binary --compute_unit all",
    )
    compile_job.download_target_model(f"{model_dir}/{file_prefix}.bin")

    # Step 7: Run inference on target device
    inference_job = hub.submit_inference_job(
        model = compile_job.get_target_model(),
        device = hub.Device("Samsung Galaxy S24 Ultra"),
        inputs = {list(compile_job.target_shapes.keys())[0]: [dummy_input.detach().numpy()]},
    )

    # Get inference results
    data = inference_job.download_output_data()

    # Step 8: Compare AIMET simulation vs QNN results
    torch_output = sim.model(dummy_input)  # AIMET simulation output
    qnn_output = torch.from_numpy(list(data.values())[0][0])  # QNN actual output

    # Calculate differences using MSE and L1 metrics
    mse_diff = torch.nn.functional.mse_loss(torch_output, qnn_output)
    l1_diff = torch.nn.functional.l1_loss(torch_output, qnn_output)

    print(f"num_layers: {num_layers}, MSE diff: {mse_diff}, L1 diff: {l1_diff}")


if __name__ == "__main__":
    # Test models with different depths (1 to 29 layers, odd numbers only)
    for num_layers in range(1, 30, 2):
        try:
            main(num_layers)
        except Exception as e:
            print(f"Error: {e}")

Environment

aimet-torch version: 2.3.0+cu121
AI_HUB version : 0.26.0
Device tested: Samsung Galaxy S24 Ultra
Python version: 3.10
PyTorch version: 2.4.0

Apr 17 '25 02:04 pei0033

I'm observing the same phenomenon with floating point (fp16) too. Is there a way to properly simulate QNN with PyTorch? If not, how should we do QAT?

Apr 17 '25 08:04 huijjj

Thanks @pei0033 for the detailed issue and repro script. We will run this internally and get back to you with initial feedback.

Apr 17 '25 21:04 quic-bhushans

Could you please let me know if there’s any progress?

May 28 '25 04:05 pei0033

@pei0033 Sorry for the late response, and I really appreciate your excellently reproducible code 👍

Unfortunately, I haven't had a chance to look into this problem in depth yet. For now, without any concrete analysis, I'm leaving some generally known facts that may help you.

Is this behavior expected?

In general, it is expected that the AIMET-to-QNN discrepancy grows as your model gets deeper. The magnitude of discrepancy also tends to be larger in dummy models with random weights than in real models with real weights. However, even with all that considered, the discrepancy in your example is worrying and requires a closer look.

Are there known limitations or assumptions in AIMET simulation that might explain this divergence?

There are some known sources of AIMET-to-QNN divergence. For example, a tiny numerical difference between AIMET and QNN can lead to round(x / scale) being evaluated as round(0.5) in AIMET and round(0.4999999) in QNN. There are also differences between kernel implementations, such as softmax, whose output can differ significantly depending on the concrete implementation when executed in low-bit precision. However, again, none of these known issues explains your current situation very well.
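A minimal sketch of the rounding-boundary effect described above, in plain Python rather than either backend's real kernel. The ratios and the scale are hypothetical values chosen to sit on a rounding boundary; 1.5 is used instead of 0.5 only because Python's built-in round() rounds ties to the nearest even integer.

```python
def to_int8_code(ratio: float, qmin: int = -128, qmax: int = 127) -> int:
    """Round x / scale to the nearest integer code and clamp it to the int8 range."""
    return max(qmin, min(qmax, round(ratio)))


scale = 0.05              # hypothetical quantization step
ratio_sim = 1.5           # x / scale as one backend might compute it
ratio_target = 1.4999999  # the same quantity with a tiny numerical difference

code_sim = to_int8_code(ratio_sim)
code_target = to_int8_code(ratio_target)
code_gap = code_sim - code_target

# A difference of about 1e-7 before rounding becomes one full quantization step after it.
print(f"sim code: {code_sim}, target code: {code_target}, "
      f"dequantized gap: {code_gap * scale}")
```

An isolated flip like this is an off-by-one in integer codes, which is why it is usually considered benign on its own.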

Are there recommended practices for more accurate simulation of deeper networks?

For now, I can generally recommend three rules.

Assuming you are targeting HTP, adhere to the standard HTP config file by passing config_file="htp_v<version number>". As of today, "htp_v81" is the latest and greatest.
Adhere to standard PyTorch APIs whenever possible. In other words, don't reinvent something that already exists in PyTorch. Why? - AIMET knows how to handle standard PyTorch APIs very well, but it knows little about custom modules defined by users.
Always prefer modular APIs to functional APIs. For example, always prefer using torch.nn.Linear instead of torch.nn.functional.linear. Why? - AIMET is designed to provide a richer set of features targeting module-style APIs. Roughly speaking, AIMET will convert each nn.Module into a corresponding quantized module, for example nn.Conv2d to aimet_torch.nn.QuantizedConv2d. As a trade-off, AIMET is not very good at handling functional APIs.

Regarding the specific problem in your issue, I'll come back with a better answer sometime this week.

May 28 '25 22:05 quic-kyunggeu

@pei0033 Here's a concrete analysis

tl;dr

This is not a bug 😊 This issue looks like a combination of 1. numerical illusion and 2. precision loss due to cumulative add.

1. Numerical illusion

At face value, it looks as if the AIMET-to-QNN mismatch has grown about 2000x from the 1-layer model to the 29-layer model. However, it is the output scale that really grew by almost 2000x, not the error between AIMET and QNN. In other words, we're comparing losses at different scales, as in this example:

>>> torch.nn.functional.l1_loss(torch.zeros(100), torch.ones(100))
tensor(1.)
>>> torch.nn.functional.l1_loss(torch.zeros(100) * 1000, torch.ones(100) * 1000)
tensor(1000.)

When the outputs are properly normalized by the output scale, the sim-to-target difference remains largely stable. (Off-by-N indicates (aimet_output_int8 - qnn_output_int8).abs().max().)
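To make the normalization concrete, here is a small pure-Python sketch. The integer codes and the two scales are made up for illustration (they are not taken from the experiment); the same integer mismatch is dequantized with a small scale and with a scale 2048x larger.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally sized sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [c * scale for c in codes]


def off_by_n(a, b, scale):
    """Largest mismatch between two outputs, measured in quantization steps."""
    return max(round(abs(x - y) / scale) for x, y in zip(a, b))


# Hypothetical integer outputs: two of the four codes differ by one step.
sim_codes = [10, 20, 30, 40]
target_codes = [10, 21, 30, 39]

report = {}
for name, scale in (("shallow", 2.0 ** -7), ("deep", 2.0 ** 4)):
    sim = dequantize(sim_codes, scale)
    target = dequantize(target_codes, scale)
    raw = l1_loss(sim, target)
    normalized = l1_loss([v / scale for v in sim], [v / scale for v in target])
    report[name] = (raw, normalized, off_by_n(sim, target, scale))
    print(f"{name}: raw L1 = {raw}, normalized L1 = {normalized}, "
          f"off-by-N = {report[name][2]}")
```

The raw L1 values differ by the ratio of the two scales, while the normalized L1 and the off-by-N value are identical, because the underlying integer mismatch is the same in both cases.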

# Layers	MSE Diff	MSE Diff (normalized)	L1 Diff	L1 Diff (normalized)	Off-by-N
1	5.639e-05	0.026	0.001	0.026	2.0
9	0.005	0.528	0.043	0.447	4.0
19	0.042	0.954	0.139	0.661	6.0
29	0.107	0.934	0.218	0.644	7.0

To reproduce this result, apply this patch and rerun your script

diff --git a/quic_issue_3978.py b/quic_issue_3978.py
index 792b905..079c976 100644
--- a/quic_issue_3978.py
+++ b/quic_issue_3978.py
@@ -6,6 +6,7 @@ from aimet_common.quantsim_config.utils import get_path_for_per_channel_config
 from aimet_torch.quantsim import QuantizationSimModel
 from aimet_torch import model_preparer
 from aimet_torch import batch_norm_fold
+from aimet_torch.nn.modules import custom
 import qai_hub as hub
 
 import shutil
@@ -19,10 +20,13 @@ class SimpleModel(torch.nn.Module):
             torch.nn.BatchNorm2d(64),
             torch.nn.ReLU6(),
         ) for _ in range(num_layers)])
+        self.adds = torch.nn.ModuleList([
+            custom.Add() for _ in range(num_layers)
+        ])
 
     def forward(self, x):
-        for layer in self.features:
-            x = torch.add(x, layer(x))
+        for layer, add in zip(self.features, self.adds):
+            x = add(x, layer(x))
         return x
 
 
@@ -31,7 +35,7 @@ input_shape = (1, 64, 128, 128)
 def main(num_layers: int):
     # Step 1: Create and prepare model
     model = SimpleModel(num_layers)
-    model = model_preparer.prepare_model(model)  # Prepare for quantization
+    # model = model_preparer.prepare_model(model)  # Prepare for quantization
     batch_norm_fold.fold_all_batch_norms(model, input_shapes=input_shape)  # Fold BN for better quantization
 
     # Step 2: Setup quantization simulation
@@ -97,13 +101,25 @@ def main(num_layers: int):
     # Calculate differences using MSE and L1 metrics
     mse_diff = torch.nn.functional.mse_loss(torch_output, qnn_output)
     l1_diff = torch.nn.functional.l1_loss(torch_output, qnn_output)
-
-    print(f"num_layers: {num_layers}, MSE diff: {mse_diff}, L1 diff: {l1_diff}")
+    output_scale = sim.model.adds[-1].output_quantizers[0].get_scale()
+    mse_diff_normalized = torch.nn.functional.mse_loss(torch_output / output_scale, qnn_output / output_scale)
+    l1_diff_normalized = torch.nn.functional.l1_loss(torch_output / output_scale, qnn_output / output_scale)
+    off_by_N = (torch_output - qnn_output).abs().div(output_scale).round().max().item()
+
+    print(
+        ", ".join([
+            f"num_layers: {num_layers}",
+            f"MSE diff: {mse_diff}",
+            f"MSE diff (normalized): {mse_diff_normalized}",
+            f"L1 diff: {l1_diff}",
+            f"L1 diff (normalized): {l1_diff_normalized}",
+            f"off-by:  {off_by_N}",
+        ])
+    )
 
 
 if __name__ == "__main__":
-    # Test models with different depths (1 to 29 layers, odd numbers only)
-    for num_layers in range(1, 30, 2):
+    for num_layers in [1, 9, 19, 29]:
         try:
             main(num_layers)
         except Exception as e:

2. Precision loss due to cumulative add

First of all, it is important to note that quantized add is a lossy operation. In pseudo-code, quantized add on a Qualcomm NPU looks like this (tons of details omitted for brevity):

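def quantized_add(x, y):
    # Shared range covering both inputs (named lo/hi to avoid shadowing the built-in min/max)
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    # rescale() stands for requantizing a tensor onto the shared [lo, hi] grid (details omitted)
    x_rescaled = rescale(x, lo, hi)
    y_rescaled = rescale(y, lo, hi)
    return x_rescaled + y_rescaled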

That said, when run on HTP, rescaling x into x_rescaled or y into y_rescaled can suffer serious precision loss if the ranges of x and y differ too much. However, this problem doesn't reveal itself at AIMET simulation time because AIMET doesn't use real quantized kernels but only simulates quantization with fake-quantization. To make things worse, your toy model structure amplifies this problem due to cumulative add: the output of layer(x) is strictly limited to the range [0, 6], but the range of the accumulator x can grow indefinitely.
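The rescaling loss can be illustrated with a rough pure-Python sketch. This is not the actual HTP kernel; it only assumes that the [0, 6] branch output is first quantized to 8 bits on its own range and then requantized onto the wider range of the accumulator. The accumulator upper bounds (6, 60, 600) are hypothetical.

```python
def quantize(values, lo, hi, bits=8):
    """Fake-quantize values onto a uniform grid spanning [lo, hi]."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    out = []
    for v in values:
        q = round((v - lo) / scale)
        q = max(0, min(levels, q))  # clamp to the representable codes
        out.append(lo + q * scale)
    return out


def max_rescale_error(branch, acc_hi):
    """Worst-case error from moving the branch output onto the accumulator's grid."""
    fine = quantize(branch, 0.0, 6.0)       # branch quantized on its own [0, 6] range
    coarse = quantize(fine, 0.0, acc_hi)    # requantized onto the shared, wider range
    return max(abs(f - c) for f, c in zip(fine, coarse))


# Branch output samples covering [0, 6], like the output of ReLU6.
branch = [i * 0.05 for i in range(121)]

errors = [max_rescale_error(branch, acc_hi) for acc_hi in (6.0, 60.0, 600.0)]
for acc_hi, err in zip((6.0, 60.0, 600.0), errors):
    print(f"accumulator range [0, {acc_hi}]: max rescale error = {err:.4f}")
```

In this sketch the error is bounded by half a step of the shared grid, so it grows with the accumulator range even though the branch values themselves never leave [0, 6].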

Usually this is not a big problem in real models because they tend to have several mathematical properties that are quantization-friendly. For example:

In real models, activations tend to follow a normal distribution.
In real models, activations do not tend to grow indefinitely.
In real models, outputs are often normalized with operations like Softmax, effectively nullifying the accumulated precision errors.
Real models tend to use validation/evaluation metrics that are robust against quantization noise, such as task loss, top-K accuracy, etc.

In fact, removing cumulative add from your toy model seems to remove almost all the sim-to-target mismatches.

# Layers	MSE Diff	MSE Diff (normalized)	L1 Diff	L1 Diff (normalized)	Off-by-N
1	9.699e-06	0.062	0.0007	0.062	1.0
9	3.893e-09	0.066	1.6069e-05	0.066	1.0
19	3.014e-09	0.053	1.2661e-05	0.053	1.0
29	2.819e-09	0.064	1.3528e-05	0.064	1.0

To reproduce this result, apply this patch on top of the patch already given in the previous section and rerun your script

diff --git a/quic_issue_3978.py b/quic_issue_3978.py
index 079c976..73d8c92 100644
--- a/quic_issue_3978.py
+++ b/quic_issue_3978.py
@@ -26,7 +26,7 @@ class SimpleModel(torch.nn.Module):

     def forward(self, x):
         for layer, add in zip(self.features, self.adds):
-            x = add(x, layer(x))
+            x = layer(x)
         return x


@@ -40,6 +40,7 @@ def main(num_layers: int):

     # Step 2: Setup quantization simulation
     dummy_input = torch.randn(input_shape)  # Create random input tensor
+
     sim = QuantizationSimModel(
         model,
         dummy_input=dummy_input,
@@ -101,7 +102,7 @@ def main(num_layers: int):
     # Calculate differences using MSE and L1 metrics
     mse_diff = torch.nn.functional.mse_loss(torch_output, qnn_output)
     l1_diff = torch.nn.functional.l1_loss(torch_output, qnn_output)
-    output_scale = sim.model.adds[-1].output_quantizers[0].get_scale()
+    output_scale = sim.model.features[-1][-1].output_quantizers[0].get_scale()
     mse_diff_normalized = torch.nn.functional.mse_loss(torch_output / output_scale, qnn_output / output_scale)
     l1_diff_normalized = torch.nn.functional.l1_loss(torch_output / output_scale, qnn_output / output_scale)
     off_by_N = (torch_output - qnn_output).abs().div(output_scale).round().max().item()

Please let me know if you have more doubts

May 31 '25 00:05 quic-kyunggeu