
How to set the quantization parameters for bias?

Open xiexiaozheng opened this issue 1 year ago • 11 comments

I deploy on the Qualcomm HTP and want the bias quantization to be 32 bits, but I can't find a parameter in AIMET's configuration file to set the bias quantization bit width; there are only settings for activations and parameters. Do you know how to address this?

Additionally, I've noticed that when converting from ONNX to DLC, some operations are transformed into other operators. For example, 1x1 1D convolution is forced to be converted into 1x1 2D convolution. Although the weights are the same, the statistical information obtained from QAT is discarded during the conversion. Do you know how to resolve this issue?

xiexiaozheng avatar Sep 04 '23 03:09 xiexiaozheng

@xiexiaozheng , the way AIMET models 32-bit bias quantization is by disabling the bias quantizers altogether. In the resulting exported encodings file, there will not be an entry for bias parameters whose quantizers are disabled. When the encodings file is given to downstream tools like SNPE/QNN to bring the model on target, biases will end up with 32-bit quantization.

An example of how to set bias quantizers to be disabled can be found in default_config_per_channel.json: https://github.com/quic/aimet/blob/develop/TrainingExtensions/common/src/python/aimet_common/quantsim_config/default_config_per_channel.json

Note how, in the 'params' section, bias is specified with 'is_quantized': 'False'.
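
For reference, the relevant fragment looks roughly like this (abridged; please check the linked config for the exact contents in your AIMET version):

    "params": {
        "bias": {
            "is_quantized": "False"
        }
    }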

Regarding your second question, can you clarify what your current workflow is? Which framework are you starting off with? And are you using QAT and going through AIMET export to get an encodings and model file?

After export is finished, all of the necessary quantization parameters are contained in the encodings file. This should be all you need. When you say that the information is lost in conversion, how does this manifest itself? And how do you know it is causing an issue?

quic-klhsieh avatar Sep 05 '23 20:09 quic-klhsieh

@quic-klhsieh Thank you very much for your detailed explanation. In my case, I am using aimet_torch 1.25 to perform QAT on my model, and the target hardware platform for deployment is the SM8550 HTP. The specific steps are as follows:

  1. I use QuantizationSimModel to wrap the model, with the configuration set to 8-bit activations and 8-bit weights, and then fine-tune the model for QAT. The code and configuration are as below:
 from aimet_common.defs import QuantScheme
 from aimet_torch.quantsim import QuantizationSimModel

 quantsim = QuantizationSimModel(
     model=prepared_model,
     quant_scheme=QuantScheme.training_range_learning_with_tf_init,
     dummy_input=dummy_input,
     rounding_mode='nearest',
     default_output_bw=8,
     default_param_bw=8,
     in_place=True,
     config_file="/data/segmentation/MaskFormer/configs/aimet_qat/vit_config.json",
 )

The config JSON is as below:

{
	"defaults": {
		"ops": {
			"is_output_quantized": "True"
		},
		"params": {
			"is_quantized": "True"
		},
		"strict_symmetric": "False",
		"unsigned_symmetric": "True",
		"per_channel_quantization": "False"
	},
	"params": {
		"bias": {
			"is_quantized": "False"
		}
	},
	"op_type": {},
	"supergroups": [
		{
			"op_list": [
				"Conv",
				"BatchNormalization"
			]
		}
	],
	"model_input": {
		"is_input_quantized": "True"
	},
	"model_output": {}
}
  2. I export the model to ONNX format along with the corresponding model.encodings file (roughly as sketched after this list). Upon comparison, I found that all of the activation and weight encodings are included in the model.encodings file, except for some operators that have been split (for example, fully connected layers are split into MatMul and Add in the ONNX graph, and the output of the MatMul is not quantization-encoded).

  3. I use the SNPE command-line tool snpe-onnx-to-dlc to convert the model, passing the exported encodings file via --quantization_overrides=model.encodings. I've noticed that in the converted model, some operators' quantization information gets discarded. For example:

  • In my model, there is a Softmax layer. When this Softmax layer is converted to ONNX, transpose operations are inserted before and after it. However, when converting to DLC, the output encoding of the Softmax layer is empty, and only the transpose layer's output has encoding information.

  • My model contains fully connected layers. When converted to ONNX, these fully connected layers are split into matmul and add operations. The output encoding information is present in the add operation. However, when converting to DLC, matmul and add are merged back into a fully connected layer, and while the weight encoding information remains, the output encoding is discarded.

  • The model also includes 1D 1x1 convolutions. During the conversion to DLC, these 1D convolutions are transformed into 2D convolutions. Similarly, the weight encoding information is retained, but the output encoding information is lost.
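
For completeness, the calibration and export calls in steps 1–2 look roughly like this (a minimal sketch; pass_calibration_data is a hypothetical callback that runs a few calibration batches through the model):

 # Initialize the learnable ranges before QAT (required for training_range_learning_with_tf_init)
 quantsim.compute_encodings(forward_pass_callback=pass_calibration_data,
                            forward_pass_callback_args=None)

 # ... QAT fine-tuning of quantsim.model goes here ...

 # Export the ONNX model and the model.encodings file consumed by snpe-onnx-to-dlc
 quantsim.export(path='./export',
                 filename_prefix='model',
                 dummy_input=dummy_input.cpu())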

xiexiaozheng avatar Sep 06 '23 06:09 xiexiaozheng

@xiexiaozheng Can you let us know what version of SNPE you are using? Are you seeing the mismatched quantization encodings even after using the snpe-dlc-quantize tool with --override_params?

quic-akinlawo avatar Sep 06 '23 18:09 quic-akinlawo

snpe-dlc-quantize tool with --override_params

@quic-akinlawo I have tried SNPE versions 2.14, 2.13, 2.10, and 2.09 and encountered the same issue during model conversion. When I used the snpe-dlc-quantize tool with the --override_params option, the missing encodings were filled in with newly computed statistics. However, combining the exported model.encodings with the statistics generated by snpe-dlc-quantize had a significant impact on performance.

The pipeline is shown below:

BIN=${SNPE_PATH}/bin/x86_64-linux-clang/snpe-onnx-to-dlc
${BIN} \
  --input_network  ${onnx_model_path}\
  -d $INPUT_NAME $INPUT_DIM \
  --quantization_overrides=${override_params_path} \
  --debug \
  -o ${DLC}  2>&1 | tee info.log

BIN=${SNPE_PATH}/bin/x86_64-linux-clang/snpe-dlc-quant
${BIN} \
       --verbose \
       --log-file=./snpe-dlc-quant_log.txt \
       --input_dlc=${DLC} \
       --input_list=./raw_list.txt \
       --output_dlc=${quantized_dlc} \
       --bias_bitwidth=32 \
       --override_params

BIN=${SNPE_PATH}/bin/x86_64-linux-clang/snpe-dlc-graph-prepare
${BIN} \
       --verbose \
       --debug3 \
       --log-file=./snpe-dlc-graph-prepare_log.txt \
       --input_dlc=${quantized_dlc} \
       --output_dlc=${quantized_htp_dlc} \
       --htp_socs=sm8550

May I also ask: given that my target deployment platform is SM8550 and I'm deploying with SNPE 2.13 in W8A8 with 32-bit bias, is my current QuantizationSimModel configuration reasonable?

xiexiaozheng avatar Sep 07 '23 02:09 xiexiaozheng

@quic-klhsieh Hello, thank you for your explanation, but I still have a question. When I disable the bias quantizers, there is no quantization information related to bias in the exported xxx.encodings file. If snpe-dlc-quantize --override_params is then used, will the bias encodings be filled in by SNPE instead of taken directly from AIMET's quantization parameters? And will this lead to different results for the same test case in AIMET and SNPE?

buxianggaimingzi avatar Sep 11 '23 09:09 buxianggaimingzi

@quic-klhsieh At the beginning, when I configured the config_file for the QuantizationSimModel, I set up operator fusion to merge Conv and Relu. Consequently, in the exported encodings file, when both Conv and Relu were present, the Conv output encoding was empty and only the Relu output had encoding information. However, when I used snpe-dlc-quant to quantize the model, the encoding information for the Conv output was filled in. So my question is: is it necessary to have encoding information between Conv and Relu? (I observed that when the model runs on the SM8550 HTP, the time taken by Relu is 0, indicating that fusion has likely occurred.)

xiexiaozheng avatar Sep 12 '23 10:09 xiexiaozheng

@xiexiaozheng , it is true that SNPE will fill in the encodings for bias, which do not show up in the AIMET encodings. AIMET is not designed to be a bit-exact simulation of target quantization, but rather to provide a simulated accuracy. With 32 bits for bias, the difference between simulation and target will be negligible.

@quic-akinlawo , could you comment on the question of snpe-dlc-quant filling in encodings within a fused Conv -> Relu pair?

quic-klhsieh avatar Sep 12 '23 15:09 quic-klhsieh

@quic-klhsieh Could you please also share your opinion on my earlier question: with SM8550 as the target platform, SNPE 2.13 for deployment, and W8A8 with 32-bit bias, is my current QuantizationSimModel configuration reasonable?

xiexiaozheng avatar Sep 12 '23 15:09 xiexiaozheng

Hi @xiexiaozheng

Let me try and explain from my perspective. The runtime has logic for coalescing supergroups. In the case of Conv -> Relu, this means the runtime will execute these as a single op, and hence the quantization encoding at the output of the Conv is not needed. AIMET models this, as you have observed. SNPE adds a "token" encoding at the output of the Conv, since that is the interface with the runtime: every op has an associated output encoding. Because the Conv and Relu ops get fused by the runtime, this "token" encoding will be discarded, so I think you can ignore it.
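
As a point of reference, such a fused pair is declared as a supergroup in the quantsim config file, in the same form as the Conv/BatchNormalization entry in the config you posted earlier (a minimal sketch):

    "supergroups": [
        {
            "op_list": ["Conv", "Relu"]
        }
    ]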

Let me know if the above makes sense or if you have more questions.

quic-akhobare avatar Sep 14 '23 22:09 quic-akhobare

May I also ask: given that my target deployment platform is SM8550 and I'm deploying with SNPE 2.13 in W8A8 with 32-bit bias, is my current QuantizationSimModel configuration reasonable?

In my opinion you should always use 32-bit bias because there is no runtime overhead in doing so. In SNPE, there should be a flag for setting bias to 32 bits via the command line; please check the SNPE docs for this. And please do specify this flag, since the default is not 32 bits if I remember correctly.
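
For example, using the same tool as in the pipeline you posted above (flag names taken from that pipeline; please double-check them against the docs of your SNPE version):

    snpe-dlc-quant --input_dlc=${DLC} \
                   --input_list=./raw_list.txt \
                   --bias_bitwidth=32 \
                   --override_params \
                   --output_dlc=${quantized_dlc}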

quic-akhobare avatar Sep 14 '23 22:09 quic-akhobare

Thanks for the pointers @quic-akhobare. You should consider mentioning that in the AIMET docs.

I have validated that using 32-bit biases incurs negligible latency overhead compared to 8-bit biases.

BTW, I wonder whether AIMET could quantize the biases to 8 bits? Is it just a matter of changing the following setting?

{
    "params":
    {
        "bias":
        {
            "is_quantized": "False"
        }
    }
}

escorciav avatar Mar 07 '24 18:03 escorciav