tensorflow-onnx
Quantized model extra node emitted between Q-DQ pair
Describe the bug
When converting a quantized TFLite model to ONNX, extra nodes (e.g. Transpose, Reshape, etc.) are emitted between Q-DQ pairs. This prevents the ORT graph optimizer from effectively fusing operators and achieving good performance.
Original issue: https://github.com/microsoft/onnxruntime/issues/14707
e.g. TFLite model:
Converted ONNX model:
The Transpose node should be placed either before the QuantizeLinear node or after the DequantizeLinear node for the ORT graph optimizer to work.
TFLite model: https://github.com/microsoft/onnxruntime/files/10751803/quantized_tflite.zip
Converted ONNX model: https://github.com/microsoft/onnxruntime/files/10751800/quantized_onnx.zip
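For reference, here is a minimal post-processing sketch (not part of tf2onnx; file names are placeholders) that rewires `Q -> Transpose -> DQ` into `Transpose -> Q -> DQ` with the onnx Python API, producing the kind of "swapped" model ORT can fuse. It assumes per-tensor quantization, where Transpose commutes with QuantizeLinear:

```python
# Hypothetical workaround, not part of tf2onnx: move a Transpose that sits
# between a QuantizeLinear (Q) and a DequantizeLinear (DQ) in front of the Q,
# so ORT sees an adjacent Q-DQ pair it can fuse. Safe only for per-tensor
# quantization, where Transpose commutes with Q.
import onnx

model = onnx.load("quantized.onnx")  # placeholder path
graph = model.graph
producer = {out: n for n in graph.node for out in n.output}
swaps = []

for t in graph.node:
    if t.op_type != "Transpose":
        continue
    q = producer.get(t.input[0])
    if q is None or q.op_type != "QuantizeLinear":
        continue
    consumers = [n for n in graph.node if t.output[0] in n.input]
    if len(consumers) != 1 or consumers[0].op_type != "DequantizeLinear":
        continue
    dq = consumers[0]
    # Rewire X -> Q -> Transpose -> DQ  into  X -> Transpose -> Q -> DQ.
    t.input[0], q.input[0] = q.input[0], t.output[0]
    dq.input[0] = q.output[0]
    swaps.append((q, t))

# Restore topological order: each moved Transpose must now precede its Q.
nodes = list(graph.node)
for q, t in swaps:
    i = next(k for k, n in enumerate(nodes) if n is q)
    j = next(k for k, n in enumerate(nodes) if n is t)
    nodes[i], nodes[j] = nodes[j], nodes[i]
del graph.node[:]
graph.node.extend(nodes)

onnx.checker.check_model(model)
onnx.save(model, "quantized_swapped.onnx")
```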
Urgency
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 18.04*):
- TensorFlow Version:
- Python version:
- ONNX version (if applicable, e.g. 1.11*):
- ONNXRuntime version (if applicable, e.g. 1.11*):
To Reproduce
Screenshots
Additional context
Actually, this is a feature designed and implemented 2 years ago.
tf2onnx has an optimizer that pushes DequantizeLinear down so that most ops end up between a QuantizeLinear/DequantizeLinear pair (e.g. `Q -> DQ -> Transpose` is rewritten to `Q -> Transpose -> DQ`, so the Transpose runs on quantized data). I guess the motivation was to lower memory usage during inference.
Did you observe a big performance gap between the original ONNX file and the swapped ONNX file mentioned in https://github.com/microsoft/onnxruntime/issues/14707?
If there is a big performance gap, we probably need to consider whether this optimizer should be removed.
Yes, there is a huge performance drop when the separation of the Q-DQ pair prevents operator fusion from working. For example, in https://github.com/microsoft/onnxruntime/issues/14707, a very simple model ran more than twice as slow.
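For anyone who wants to reproduce the gap, a rough timing sketch with onnxruntime (the input name, shape, and dtype are placeholders; take them from the actual model):

```python
# Hypothetical benchmark sketch: compare average latency of the original and
# swapped models. Adjust the feed to the real model's input name/shape/dtype.
import time
import numpy as np
import onnxruntime as ort

def bench(path, feed, runs=100):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    for _ in range(10):  # warm-up
        sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs

feed = {"input": np.random.rand(1, 224, 224, 3).astype(np.float32)}  # placeholder
for path in ["quantized.onnx", "quantized_swapped.onnx"]:
    print(path, f"{bench(path, feed) * 1000:.2f} ms/run")
```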
Hi folks, any update? @hoangtv2000