TFLite quantization: the int8 TFLite model is much slower than the float TFLite model
I converted a PyTorch model to an ONNX model, then to a TensorFlow model, and finally converted the TensorFlow model into a float TFLite model and an int8 TFLite model (with post-training quantization). Currently, my int8 TFLite model is much slower than my float TFLite model.
BTW, my model is based on a transformer architecture.
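For context, the conversion chain looks roughly like this (a condensed sketch rather than my exact script; the stand-in model, file paths, dummy input shape, and the onnx-tf step are placeholders):

```python
import torch
import onnx
import tensorflow as tf
from onnx_tf.backend import prepare

# Stand-in for the real transformer model (placeholder).
pytorch_model = torch.nn.Linear(768, 768).eval()
dummy_input = torch.randn(1, 768)  # placeholder input shape

# 1. PyTorch -> ONNX
torch.onnx.export(pytorch_model, dummy_input, "model.onnx", opset_version=13)

# 2. ONNX -> TensorFlow SavedModel (via onnx-tf)
prepare(onnx.load("model.onnx")).export_graph("saved_model_tf")

# 3. TensorFlow SavedModel -> float TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_tf")
with open("model_float.tflite", "wb") as f:
    f.write(converter.convert())
```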
Here are the comparisons:
| Model | Inference time | MSE | Size |
| --- | --- | --- | --- |
| PyTorch | 0.04 s | 0 | 439 MB |
| ONNX-Ori | 0.02 s | 2.062e-12 | 437 MB |
| ONNX-Opt | 0.02 s | 2.062e-12 | 437 MB |
| TFLite | 0.21 s | 5.496e-12 | 551 MB |
| Quantized TFLite | 1.42 s | 223.7 | 138 MB |
Could you please tell me what the reason is?
Hi @SkylerZheng,
Please provide more details, including model and reproduction code if possible.
If it is an x86 CPU, there is an issue with per-channel quantized dynamic range quantization in interpreters built from TF versions earlier than 2.6. Try the newest tf-nightly and add the flag
converter.experimental_disable_per_channel = True
during conversion.
On a mobile CPU, it is possible you do not have the newest optimizations, so try the latest TFLite binary. For other hardware, please describe what you are running.
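For dynamic range quantization, that would look roughly like this (a sketch; the SavedModel and output paths are placeholders):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_tf")  # placeholder path
# Dynamic range quantization: Optimize.DEFAULT with no representative dataset.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Workaround for the per-channel dynamic range issue in x86 interpreters built from TF < 2.6.
converter.experimental_disable_per_channel = True

tflite_model = converter.convert()
with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```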
@daverim Hi, thank you very much for your help. Yes, I'm using an x86 CPU, and I'm using the latest tf-nightly. I have a question: I installed tf-nightly with pip, so do I have to build TensorFlow from source with Bazel to get faster inference on an x86 CPU? Also, how can I tell which ops do not support quantization and therefore have to be converted from int8 back to fp32 when the op is executed?
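Would something like the sketch below be a reasonable way to check that, by listing the tensors that are still float32 in the quantized model (the model path is a placeholder)?

```python
import numpy as np
import tensorflow as tf

# Placeholder path to the quantized model.
interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")
interpreter.allocate_tensors()

# Tensors that keep a float32 dtype after full-integer conversion point at ops
# that were not quantized (or at the quantize/dequantize boundaries).
for detail in interpreter.get_tensor_details():
    if detail["dtype"] == np.float32:
        print(detail["index"], detail["name"], detail["shape"])
```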
I'm doing int8 quantization. Does this flag (converter.experimental_disable_per_channel = True) help with full integer quantization? I tested it and found that the speed did not change with this flag. For your reference, here is my TFLite quantization code:
```python
import tensorflow as tf

def representative_data_gen():
    for i in range(10):
        yield ort_inputs  # sample model inputs prepared elsewhere

converter = tf.lite.TFLiteConverter.from_saved_model(args.model_path_tf)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Workaround for x86 CPU: interpreters built from TF < 2.6 have an issue with
# per-channel quantized dynamic range quantization (per the advice above).
converter.experimental_disable_per_channel = True
# Ensure all ops are fully quantized.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to uint8 (APIs added in r2.3).
converter.inference_type = tf.int8
# converter.inference_input_type = tf.uint8
# converter.inference_output_type = tf.uint8

tflite_model_quant = converter.convert()
with open(args.model_path_tflite_quant, 'wb') as f:
    f.write(tflite_model_quant)
```
While I was doing the TFLite quantization, I saw this message:
"This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags."
Does this mean I have to build TensorFlow from source?
It is possible to build TensorFlow from source, but if you are running TFLite, you probably only care about mobile performance, no? In that case, you should test the performance of TFLite on device.
After disabling per-channel quantization, is your model still slower?
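If you want a quick check on x86 before benchmarking on device, a simple timing loop with the Python interpreter is enough. Here is a sketch (the model paths, thread count, run count, and random inputs are placeholders):

```python
import time
import numpy as np
import tensorflow as tf

def measure_latency(model_path, runs=20):
    interpreter = tf.lite.Interpreter(model_path=model_path, num_threads=4)
    interpreter.allocate_tensors()
    # Feed random data with the shapes/dtypes the model expects (placeholder inputs).
    for d in interpreter.get_input_details():
        data = np.random.random_sample(tuple(d["shape"])).astype(d["dtype"])
        interpreter.set_tensor(d["index"], data)
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) / runs

print("float tflite:", measure_latency("model_float.tflite"))
print("int8 tflite :", measure_latency("model_quant.tflite"))
```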
I'm using tf-nightly (the 2.7 version) right now. I tried other stable versions, like 1.9 or 2.4, but ended up with conversion errors, so I used the latest tf-nightly in the end, and it works well for conversion and quantization.
My question is: is the slow inference speed caused by how the TensorFlow binary was built? I'm afraid that if I build TensorFlow from source, I could end up with conversion failures.