TinyNeuralNetwork
outputs are different between a QAT tflite and corresponding de-quantized onnx model
I got a QAT int8 per-channel tflite model. To check the accuracy, I compare the inference results between it and the de-quantized onnx model.
- python3.6 -m tf2onnx.convert --opset 11 --tflite test.tflite --output temp.onnx --dequantize
  python3.6 -m onnxsim temp.onnx test.onnx  # test.onnx then holds float32 weights/bias
- Run test.onnx and test.tflite on the same input and compare the inference results. There are big differences between them; the onnx model produces much better results.
I'm not entirely clear about the inference flow of a QAT tflite model, so I'm not sure whether this situation is normal. I expected it to produce results similar to those of the onnx model.
Attach onnx and tflite model for reference. test.zip
@liamsun2019 As we all know, quantization is not lossless. I think it's pointless to perform this kind of comparison. There will certainly be some differences between the results of the quantized kernels and those of the floating-point kernels. Take the subgraph in the following picture as an example.
This is a common problem of imbalanced scale values between the two operands of the add operator. Ideally, they should be close; otherwise, the operand with the much smaller scale value is largely rounded away. Since symmetric quantization is applied here, I suggest that you try asymmetric quantization: because it includes offset (zero point) values, it can handle biased distributions better.
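To make the imbalanced-scale effect concrete, here is a minimal NumPy sketch (the operand values are illustrative, not taken from the actual model): when the two inputs of an add have very different magnitudes, the single output scale is dominated by the larger one, so the quantization step becomes as large as the smaller operand itself.

```python
import numpy as np

def quantize(x, scale, qmin=-128, qmax=127):
    # Symmetric int8 quantization with a single scale (zero point = 0).
    return np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical operands of an add with very different dynamic ranges.
a = np.array([10.0, -20.0, 15.0], dtype=np.float32)   # large-magnitude branch
b = np.array([0.05, -0.02, 0.03], dtype=np.float32)   # small-magnitude branch

# The add output uses one scale, which is dominated by the larger operand.
out_scale = np.abs(a + b).max() / 127
out = dequantize(quantize(a + b, out_scale), out_scale)

# The quantization step (~out_scale / 2) is larger than |b| here,
# so b's contribution is mostly rounded away.
print(out_scale / 2)
print(np.abs(out - (a + b)).max())
```

In this toy setup, half the quantization step (about 0.079) already exceeds every element of `b`, which is exactly the sense in which the small-scale operand "is ignored to some extent".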
Actually, I did train an asymmetric per-channel QAT network based on the same source model. But the resulting tflite model has exactly the same weights/bias as the symmetric per-channel one.
@liamsun2019 Are you sure you use the following config for the quantizer?
quantizer = QATQuantizer(model, dummy_input, config={'asymmetric': True, 'per_tensor': False, ...})
I will double check that. The experiments were done a few days ago, perhaps not based on the most recent version.
I just tried an asymmetric per-channel QAT model. It turns out that some ops really do have different scale/zero point values, for instance the add ops you illustrated. But the inference results are much worse than those of the symmetric per-channel QAT model.
Here I summarize the scenario. I intend to finetune a pretrained full-precision model which works well. I freeze most layers and only train a few other layers. Take A as the tflite model with asymmetric per-channel and B as the tflite model with symmetric per-channel.
- A and B have the same weights for all the layers, while the biases are different.
- A and B have different scale/zeropoint for some ops such as add.
- The dequantized onnx models have the same weights/bias and both work well.
- A has much worse inference results than B.
Items 1-3 are expected and make sense, but item 4 looks abnormal. I suppose A should be better. Any suggestions?
@liamsun2019 One thing in your model is weird. I've actually set quant_min and quant_max to -127 and 127, but you can still see -128 in the weights.
Yes, I also found -128 in some weights. I notice that you do set quant_min=-127 and quant_max=127 in quantizer.py; my understanding is that this is to avoid the risk of overflow. So what's the possible cause of this case?
Looks like we cannot set quant_min and quant_max this way; the observer has its own logic for re-calculating them. https://github.com/pytorch/pytorch/blob/4a8d4cde6589178e989db89d576108ba6d3e6e9a/torch/ao/quantization/utils.py#L192
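As an illustration of why the effective range matters (a simplified sketch, not PyTorch's actual observer code): with asymmetric quantization over the full int8 range, the most negative weight maps exactly to qmin, so -128 appears unless the range is effectively restricted to [-127, 127].

```python
import numpy as np

def asymmetric_quantize(w, qmin, qmax):
    # Simplified asymmetric quantization: min(w) maps to qmin, max(w) to qmax.
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = qmin - int(round(w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return q.astype(np.int8)

# Hypothetical weight values for illustration.
w = np.array([-1.0, 0.25, 0.5], dtype=np.float32)

q_full = asymmetric_quantize(w, -128, 127)        # full int8 range
q_restricted = asymmetric_quantize(w, -127, 127)  # restricted range

print(q_full.min())        # -128: the minimum weight lands on qmin
print(q_restricted.min())  # -127: restricting the range avoids -128
```

So if the observer internally recomputes quant_min/quant_max back to the full range, the requested [-127, 127] restriction is silently lost, which matches the -128 values seen in the exported weights.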
With https://github.com/alibaba/TinyNeuralNetwork/commit/e35ef92faba445830bb6156c916b1e838801c07e, the generated weights are within the range [-127, 127]. Would you please try again? BTW, I'm just curious how the model performs during QAT training.
If things still don't work out with that patch, you may try bisecting the model, which should be fairly easy since you have the model descriptive script there.
I'm on the way. Meanwhile, what's the meaning of 'bisecting', could you please explain it in more detail?
Suppose you have the following model description file, you may return the intermediate tensors (e.g. a or b) instead of the original ones, so that you could figure out which part of the model is not working.
```python
class Model(nn.Module):
    def forward(self, x):
        a = self.a(x)
        b = self.b(a)
        ....
        z = self.b(z)
        return z
```
My experiment shows no more -128 weights for the asymmetric per-channel case. I'm not sure if this is what you meant by "work out with that patch". As in the earlier experiments, asymmetric per-channel produces much worse inference results than symmetric per-channel.
See https://github.com/alibaba/TinyNeuralNetwork/issues/25#issuecomment-1011722160. You may try bisecting to figure out which layer leads to accuracy loss.
OK. Just to confirm: should 'bisecting' be done against the original model or the quantized model (the one produced by QATQuantizer.quantize)?
You can just do it on a trained quantized model. Just load the state dict with strict=False and it will be fine.
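A rough sketch of that workflow (the two-layer model here is hypothetical, standing in for the model description script generated by the quantizer, and the checkpoint loading is simulated):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    # Hypothetical stand-in for the generated model description script.
    def __init__(self):
        super().__init__()
        self.a = nn.Conv2d(3, 8, 3, padding=1)
        self.b = nn.Conv2d(8, 8, 3, padding=1)

    def forward(self, x):
        a = self.a(x)
        b = self.b(a)
        # Return an intermediate tensor instead of the final output
        # to check at which stage the quantized model diverges.
        return a

model = Model()
# strict=False tolerates missing/unexpected keys after editing forward()
# or trimming layers, so the trained quantized weights still load.
state = model.state_dict()  # stand-in for torch.load('qat_model.pth')
model.load_state_dict(state, strict=False)
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 3, 16, 16))
print(out.shape)  # inspect the intermediate activation
```

By moving the `return` from `a` to later tensors one step at a time and comparing against the float model's activations, you can narrow down which layer introduces the accuracy loss.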
Before conducting the experiments, I have a few more questions, since I am still not very clear about the point you mentioned:
Suppose you have the following model description file, you may return the intermediate tensors (e.g. a or b) instead of the original ones, so that you could figure out which part of the model is not working.

```python
class Model(nn.Module):
    def forward(self, x):
        a = self.a(x)
        b = self.b(a)
        ....
        z = self.b(z)
        return z
```
My current dilemma is that the int8 per-channel QAT tflite model has bad inference results compared to the de-quantized onnx model, and the u8 per-channel QAT model is even worse. Is the above sample meant for debugging which parts of the model contribute most to the quantization accuracy loss?
@liamsun2019 Do you have a DingTalk account, so that you can join our discussion group? This thread will grow too lengthy if we answer your questions here.
Sure, I will get that done.