TinyNeuralNetwork
outputs are different between a QAT tflite and corresponding de-quantized onnx model
I got a QAT int8 per-channel tflite model. To check the accuracy, I compare the inference results between it and the de-quantized onnx model.
- python3.6 -m tf2onnx.convert --opset 11 --tflite test.tflite --output temp.onnx --dequantize
  python3.6 -m onnxsim temp.onnx test.onnx  # test.onnx then holds float32 weights/bias
- Run test.onnx and test.tflite on the same input and compare the inference results. There are big differences between them; the onnx model produces much better results.
I'm not entirely clear about the inference flow of a QAT tflite model, so I'm not sure whether this situation is normal. I expected it to produce results similar to those of the onnx model.
Attach onnx and tflite model for reference. test.zip
@liamsun2019 As we all know, quantization is not lossless. I think it's pointless to perform this kind of comparison. There will certainly be some differences between the results of the quantized kernels and those of the floating-point kernels. Take the subgraph in the following picture as an example.
This is a common problem of imbalanced scale values between the two operands of the add operator. Ideally, they should be close; otherwise, the operand with the much smaller scale value is largely rounded away. Since symmetric quantization is applied here, I suggest that you try asymmetric quantization: because it includes offset (zero point) values, it can handle biased distributions better.
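To make the imbalanced-scale effect concrete, here is a minimal NumPy sketch (the operand values are illustrative, not taken from the actual model): when the two inputs of an add have very different magnitudes, the single output scale is dominated by the larger one, so the quantization step becomes as large as the smaller operand itself.

```python
import numpy as np

def quantize(x, scale, qmin=-128, qmax=127):
    # Symmetric int8 quantization with a single scale (zero point = 0).
    return np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical operands of an add with very different dynamic ranges.
a = np.array([10.0, -20.0, 15.0], dtype=np.float32)   # large-magnitude branch
b = np.array([0.05, -0.02, 0.03], dtype=np.float32)   # small-magnitude branch

# The add output uses one scale, which is dominated by the larger operand.
out_scale = np.abs(a + b).max() / 127
out = dequantize(quantize(a + b, out_scale), out_scale)

# The quantization step (~out_scale / 2) is larger than |b| here,
# so b's contribution is mostly rounded away.
print(out_scale / 2)
print(np.abs(out - (a + b)).max())
```

In this toy setup, half the quantization step (about 0.079) already exceeds every element of `b`, which is exactly the sense in which the small-scale operand "is ignored to some extent".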
Actually, I did train an asymmetric per-channel QAT network based on the same source model. But the resulting tflite model has exactly the same weights/bias as the symmetric per-channel one.
@liamsun2019 Are you sure you use the following config for the quantizer?
quantizer = QATQuantizer(model, dummy_input, config={'asymmetric': True, 'per_tensor': False, ...})
I will double check that. The experiments were done a few days ago, perhaps not based on the most recent version.
I just tried an asymmetric per-channel QAT model. It turns out that some ops really do have different scale/zero point values, for instance the add ops you illustrated. But the inference results are much worse than those of the symmetric per-channel QAT model.
Here I summarize the scenario. I intend to finetune a pretrained full-precision model which works well. I freeze most layers and only train a few other layers. Take A as the tflite model with asymmetric per-channel and B as the tflite model with symmetric per-channel.
- A and B have the same weights for all the layers, while the biases are different.
- A and B have different scale/zeropoint for some ops such as add.
- The dequantized onnx models have the same weights/bias and both work well.
- A has much worse inference results than B.
Items 1-3 are expected and make sense, but item 4 looks abnormal. I suppose A should be better. Any suggestions?
@liamsun2019 One thing in your model is weird. I've actually set quant_min and quant_max to -127 and 127, but you can still see -128 in the weights.
Yes, I also found -128 in some weights. I notice that you do set quant_min=-127 and quant_max=127 in quantizer.py; my understanding is that this is to avoid the risk of overflow. So what's the possible cause of this case?
Looks like we cannot set quant_min and quant_max this way; the observer has its own logic for re-calculating them. https://github.com/pytorch/pytorch/blob/4a8d4cde6589178e989db89d576108ba6d3e6e9a/torch/ao/quantization/utils.py#L192
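As an illustration of why the effective range matters (a simplified sketch, not PyTorch's actual observer code): with asymmetric quantization over the full int8 range, the most negative weight maps exactly to qmin, so -128 appears unless the range is effectively restricted to [-127, 127].

```python
import numpy as np

def asymmetric_quantize(w, qmin, qmax):
    # Simplified asymmetric quantization: min(w) maps to qmin, max(w) to qmax.
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = qmin - int(round(w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return q.astype(np.int8)

# Hypothetical weight values for illustration.
w = np.array([-1.0, 0.25, 0.5], dtype=np.float32)

q_full = asymmetric_quantize(w, -128, 127)        # full int8 range
q_restricted = asymmetric_quantize(w, -127, 127)  # restricted range

print(q_full.min())        # -128: the minimum weight lands on qmin
print(q_restricted.min())  # -127: restricting the range avoids -128
```

So if the observer internally recomputes quant_min/quant_max back to the full range, the requested [-127, 127] restriction is silently lost, which matches the -128 values seen in the exported weights.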
With https://github.com/alibaba/TinyNeuralNetwork/commit/e35ef92faba445830bb6156c916b1e838801c07e, the generated weights are within the range [-127, 127]. Would you please try again? BTW, I'm just curious how the model performs during QAT training.
If things still don't work out with that patch, you may try bisecting the model, which should be fairly easy since you have the model descriptive script there.
I'm on the way. Meanwhile, what's the meaning of 'bisecting', could you please explain it in more detail?
Suppose you have the following model description file, you may return the intermediate tensors (e.g. a or b) instead of the original ones, so that you could figure out which part of the model is not working.
```python
class Model(nn.Module):
    def forward(self, x):
        a = self.a(x)
        b = self.b(a)
        ....
        z = self.b(z)
        return z
```
My experiment shows no more -128 weights for the asymmetric per-channel case. I'm not sure if this is what you meant by "work out with that patch". As in the earlier experiments, asymmetric per-channel produces much worse inference results than symmetric per-channel.
See https://github.com/alibaba/TinyNeuralNetwork/issues/25#issuecomment-1011722160. You may try bisecting to figure out which layer leads to accuracy loss.
OK. Just to confirm: should 'bisecting' be done against the original model or the quantized model (the one produced by QATQuantizer.quantize)?
You can just do it on a trained quantized model. Just load the state dict with strict=False and it will be fine.
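A rough sketch of that workflow (the two-layer model here is hypothetical, standing in for the model description script generated by the quantizer, and the checkpoint loading is simulated):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    # Hypothetical stand-in for the generated model description script.
    def __init__(self):
        super().__init__()
        self.a = nn.Conv2d(3, 8, 3, padding=1)
        self.b = nn.Conv2d(8, 8, 3, padding=1)

    def forward(self, x):
        a = self.a(x)
        b = self.b(a)
        # Return an intermediate tensor instead of the final output
        # to check at which stage the quantized model diverges.
        return a

model = Model()
# strict=False tolerates missing/unexpected keys after editing forward()
# or trimming layers, so the trained quantized weights still load.
state = model.state_dict()  # stand-in for torch.load('qat_model.pth')
model.load_state_dict(state, strict=False)
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 3, 16, 16))
print(out.shape)  # inspect the intermediate activation
```

By moving the `return` from `a` to later tensors one step at a time and comparing against the float model's activations, you can narrow down which layer introduces the accuracy loss.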
Before conducting the experiments, I have a few more questions, since I am still not very clear about the point you mentioned:
Suppose you have the following model description file, you may return the intermediate tensors (e.g. a or b) instead of the original ones, so that you could figure out which part of the model is not working.

```python
class Model(nn.Module):
    def forward(self, x):
        a = self.a(x)
        b = self.b(a)
        ....
        z = self.b(z)
        return z
```
My current dilemma is that the int8 per-channel QAT tflite model has bad inference results compared to the de-quantized onnx model, and the u8 per-channel QAT model is even worse. Is the above sample meant for debugging which parts of the model contribute most to the quantization accuracy loss?
@liamsun2019 Do you have a DingTalk account, so that you can join our discussion group? This thread will grow too lengthy if we answer your questions here.
Sure, I will get that done.