
Output not matching after exporting updated Zipformer model to Onnx

Open bhaswa opened this issue 2 years ago • 12 comments

Hi, I have trained the latest streaming zipformer model with a custom dataset and exported the model to onnx. When I compare the output from the original pth model and the onnx model, an accuracy gap of 5% is found in the exported onnx model.

bhaswa avatar Jun 29 '23 09:06 bhaswa

an accuracy gap of 5% is found in the exported onnx model

Could you identify the wave files that cause inconsistent recognition results?

If yes, could you use one of them to compute the encoder output and compare whether the encoder output is the same for icefall and sherpa-onnx?
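A minimal sketch of how such a comparison might be done: dump each script's encoder output to disk and diff the arrays offline. The file names and the np.save() placement are assumptions, and synthetic arrays stand in for the real dumps here.

```python
import numpy as np

# Stand-in for the PyTorch script: in practice you would call something like
# np.save("encoder_out_torch.npy", encoder_out.detach().cpu().numpy())
# right after the encoder forward pass.
torch_out = np.random.randn(1, 16, 256).astype(np.float32)
np.save("encoder_out_torch.npy", torch_out)

# Stand-in for onnx_pretrained-streaming.py: save its encoder output the
# same way; a tiny perturbation here simulates float32 numeric drift.
np.save("encoder_out_onnx.npy", torch_out + np.float32(1e-6))

# Offline comparison of the two dumps:
a = np.load("encoder_out_torch.npy")
b = np.load("encoder_out_onnx.npy")
print("shapes:", a.shape, b.shape)
print("max abs diff:", np.abs(a - b).max())
```

If the shapes already differ, there is no point comparing values yet; fix the shape mismatch first.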

csukuangfj avatar Jun 29 '23 09:06 csukuangfj

Btw, I calculated the accuracy of onnx model using ./zipformer/onnx_pretrained-streaming.py, not sherpa-onnx.

bhaswa avatar Jun 29 '23 09:06 bhaswa

Btw, I calculated the accuracy of onnx model using ./zipformer/onnx_pretrained-streaming.py, not sherpa-onnx.

That is also ok. It is much easier to get the encoder output with ./zipformer/onnx_pretrained-streaming.py.

csukuangfj avatar Jun 29 '23 10:06 csukuangfj

@csukuangfj The output from the encoder layer does not match. I checked two audios: for one, the recognition result is the same; for the other, it is different. In both cases the encoder output does not match.

bhaswa avatar Jul 04 '23 05:07 bhaswa

@csukuangfj Any update on this?

bhaswa avatar Jul 07 '23 06:07 bhaswa

the output from the encoder layer does not match

How large is the difference? If the input is the same, the encoder output should also be the same within some numeric tolerance.

csukuangfj avatar Jul 07 '23 07:07 csukuangfj

I double checked the output. The outputs from the encoder layer are completely different. In fact, the dimensions do not match.

Dimension for pth: 1 x 16 x 256
Dimension for onnx: 1 x 16 x 512

bhaswa avatar Jul 07 '23 10:07 bhaswa

I double checked the output. The outputs from the encoder layer are completely different. In fact, the dimensions do not match.

Dimension for pth: 1 x 16 x 256

Dimension for onnx: 1 x 16 x 512

Please apply the joiner.encoder_proj layer to the PyTorch output (the one whose dim is 256) so that it becomes 512.

The ONNX version invokes joiner.encoder_proj automatically.

csukuangfj avatar Jul 07 '23 11:07 csukuangfj
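The suggested projection step can be sketched as follows. This is a minimal stand-in that assumes, from the dimensions above, that joiner.encoder_proj is a Linear(256, 512); the real layer (with trained weights) comes from the checkpoint's joiner module.

```python
import torch

# Hypothetical stand-in for the trained joiner.encoder_proj layer,
# assumed to map the 256-dim encoder output to the 512-dim joiner space.
encoder_proj = torch.nn.Linear(256, 512)

encoder_out = torch.randn(1, 16, 256)   # PyTorch encoder output
projected = encoder_proj(encoder_out)   # now shape-compatible with the ONNX output

print(projected.shape)  # torch.Size([1, 16, 512])
```

Only after this projection do the two outputs live in the same space and become directly comparable.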

After applying the joiner.encoder_proj layer after the encoder, the dimensions now match, but the values are still different.

bhaswa avatar Jul 07 '23 13:07 bhaswa

but the values are still different.

How large is the difference? You can use (a - b).abs().max() to get the max difference.

csukuangfj avatar Jul 07 '23 13:07 csukuangfj
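As a quick sanity check, that comparison might look like this, with synthetic tensors standing in for the real (projected) encoder outputs:

```python
import torch

a = torch.randn(1, 16, 512)             # e.g. projected PyTorch encoder output
b = a + 1e-6 * torch.randn(1, 16, 512)  # e.g. ONNX output with small numeric drift

max_diff = (a - b).abs().max().item()
print(f"max abs diff: {max_diff:.3e}")

# Differences at float32 noise level (~1e-5 or below) are expected;
# anything large points to a real mismatch in inputs or streaming states.
print("close:", torch.allclose(a, b, atol=1e-4))
```

If the max difference is large, the next thing to check is whether both runs really received the same features and initial states.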

  1. The number of times the encoder is called in pth inference is different from onnx inference. All the code used is streaming, FYI.

  2. For a 0.5 sec audio, pth calls the encoder 2 times, whereas in onnx it is called only 1 time.
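One hypothetical explanation is different chunking or tail padding between the two inference loops. The arithmetic can be sketched as below; the frame shift, chunk size, and padding values are assumptions for illustration, not the actual Zipformer defaults.

```python
import math

def num_chunks(duration_s, frame_shift_ms=10, chunk_frames=64, pad_frames=0):
    """How many encoder calls a streaming loop would make (assumed parameters)."""
    frames = int(duration_s * 1000 / frame_shift_ms) + pad_frames
    return math.ceil(frames / chunk_frames)

# With tail padding, one loop can make an extra encoder call for the same audio:
print(num_chunks(0.5))                 # 50 frames  -> 1 chunk
print(num_chunks(0.5, pad_frames=30))  # 80 frames  -> 2 chunks
```

So a 2-vs-1 call count for the same 0.5 s audio could simply mean one script pads the tail (or uses a smaller chunk) while the other does not; comparing per-chunk outputs only makes sense once the chunking matches.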

sanjuktasr avatar Jul 11 '23 06:07 sanjuktasr