Issue in first word in zipformer2

Open bhaswa opened this issue 1 year ago • 10 comments

I found that in many cases the first word of an audio file is not decoded properly. When I use the .pth model in icefall instead of the .onnx model, the word is decoded correctly.

bhaswa avatar Sep 29 '23 08:09 bhaswa

Are you using the latest icefall to export the model, and are you also using the latest sherpa-onnx for testing?

csukuangfj avatar Sep 29 '23 12:09 csukuangfj

Yes. I updated both icefall and sherpa-onnx. Still facing the same issue.

bhaswa avatar Sep 29 '23 13:09 bhaswa

Any update on this?

bhaswa avatar Oct 03 '23 09:10 bhaswa

Are you able to share the test wave file?

csukuangfj avatar Oct 03 '23 14:10 csukuangfj

I observed this behavior in a model that I trained on custom data. The same audio might not behave the same way with another model (one trained on a different set of data).

bhaswa avatar Oct 04 '23 10:10 bhaswa

Is there any pre-trained model (.pth) available? I will share a test wave file tested on that model.

bhaswa avatar Oct 05 '23 11:10 bhaswa

> Is there any pre-trained model (.pth) available? I will share a test wave file tested on that model.

Yes, please find the models in the RESULTS.md of each recipe in icefall, e.g., librispeech. For each experiment, there is a link to the huggingface repo containing pre-trained models.

csukuangfj avatar Oct 07 '23 01:10 csukuangfj

I am attaching two audio files here.

audios.zip

1.wav:
pth output (from icefall): GOD'S THE OLD SCHOOL STROKE OUT
onnx output (from icefall): GUIDABLE SCHOOL STROKE OUT
onnx output (from sherpa-onnx): GO AS TO WALK SCHOOL STROKE OUT
(None of the three outputs match.)

2.wav:
pth output (from icefall): SYSTEM MENU OR PIANGARA AT YOU WITH THEIR KEY
onnx output (from icefall): SYSTEM MENU OR PIANGARA AT YOU WITH THEIR KEY
onnx output (from sherpa-onnx): SISTER MENU OR PEN GIRL AT YOU WHERE THEIR KEY
(The pth and onnx outputs match in icefall, but the sherpa-onnx output is different.)

I used the streaming zipformer model (zipformer + pruned stateless transducer) from Hugging Face [https://huggingface.co/Zengwei/icefall-asr-librispeech-streaming-zipformer-2023-05-17/tree/main/exp] to test these audio files.

The command used to convert the model to ONNX is:

python3 ./zipformer/export-onnx-streaming.py \
  --exp-dir ./zipformer/exp \
  --tokens data/lang_bpe_500/tokens.txt \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --epoch 30 \
  --avg 1 \
  --use-averaged-model False

bhaswa avatar Oct 10 '23 11:10 bhaswa
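
To see exactly where two of the transcripts reported above diverge, a word-level diff can help. The following is a minimal sketch using only Python's standard `difflib`; the helper name `word_diff` and the constants are illustrative and not part of icefall or sherpa-onnx. The strings are the 2.wav outputs copied from the comment above.

```python
from difflib import SequenceMatcher


def word_diff(ref: str, hyp: str):
    """Return (tag, ref_words, hyp_words) for every region where the two
    transcripts differ. `tag` is one of 'replace', 'delete', 'insert'."""
    ref_words, hyp_words = ref.split(), hyp.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], hyp_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the mismatching regions
    ]


# Transcripts for 2.wav as reported in the thread.
PTH_2 = "SYSTEM MENU OR PIANGARA AT YOU WITH THEIR KEY"
SHERPA_2 = "SISTER MENU OR PEN GIRL AT YOU WHERE THEIR KEY"

for tag, ref_words, hyp_words in word_diff(PTH_2, SHERPA_2):
    print(tag, ref_words, "->", hyp_words)
```

For 2.wav this reports three substituted regions, the first of which is the opening word (SYSTEM vs. SISTER), so the mismatch is not limited to the first word for this file.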

Any update on this issue?

bhaswa avatar Oct 16 '23 06:10 bhaswa

Could the zero-initialized key/value cache in the encoder's initial states be causing this discrepancy between training and decoding?

kamirdin avatar Dec 17 '23 06:12 kamirdin
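
One possible way to probe the zero-cache hypothesis above is to prepend a short stretch of silence to the test file and decode it again, so that the streaming encoder has processed some frames before speech begins. This is only a diagnostic sketch: nobody in the thread has confirmed that it changes the result. It uses Python's standard `wave` module; the function name `prepend_silence`, the file names, and the default of 0.5 seconds are assumptions chosen for illustration.

```python
import wave


def prepend_silence(src_path: str, dst_path: str, seconds: float = 0.5) -> int:
    """Copy a PCM WAV file to dst_path with `seconds` of leading silence.

    Returns the number of silent frames that were prepended.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    n_pad = int(params.framerate * seconds)
    # 8-bit PCM is unsigned (silence is 0x80); wider samples are signed (0x00).
    silence_byte = b"\x80" if params.sampwidth == 1 else b"\x00"
    silence = silence_byte * (n_pad * params.sampwidth * params.nchannels)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        dst.writeframes(silence + frames)
    return n_pad


# Hypothetical usage with the attached file:
# prepend_silence("1.wav", "1_padded.wav", seconds=0.5)
```

If the first word is recognized correctly on the padded file but not on the original, that would be consistent with the initial states playing a role; if nothing changes, the cause likely lies elsewhere.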