sherpa-onnx
Issue with the first word in zipformer2
I found that in many cases the first word of an audio file is not decoded properly. But when I use the .pth model in icefall instead of the .onnx model, the word is decoded properly.
Are you using the latest icefall to export the model, and the latest sherpa-onnx for testing?
Yes. I updated both icefall and sherpa-onnx. Still facing the same issue.
Any update on this?
Are you able to share the test wave file?
I observed this behavior in a model that I trained on custom data. The same audio might not behave the same way in another model trained on a different dataset.
Is there any pre-trained model (.pth) available? I will test on that model and share a test wave file.
Yes, please find the models in the RESULTS.md of each recipe in icefall, e.g., librispeech. For each experiment, there is a link to the huggingface repo containing pre-trained models.
I am attaching two audio files here.
1.wav
pth output (from icefall): GOD'S THE OLD SCHOOL STROKE OUT
onnx output (from icefall): GUIDABLE SCHOOL STROKE OUT
onnx output (from sherpa-onnx): GO AS TO WALK SCHOOL STROKE OUT
(None of the outputs match.)
2.wav
pth output (from icefall): SYSTEM MENU OR PIANGARA AT YOU WITH THEIR KEY
onnx output (from icefall): SYSTEM MENU OR PIANGARA AT YOU WITH THEIR KEY
onnx output (from sherpa-onnx): SISTER MENU OR PEN GIRL AT YOU WHERE THEIR KEY
(The pth and onnx outputs match in icefall, but the sherpa-onnx output is different.)
I used the streaming zipformer model (zipformer + pruned stateless transducer) from https://huggingface.co/Zengwei/icefall-asr-librispeech-streaming-zipformer-2023-05-17/tree/main/exp on Hugging Face to test these audio files.
The command to export the model to ONNX is:
python3 ./zipformer/export-onnx-streaming.py \
  --exp-dir ./zipformer/exp \
  --tokens data/lang_bpe_500/tokens.txt \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --epoch 30 \
  --avg 1 \
  --use-averaged-model False
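For reference, this is roughly how I decode a wave file with the exported ONNX model through the sherpa-onnx Python API to compare against the icefall output. It is only a minimal sketch: the encoder/decoder/joiner file names below are assumptions based on the export command above (adjust them to whatever the script actually wrote into ./zipformer/exp), and OnlineRecognizer.from_transducer requires a reasonably recent sherpa-onnx.

import wave
import numpy as np
import sherpa_onnx

# Assumed output file names from export-onnx-streaming.py; adjust as needed.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="data/lang_bpe_500/tokens.txt",
    encoder="zipformer/exp/encoder-epoch-30-avg-1.onnx",
    decoder="zipformer/exp/decoder-epoch-30-avg-1.onnx",
    joiner="zipformer/exp/joiner-epoch-30-avg-1.onnx",
    num_threads=1,
    decoding_method="greedy_search",
)

# Assumes 1.wav is 16 kHz, mono, 16-bit PCM.
with wave.open("1.wav") as f:
    sample_rate = f.getframerate()
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0  # normalize to [-1, 1]

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
# Trailing silence so the last frames are flushed through the streaming model.
stream.accept_waveform(sample_rate, np.zeros(int(0.5 * sample_rate), dtype=np.float32))
stream.input_finished()

while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)

print(recognizer.get_result(stream))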
Any update on this issue?
Could the zero key/value caches in the encoder's initial states be causing this discrepancy between training and decoding?
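One rough way to probe that hypothesis (this is only a diagnostic idea, not a confirmed fix): prepend a short stretch of silence to the waveform so the first spoken word no longer sits directly against the zero-initialized encoder states. A sketch, reusing the recognizer, samples, and sample_rate from the decoding example above:

import numpy as np

# 0.5 s of leading silence pushes the real speech away from the zero initial states.
leading_silence = np.zeros(int(0.5 * sample_rate), dtype=np.float32)

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, np.concatenate([leading_silence, samples]))
stream.accept_waveform(sample_rate, np.zeros(int(0.5 * sample_rate), dtype=np.float32))
stream.input_finished()

while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)

# If the first word is now recognized correctly, the zero-initialized caches are
# a plausible source of the train/decode mismatch at the start of the utterance.
print(recognizer.get_result(stream))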