
Question about replicating normalization: decode.py vs streaming_decode.py (streaming zipformer)

Open · AdolfVonKleist opened this issue 1 year ago • 34 comments

I'm looking for hints on how to correctly replicate, in streaming_decode.py, the exact normalization process that decode.py applies during simulated streaming with the streaming zipformer.

I have been getting some really great results with the streaming zipformer and large datasets lately. I have been primarily using the default decode.py with simulated streaming for evals because it is very fast, and the delta between simulated and true streaming reported in the RESULTS.md pages appears consistent and pretty small. However, I noticed a pretty big delta between what I get with decode.py and with sherpa-onnx; in particular, some utterances that decode with perfect or near-perfect accuracy in the decode.py eval later produce empty hypotheses in sherpa-onnx. I started debugging this thinking it was something I had done in sherpa (still a distinct possibility) and found that applying some volume normalization via ffmpeg to the input could have a significant impact on the same utterances.

Next I tried to go a little further back and run streaming_decode.py to compare any possible differences with the output I have been seeing from decode.py. Here I immediately ran into this audio.max assertion error when trying to decode my test set in icefall:

  • https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/streaming_decode.py#L346
        audio: np.ndarray = cut.load_audio()
        # audio.shape: (1, num_samples)
        assert len(audio.shape) == 2
        assert audio.shape[0] == 1, "Should be single channel"
        assert audio.dtype == np.float32, audio.dtype

        # The trained model is using normalized samples
        assert audio.max() <= 1, "Should be normalized to [-1, 1])"

I'm using the exact same cutset that performs very well with simulated streaming (via decode.py), but when I try to run it with streaming_decode.py it raises this assertion error. If I comment out the assertion, decoding runs and there is actually not much impact on WER (4.98% for simulated vs 5.04% for true streaming in streaming_decode.py); however, I'd like to plug this gap.

I spent some time reviewing the code but didn't find an obvious answer: how should I ensure that these same cuts are appropriately normalized for streaming_decode.py, so as not to fire this assertion? I'm also wondering how/if this might play into the larger gap I'm seeing between these evals and the performance I see with sherpa-onnx (roughly 3% worse).
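For concreteness, the kind of fix I have in mind is something like the following hypothetical helper (not from the recipe), applied to each cut's samples before decoding:

    import numpy as np

    def peak_normalize(audio: np.ndarray, eps: float = 1e-8) -> np.ndarray:
        """Scale a float32 waveform so that max(|x|) <= 1; audio that is
        already within [-1, 1] is returned unchanged."""
        peak = np.abs(audio).max()
        if peak > 1.0:
            audio = audio / (peak + eps)
        return audio.astype(np.float32)

    # audio = cut.load_audio()        # shape (1, num_samples), float32
    # audio = peak_normalize(audio)   # now satisfies audio.max() <= 1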

AdolfVonKleist · Apr 17 '23

Here I immediately ran into this audio.max assertion error when trying to decode my test set in icefall:

Are you using resampling?

csukuangfj · Apr 17 '23

@csukuangfj yes, my original training data (and test set) contain a mixture of different codecs and sample rates. In sherpa-onnx I also explicitly resample all test data to 16 kHz. Do decode.py and streaming_decode.py behave differently in this regard?

AdolfVonKleist · Apr 17 '23

Both decode.py and streaming_decode.py use lhotse for resampling, which uses torchaudio internally.

sherpa-onnx uses its own resampling: LinearResampler from Kaldi.

Maybe the difference comes from how resampling is implemented.

To verify that, could you find a wave file that is not correctly recognized by sherpa-onnx but is correctly recognized in icefall, manually resample it using torchaudio, and then use sherpa-onnx to decode the resampled file?
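For example, something along these lines (the filenames are placeholders):

    import torchaudio

    # Load the problematic file, resample it to 16 kHz with torchaudio,
    # and save the result so it can be decoded with sherpa-onnx.
    wave, sr = torchaudio.load("problem.wav")
    wave_16k = torchaudio.functional.resample(wave, orig_freq=sr, new_freq=16000)
    torchaudio.save("problem-16k.wav", wave_16k, sample_rate=16000)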

csukuangfj · Apr 17 '23

@csukuangfj thank you for the hint. This resolved some, but not all, of the differences. I will continue to debug and see if I can find anything else. There seems to be quite a lot of sensitivity here; I get slightly different results with each resampler (sox, ffmpeg with the swr resampler, ffmpeg with the soxr resampler, and torchaudio, which I believe uses either the sox_io or soundfile backend).

AdolfVonKleist · Apr 17 '23

I have encountered the same problem, and my training data is originally 16 kHz, with no resampling at all. Using simulated streaming (via decode.py with chunk-len=32) I get 5% WER, but when I use sherpa-online or sherpa-online-websocket-server the WER is 8%, an absolute gap of 3%, and most of the gap is deletion errors. What could be the reason for this? @csukuangfj @pingfengluo Is there any progress in your debugging? @AdolfVonKleist

brainbpe · Apr 21 '23

Where are the deletion errors? Are they mostly at the end of the utterance?

csukuangfj · Apr 21 '23

Mostly at the start of the utterance.

brainbpe · Apr 21 '23

Some short utterances produce empty hypotheses as well.

brainbpe · Apr 21 '23

@brainbpe I think I am having the same issue as you (with even worse WER differences). Did you find out where it was coming from? cc @AdolfVonKleist

ezerhouni · Jun 07 '23

@ezerhouni no, I haven't had enough time to fix this. BTW: I found this problem in sherpa, and you found it in sherpa-onnx, so I think this may be a bug in feature extraction or in model export. cc @csukuangfj

brainbpe · Jun 09 '23

I stopped receiving messages from these threads for some reason. I am still seeing this issue and have not yet been able to resolve it. I see there are a couple of other mentions of similar issues with sherpa:

  • https://github.com/k2-fsa/sherpa/issues/74 (resolved by modifying the export command) <-- maybe this is the place to look?
  • https://github.com/k2-fsa/sherpa/issues/401 (still unresolved)

@csukuangfj the deletions seem to take place at the beginning, and sometimes even in the middle. I also managed to somewhat consistently improve the results by playing with volume normalization, but this seems like an inappropriate approach.

@brainbpe @ezerhouni

AdolfVonKleist · Jun 23 '23

Hi, I am also finding the same issues as discussed above, but in sherpa-ncnn. Similarities include:

  • Deletions in the first few starting segments of the audio
  • Empty predictions in short utterances
  • Improvements after manually increasing the amplitude/volume of the audio (I increased it by +6 dB, which resolved a lot of the empty predictions)

Hoping to see a solution for this issue soon! :)

w11wo · Jun 26 '23

@w11wo have you gotten any further with this or come up with any other ideas? I see the exact same, quite consistent behavior: incorporating an ffmpeg command into the pipeline with volume=6dB (or whatever setting) consistently improves, but does not fully resolve, this issue. I still think it may be a minor difference between the behavior of lhotse during training and the behavior of sherpa-onnx/ncnn/etc. during inference. I also noticed that when resampling is employed, there is often a significant difference in output just between swr and soxr (the latter reproduces sox resampling, which I think is what lhotse is doing):

  • https://ffmpeg.org/ffmpeg-resampler.html

This, combined with the volume tweaking/normalization, seems to have a significant impact on the outcome. So far, however, I have still failed to 100% replicate the results in sherpa-onnx. The issue very consistently involves deletions and nothing else (as also reported in 3-4 other similar issues, e.g. by @ezerhouni and others). It would really be great to find a resolution, because if I discount this issue, the performance I'm currently getting out of sherpa/icefall/k2 is really, really amazing.
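Roughly, the preprocessing I have been experimenting with looks like this (a sketch wrapping ffmpeg from Python; the 6 dB gain and the soxr resampler are just the knobs mentioned above, not a recommendation, and ffmpeg must be built with libsoxr):

    import subprocess

    def preprocess(src: str, dst: str) -> None:
        """Resample to 16 kHz mono with the soxr resampler and apply a
        fixed gain before decoding."""
        subprocess.run(
            [
                "ffmpeg", "-y", "-i", src,
                "-af", "aresample=16000:resampler=soxr,volume=6dB",
                "-ac", "1",
                dst,
            ],
            check=True,
        )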

AdolfVonKleist · Jul 03 '23

Hi @AdolfVonKleist.

Unfortunately, I have not been able to figure this out. I provided a somewhat reproducible example here, which is still awaiting a response from the icefall team.

I don't think the example does it full justice, though, since I've seen worse deletion issues with my own private models. But I do think the underlying issue is the same, regardless of the model used.

Moreover, testing on multiple mobile devices replicates the same issue. At times we have to speak very loudly, right near the microphone of e.g. an iPad, to get it to recognize anything at all, while on other devices this isn't an issue.

w11wo · Jul 03 '23

@AdolfVonKleist Did you get a difference between decode.py and streaming_decode.py for the same audio?

yaozengwei · Jul 03 '23

I suggest using the latest zipformer recipe instead. In the old recipe, pruned_transducer_stateless7_streaming, there might be some issues when doing the chunk-wise forward for the first chunks, since we did not mask out the initial zero states.

yaozengwei · Jul 03 '23

@yaozengwei I only see the differences with sherpa-onnx (I previously observed similar issues with ncnn, but I moved away from it in the end, as the sherpa-onnx bindings tend to produce better RTFs in my experiments).

the latest zipformer recipe

I will do this and see if the issues disappear. Is there now support for the latest streaming zipformer in sherpa-onnx as well?

AdolfVonKleist · Jul 03 '23

Yes, there is.

Here are two pre-trained models of the latest streaming zipformer that you can play with in sherpa-onnx:

  • Chinese: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#pkufool-icefall-asr-zipformer-streaming-wenetspeech-20230615-chinese

  • English: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#csukuangfj-sherpa-onnx-streaming-zipformer-en-2023-06-26-english

csukuangfj · Jul 03 '23

Perhaps it is about sox commands normalizing the max amplitude to 1, which cannot be done online?


danpovey · Jul 03 '23

I have finished re-training a large model with the new zipformer and can confirm the observation here:

  • https://github.com/k2-fsa/icefall/issues/1119#issuecomment-1625081339

that this resolves the imbalance between insertions and deletions; the deletions issue appears to be resolved by a combination of this update and taking some care with padding in the case of the streaming and streaming ONNX models. The average accuracy is also improved. BTW, I continue to see streaming accuracy converge very closely to non-streaming when the chunk size (now chunk + left context in the new zipformer) is maxed out to 512-1024 for large corpora and long audio. It's a really simple alternative to chunking and realigning for long-audio processing and might be worth considering for some of the users looking into that.

I'm still not 100% satisfied that I've sussed out the normalization, but for now upgrading to the new zipformer is more than enough. Thanks for all the feedback on this one, and for the great work as always.

AdolfVonKleist · Jul 18 '23

I suggest using the latest zipformer recipe instead. In the old recipe, pruned_transducer_stateless7_streaming, there might be some issues when doing the chunk-wise forward for the first chunks, since we did not mask out the initial zero states.

How can I fix this if I want to use pruned_transducer_stateless7_streaming for model export?

LoganLiu66 · Aug 31 '23

How can I fix this if I want to use pruned_transducer_stateless7_streaming for model export?

Are you using sherpa or sherpa-ncnn or sherpa-onnx?

csukuangfj · Aug 31 '23

Are you using sherpa or sherpa-ncnn or sherpa-onnx?

No. When I use streaming_decode.py for decoding, I get a worse result than with decode.py (about 2% absolute on my own data). I also tried exporting the ONNX model using export-onnx.py and testing with onnx_pretrained.py; it is about 3% absolute worse than decode.py.

LoganLiu66 · Aug 31 '23

How do you invoke decode.py?

csukuangfj · Aug 31 '23

python ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 999 \
--avg 1 \
--use-averaged-model 0 \
--beam-size 4 \
--exp-dir ${exp_dir} \
--lang-dir ${lang_dir} \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method greedy_search

LoganLiu66 · Aug 31 '23

@yaozengwei

Could you take a look?

Does decode.py by default run inference in a non-streaming way with a streaming model?

csukuangfj · Aug 31 '23

No. When I use streaming_decode.py for decoding, I get a worse result than with decode.py (about 2% absolute on my own data). I also tried exporting the ONNX model using export-onnx.py and testing with onnx_pretrained.py; it is about 3% absolute worse than decode.py.

Is there any clear error pattern when using streaming_decode.py? If there are more tail deletions, you could try a larger tail_pad_len, e.g., double decode_chunk_len, in https://github.com/k2-fsa/icefall/blob/8fcadb68a7cde093069e89830832e1ac728338fe/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/streaming_decode.py#L353
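For illustration, a rough sketch of what a larger tail pad means (the names and the padding value here are assumptions; the actual code at the linked line may differ):

    import math
    import torch

    LOG_EPS = math.log(1e-10)  # assumed padding value for log-mel features

    def pad_tail(feature: torch.Tensor, tail_pad_len: int) -> torch.Tensor:
        """Append tail_pad_len frames of log-epsilon so the final speech
        frames are flushed through the chunk-wise encoder."""
        tail = torch.full((tail_pad_len, feature.size(1)), LOG_EPS)
        return torch.cat([feature, tail], dim=0)

    decode_chunk_len = 32               # as in --decode-chunk-len
    feature = torch.randn(500, 80)      # stand-in for one utterance's fbank
    feature = pad_tail(feature, tail_pad_len=2 * decode_chunk_len)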

yaozengwei · Aug 31 '23

It seems to be an overall deterioration in performance.

decode.py:

%WER = 5.48
Errors: 46 insertions, 99 deletions, 231 substitutions, over 6862 reference words (6532 correct)

streaming_decode.py:

%WER = 8.48
Errors: 101 insertions, 178 deletions, 303 substitutions, over 6862 reference words (6381 correct)

Moreover, I tested decode.py and streaming_decode.py on LibriSpeech and got the same WER. But when I switch to my own dataset, I get the results above.

I have changed https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/decode_stream.py#L85 to

self.hyp = [-1] * (params.context_size - 1) + [params.blank_id]

because I got 40+% WER when it was initialized with

self.hyp = [params.blank_id] * params.context_size
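To make the difference concrete, with typical recipe defaults (context_size = 2 and blank_id = 0 are assumptions for illustration):

    context_size = 2
    blank_id = 0

    hyp_new = [-1] * (context_size - 1) + [blank_id]  # [-1, 0]
    hyp_old = [blank_id] * context_size               # [0, 0]
    # The stateless decoder conditions on the last context_size tokens,
    # so seeding it with [-1, 0] instead of [0, 0] changes the initial
    # decoder context, which is presumably why the two initializations
    # give such different WERs here.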

LoganLiu66 · Aug 31 '23

Please have a look at the errs-* file and see if there are any error patterns.

csukuangfj · Aug 31 '23

Please have a look at the errs-* file and see if there are any error patterns.

It doesn't seem to have an error pattern.

LoganLiu66 · Aug 31 '23