Incomplete transcription for longer audio in streaming ASR
When longer audio (up to 40 seconds) is used for inference, transcripts are generated successfully only for the first 5-6 seconds; the rest of the audio yields no words, or at most one word, in the final transcript.
Can anybody help me understand which parameters should be tweaked to address this issue?
I have already experimented with decoder_text_length_limit and encoded_feat_length_limit, but no luck.
If you're using asr.sh for training, there's a parameter max_wav_duration that's set to a quite low default value imho (only 20 seconds). My understanding is that this cuts any examples longer than that from your training set, so the model never really learns what to do with longer sequences if the value is too low. I've successfully trained my models with --max_wav_duration 120. You may also need to merge some of your training utterances so that the training set actually contains a few long ones (see the sketch below). Memory consumption during training might be higher, but you can just lower the batch size accordingly if you're getting OOMs.
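A minimal sketch of the utterance-merging idea, assuming plain wav files are accessible; the file names, gap length, and pairing strategy are hypothetical, and you would also need to concatenate the corresponding transcripts in your data directory.

```python
# Sketch: merge consecutive short training utterances into longer ones so the
# training set contains examples closer to the streaming lengths you care about.
# Paths and grouping are hypothetical; adapt to your own wav.scp / text layout.
import numpy as np
import soundfile as sf

def merge_utterances(wav_paths, out_path, gap_sec=0.3):
    """Concatenate several wav files with a short silent gap between them."""
    pieces = []
    sample_rate = None
    for path in wav_paths:
        audio, sr = sf.read(path)
        if sample_rate is None:
            sample_rate = sr
        assert sr == sample_rate, "all utterances must share one sample rate"
        pieces.append(audio)
        pieces.append(np.zeros(int(gap_sec * sr), dtype=audio.dtype))  # silence gap
    merged = np.concatenate(pieces[:-1])  # drop the trailing gap
    sf.write(out_path, merged, sample_rate)
    return len(merged) / sample_rate  # merged duration in seconds

# Example (hypothetical file names): build one longer utterance from three short ones.
# duration = merge_utterances(
#     ["utt_0001.wav", "utt_0002.wav", "utt_0003.wav"], "merged_0001.wav"
# )
```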
Even with max_wav_duration 120, I've noticed that after about 2 minutes the output is severely degraded and basically unusable. That might be the time/positional embeddings going past what the model has seen (?), but I can only speculate. Anyway, the solution for that is to finalize the utterance once in a while, i.e. run end-pointing; then everything resets and you can decode another block up to your maximum length.
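A minimal sketch of that periodic-finalization loop, assuming the espnet2 Speech2TextStreaming interface as used in the ESPnet streaming demo; the is_final flag, the result-tuple format, the model paths, the chunk size, and the 30-second finalization interval are all assumptions to verify against your setup.

```python
# Sketch only: decode a long recording in chunks and force a "final" call every
# ~30 s so the decoder state resets and no single utterance grows far beyond
# what the model saw in training. Verify the Speech2TextStreaming arguments and
# return format against your espnet2 version.
import soundfile as sf
from espnet2.bin.asr_inference_streaming import Speech2TextStreaming

speech2text = Speech2TextStreaming(
    asr_train_config="exp/asr_train/config.yaml",   # hypothetical paths
    asr_model_file="exp/asr_train/valid.acc.ave.pth",
)

audio, rate = sf.read("long_recording.wav")          # hypothetical input file
chunk = int(0.5 * rate)                              # feed 0.5 s blocks
finalize_every = int(30 * rate)                      # end-point roughly every 30 s

transcript = []
since_final = 0
for start in range(0, len(audio), chunk):
    block = audio[start:start + chunk]
    since_final += len(block)
    last_block = start + chunk >= len(audio)
    is_final = last_block or since_final >= finalize_every
    results = speech2text(speech=block, is_final=is_final)
    if is_final:
        if results:
            transcript.append(results[0][0])  # best hypothesis text (assumed format)
        since_final = 0  # decoder state is reset after a final call

print(" ".join(transcript))
```

In practice you would trigger is_final on detected silence (real end-pointing) rather than on a fixed timer, but the reset behaviour is the same.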
With that I was able to successfully run my German model with end-to-end punctuation on longer audio as well; see https://github.com/speechcatcher-asr/speechcatcher
Thank you for sharing the details. We will try the suggested changes and rerun the experiments.
This issue is stale because it has been open for 45 days with no activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue is closed. Please re-open if needed.