vosk-api First part of text returned on some lines missing using vosk-model-ja-0.22

Open coastal45 opened this issue 1 year ago • 6 comments

I'm using the vosk integration in SubtitleEdit (current 3.6.7). Basically, it works very well. But I've found one issue.

I noticed that when doing audio-to-text with the large Japanese model (vosk-model-ja-0.22), the text for the first part of some spoken lines is missing, maybe once in 20 lines or so. The audio is clear enough in these cases, so I think vosk should have picked something up. It usually happens in the first part of a spoken phrase. Text elsewhere is generally recognized well.

I reported this to SubtitleEdit as "First part of text returned on some lines missing in Japanese audio to text #6171" (https://github.com/SubtitleEdit/subtitleedit/issues/6171). The conclusion there was that it's a vosk issue, so I want to ask about it here.

Then I noticed something in other samples exhibiting the same problem. The problem occurred when the full audio sample (~45 min.) was processed. But when I clipped ~30 sec. of audio around the problem lines and ran vosk on the clips with SE, all the text was recognized. You can see the difference below:

Clip 1, full ~45 min. audio processed: [image: ep03 clip 1243 full mkv waveform]

Clip 1, 30 sec. audio clip processed: [image: ep03 clip 1243 short mkv waveform]

Clip 2, full ~45 min. audio processed: [image: ep03 clip 1749 full mkv waveform]

Clip 2, 30 sec. audio clip processed: [image: ep03 clip 1749 short mkv waveform]

So I wonder if the sample length has any bearing on the text result. Or is there some other explanation? It would be helpful if these lines could be complete.
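For anyone who wants to repeat the full-file versus short-clip comparison outside SubtitleEdit, here is a minimal sketch of cutting a window around a problem timestamp. It only builds an ffmpeg command line; the file names `ep03.mkv` and `clip_1243.wav` and the helper names are hypothetical, and the 16 kHz mono output is an assumption about what the recognizer is usually fed, not something confirmed in this thread.

```python
import shlex

def mmss_to_seconds(stamp):
    """Convert an 'MM:SS' timestamp such as '12:43' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)

def clip_command(src, dst, center_s, window_s=30.0):
    """Build an ffmpeg command that cuts window_s seconds centred on center_s.

    The output is resampled to 16 kHz mono (an assumption, see the note above).
    """
    start = max(0.0, center_s - window_s / 2)  # never seek before the file start
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",      # seek to the start of the window
        "-t", f"{window_s:.3f}",    # keep only window_s seconds
        "-i", src,
        "-ac", "1", "-ar", "16000",
        dst,
    ]

# Hypothetical file names; 12:43 is the position of Clip 1 in the long file.
cmd = clip_command("ep03.mkv", "clip_1243.wav", mmss_to_seconds("12:43"))
print(shlex.join(cmd))
```

Running the printed command on both a long file and a short one, then decoding each, gives two transcripts of the same speech to compare.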

Sep 02 '22 19:09 coastal45

If you share the corresponding audio file I can take a look.

Sep 02 '22 20:09 nshmyrev

> If you share the corresponding audio file I can take a look.

Thanks. The full 45 min. audio file was too big to attach, so I shortened it to the first 23 min. in "23 min clip.zip". Clip 1 (at 12:43) still shows the problem, but Clip 2 (at 17:49) is now correct. "audio 10 sec clips.zip" contains both 10 sec. clips. If you compare "23 min clip.aac" at 12:43 with "audio 10 sec clip 1243.aac" at 16 sec., you can see the difference.

Updated clips, sorry I forgot the srt files...

Attachments: 23 min clip.zip, audio 10 sec clips.zip

Sep 03 '22 00:09 coastal45

Here's another example. This one didn't change with the audio file length, but clearly the first part of line 11 is missing any text.

Screenshot from 2022-09-02 21-37-40

03 1223 30 sec clip.zip

Sep 03 '22 04:09 coastal45

@nshmyrev Sorry to bother, but any ideas on this yet? It happens often enough, and it's always the beginning of a phrase. It seems as though there is some delay before actual decoding occurs. Or maybe too much BGM causing problems?

I can add more examples if that's helpful.
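One rough way to locate candidate spots without checking every subtitle by hand is to look for long stretches with no recognized words. The sketch below assumes the recognizer was run with word timestamps enabled and that each result is a JSON string with a `result` list of `start`/`end`/`word` entries; the function name, the threshold, and the sample data are made up for illustration. A gap can also be plain silence or music, so this only narrows down where to listen.

```python
import json

def find_gaps(result_jsons, min_gap_s=2.0):
    """Return (gap_start, gap_end, next_word) for every pause between
    consecutive recognized words that lasts at least min_gap_s seconds."""
    words = []
    for raw in result_jsons:
        # Segments with no recognized words have no "result" key.
        words.extend(json.loads(raw).get("result", []))
    words.sort(key=lambda w: w["start"])

    gaps = []
    for prev, cur in zip(words, words[1:]):
        if cur["start"] - prev["end"] >= min_gap_s:
            gaps.append((prev["end"], cur["start"], cur["word"]))
    return gaps

# Made-up sample data standing in for two recognizer results.
sample = [
    json.dumps({"result": [
        {"conf": 1.0, "start": 0.5, "end": 1.0, "word": "a"},
        {"conf": 1.0, "start": 1.0, "end": 1.5, "word": "b"},
    ], "text": "a b"}),
    json.dumps({"result": [
        {"conf": 1.0, "start": 5.0, "end": 5.5, "word": "c"},
    ], "text": "c"}),
]
gaps = find_gaps(sample)
print(gaps)
```

Comparing the gaps from the long file against the same region in a short clip would show whether words appear in one run but not the other.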

Sep 10 '22 05:09 coastal45

I didn't have time to look yet, sorry. I'll try to check during next week.

Sep 11 '22 18:09 nshmyrev

@nshmyrev Thanks. In another sample (about 45 min.) I found 5 more instances of this issue, but only one that I could reproduce with a short clip I can attach here. The same audio processed at different lengths seems to make a difference somehow, but this one was unchanged regardless. It's on line 39 @ 01:11.

2638.zip

Sep 13 '22 03:09 coastal45
