vosk-api
First part of text returned on some lines missing using vosk-model-ja-0.22
I'm using the Vosk integration in SubtitleEdit (currently 3.6.7). It generally works very well, but I've found one issue.
I noticed that when doing audio-to-text with the large Japanese model (vosk-model-ja-0.22), the text for the first part of some spoken lines is missing, roughly once in 20 lines. The audio is clear enough in these cases, so I think Vosk should have picked something up. The missing text is usually at the start of a spoken phrase; text elsewhere is generally recognized well.
I reported this to SubtitleEdit at https://github.com/SubtitleEdit/subtitleedit/issues/6171 ("First part of text returned on some lines missing in Japanese audio to text #6171"). The conclusion there was that it's a Vosk issue, so I'm asking about it here.
Then I noticed something in other samples that show the same problem. The problem occurred when the full audio sample (~45 min.) was processed. But when I clipped ~30 sec. of audio around the problem lines and ran Vosk on the clip with SE, all the text was recognized. You can see the difference below:
Clip 1:
full ~45 min. audio processed:
30 sec. audio clip processed:
Clip 2:
full ~45 min. audio processed:
30 sec. audio clip processed:
So I wonder whether the sample length has any bearing on the text result, or whether there is some other explanation. It would be helpful if these lines could be complete.
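For anyone who wants to check the full-file vs. short-clip difference outside SubtitleEdit, here is a rough sketch, not how SE itself calls Vosk. It assumes the Python `vosk` package and audio already converted to 16-bit mono WAV. `transcribe_words` collects Vosk's per-word timestamps, and `missing_in_full` (a hypothetical helper, as are the `clip_offset` and `tolerance` parameters) lists words that the clip run found but the full run did not find at about the same time.

```python
import json
import wave


def transcribe_words(wav_path, model_path):
    """Return Vosk word entries (dicts with 'word', 'start', 'end') for a 16-bit mono WAV."""
    # Imported lazily so the comparison helper below works without vosk installed.
    from vosk import Model, KaldiRecognizer

    words = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_path), wf.getframerate())
        rec.SetWords(True)  # ask for per-word timestamps in the JSON results
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            if rec.AcceptWaveform(data):
                words.extend(json.loads(rec.Result()).get("result", []))
        words.extend(json.loads(rec.FinalResult()).get("result", []))
    return words


def missing_in_full(full_words, clip_words, clip_offset, tolerance=0.5):
    """List clip words that have no word in the full run starting at about the same time.

    clip_offset is where the clip starts inside the full audio, in seconds.
    tolerance is an assumed matching window, in seconds; tune it for your audio.
    """
    missing = []
    for cw in clip_words:
        t = cw["start"] + clip_offset  # clip time mapped onto the full-audio timeline
        if not any(abs(fw["start"] - t) <= tolerance for fw in full_words):
            missing.append((round(t, 2), cw["word"]))
    return missing
```

Running `transcribe_words` once on the full file and once on the clip, then passing both lists to `missing_in_full` with the clip's start time, should show whether the dropped words are always at the start of a phrase.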
If you share the corresponding audio file I can take a look.
Thanks. The full 45 min. audio file was too big to attach, so I shortened it to the first 23 min. in "23 min clip.zip". Clip 1 (at 12:43) still shows the problem, but Clip 2 (at 17:49) is now correct. "audio 10 sec clips.zip" contains both 10 sec. clips. If you compare "23 min clip.aac" at 12:43 and "audio 10 sec clip 1243.aac" at 16 sec., you can see the difference.
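In case it helps with reproducing, this is a small sketch of how a short clip around a timestamp could be cut with ffmpeg, not the way the attached clips were actually made. It assumes ffmpeg is on the PATH. The helper names, the `before` and `length` defaults, and the output file name are made up for illustration. The output is 16 kHz mono 16-bit WAV.

```python
def to_seconds(stamp):
    """Convert 'MM:SS' or 'HH:MM:SS' to whole seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def clip_command(src, dst, stamp, before=5, length=30):
    """Build an ffmpeg command cutting `length` seconds, starting `before` seconds ahead of `stamp`."""
    start = max(0, to_seconds(stamp) - before)
    return [
        "ffmpeg", "-y",
        "-ss", str(start),   # seek to the clip start in the input
        "-t", str(length),   # clip duration in seconds
        "-i", src,
        "-ac", "1",          # mono
        "-ar", "16000",      # 16 kHz sample rate
        "-c:a", "pcm_s16le", # 16-bit PCM WAV
        dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, for example with `clip_command("23 min clip.aac", "clip_1243.wav", "12:43")`.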
Updated clips, sorry I forgot the srt files...
Here's another example. This one didn't change with the audio file length, but the first part of line 11 is clearly missing its text.
@nshmyrev Sorry to bother, but any ideas on this yet? It happens often enough, and it's always the beginning of a phrase. It seems as though there is some delay before actual decoding occurs. Or maybe too much BGM causing problems?
I can add more examples if that's helpful.
I didn't have time to look yet, sorry. I'll try to check during next week.
@nshmyrev Thanks. In another sample (about 45 min.) I found 5 more examples of this issue, but I could reproduce only one with a short clip that I can attach here. For the others, the same audio with differing sample lengths seems to make a difference somehow. This one was unchanged regardless of length. It's on line 39 @ 01:11. 2638.zip