
First part of text returned on some lines missing using vosk-model-ja-0.22

Open coastal45 opened this issue 1 year ago • 6 comments

I'm using the vosk integration in SubtitleEdit (current 3.6.7). Basically, it works very well. But I've found one issue.

When doing audio-to-text with the big Japanese model (vosk-model-ja-0.22), I noticed that the text for the first part of some spoken lines is missing, maybe once in 20 lines or so. The audio is clear enough in these cases, so I think vosk should have picked something up. It usually happens at the start of a spoken phrase. Text elsewhere is generally recognized well.

I reported this to SubtitleEdit as "First part of text returned on some lines missing in Japanese audio to text #6171" (https://github.com/SubtitleEdit/subtitleedit/issues/6171). The conclusion there was that it's a vosk issue, so I want to ask about it here.

I then noticed something on different samples exhibiting the same problem. The problem occurred when the full audio sample (~45 min.) was processed. But when I clipped ~30 sec. of audio around the problem lines and ran vosk on the clips with SubtitleEdit, all the text was recognized. You can see the difference below:

Clip 1, full ~45 min. audio processed: [image: ep03 clip 1243 full mkv waveform]

Clip 1, 30 sec. audio clip processed: [image: ep03 clip 1243 short mkv waveform]

Clip 2, full ~45 min. audio processed: [image: ep03 clip 1749 full mkv waveform]

Clip 2, 30 sec. audio clip processed: [image: ep03 clip 1749 short mkv waveform]

So I wonder whether the sample length has any bearing on the text result, or whether there is some other explanation. It would be helpful if these lines could be complete.
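One possible way to check whether the difference between the full run and the short clip also appears outside SubtitleEdit is to run both files directly through the Vosk Python API and compare the per-word timestamps. The following is only a minimal sketch under assumptions that are not from this thread: the audio has already been converted to mono 16-bit PCM WAV, the function names `transcribe_words` and `missing_from_full` are hypothetical helpers, and the 0.5 s matching tolerance is an arbitrary choice.

```python
import json
import wave


def transcribe_words(wav_path, model_path):
    """Return Vosk word entries (dicts with 'word', 'start', 'end') for a WAV file.

    Assumes a mono 16-bit PCM WAV; other formats need converting first.
    """
    # Imported lazily so the comparison helper below works without vosk installed.
    from vosk import Model, KaldiRecognizer

    words = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_path), wf.getframerate())
        rec.SetWords(True)  # request per-word timestamps in the JSON results
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            if rec.AcceptWaveform(data):
                # An utterance ended; collect its words.
                words += json.loads(rec.Result()).get("result", [])
        # Flush whatever is still buffered at end of file.
        words += json.loads(rec.FinalResult()).get("result", [])
    return words


def missing_from_full(full_words, clip_words, clip_offset, tolerance=0.5):
    """List words seen in the short clip that have no match in the full run.

    clip_offset: position of the clip start within the full file, in seconds.
    tolerance: allowed start-time difference in seconds (arbitrary assumption).
    """
    missing = []
    for w in clip_words:
        t = w["start"] + clip_offset  # map clip time onto the full-file timeline
        matched = any(
            f["word"] == w["word"] and abs(f["start"] - t) <= tolerance
            for f in full_words
        )
        if not matched:
            missing.append(w["word"])
    return missing
```

Calling `transcribe_words` once on the full file and once on the short clip, then passing both results to `missing_from_full` with the clip's start offset, would list the words that only the short run produced. If that list is non-empty for the problem lines, the behaviour is reproducible without SubtitleEdit.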

coastal45 avatar Sep 02 '22 19:09 coastal45

If you share the corresponding audio file I can take a look.

nshmyrev avatar Sep 02 '22 20:09 nshmyrev

> If you share the corresponding audio file I can take a look.

Thanks. The full 45 min. audio file was too big to attach, so I shortened it to the first 23 min. in "23 min clip.zip". Clip 1 (at 12:43) still shows the problem, but Clip 2 (at 17:49) is now correct. "audio 10 sec clips.zip" contains both 10 sec. clips. If you compare "23 min clip.aac" at 12:43 with "audio 10 sec clip 1243.aac" at 16 sec., you can see the difference.

Updated clips, sorry I forgot the srt files...

Attachments: 23 min clip.zip, audio 10 sec clips.zip

coastal45 avatar Sep 03 '22 00:09 coastal45

Here's another example. This one didn't change with the audio file length, but the first part of line 11 is clearly missing its text.

Screenshot from 2022-09-02 21-37-40

03 1223 30 sec clip.zip

coastal45 avatar Sep 03 '22 04:09 coastal45

@nshmyrev Sorry to bother, but any ideas on this yet? It happens often enough, and it's always the beginning of a phrase. It seems as though there is some delay before actual decoding occurs. Or maybe too much BGM causing problems?

I can add more examples if that's helpful.

coastal45 avatar Sep 10 '22 05:09 coastal45

I didn't have time to look yet, sorry. I'll try to check during next week.

nshmyrev avatar Sep 11 '22 18:09 nshmyrev

@nshmyrev Thanks. In another sample (about 45 min.) I found 5 more instances of this issue, but only one that I could reproduce with a short clip I can attach here. The same audio at differing sample lengths seems to make a difference somehow, but this one was unchanged regardless. It's on line 39 @ 01:11. Attachment: 2638.zip

coastal45 avatar Sep 13 '22 03:09 coastal45