faster-whisper: timestamps do not match well when transcribing two mono tracks split from stereo and assembling them back


Open junchen6072 opened this issue 1 year ago • 2 comments

It seems that the timestamps from Whisper are not very accurate, so when I transcribe the two mono tracks separately, reassembling the original conversation based on those timestamps isn't very reliable.

It's easy to hack around when a word's duration is long (e.g. > 3s), but there are cases where the word's duration is < 0.5s, especially when Whisper adds extra words.

Not sure if there's a good algorithm for this problem.
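
One heuristic that may reduce the impact of jittery word timestamps is to avoid interleaving word by word: first group each channel's words into utterances wherever the pause between words is short, then order whole utterances by start time. The sketch below is only an illustration under stated assumptions, not a known fix. `Word`, `Turn`, `group_utterances`, `assemble`, and the `max_gap` threshold of 0.8s are all hypothetical names and values; the word lists are assumed to come from transcribing each mono track with word-level timestamps enabled.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    # Hypothetical container for one word with its timestamps (seconds).
    start: float
    end: float
    text: str


@dataclass
class Turn:
    # Hypothetical container for one utterance from a single channel.
    speaker: str
    start: float
    end: float
    text: str


def group_utterances(words: Iterable[Word], speaker: str, max_gap: float = 0.8) -> List[Turn]:
    """Join consecutive words of one channel while the pause stays <= max_gap."""
    turns: List[Turn] = []
    for w in sorted(words, key=lambda w: w.start):
        text = w.text.strip()
        if turns and w.start - turns[-1].end <= max_gap:
            # Short pause: extend the current utterance.
            turns[-1].end = max(turns[-1].end, w.end)
            turns[-1].text += " " + text
        else:
            # Long pause (or first word): start a new utterance.
            turns.append(Turn(speaker, w.start, w.end, text))
    return turns


def assemble(left: Iterable[Word], right: Iterable[Word], max_gap: float = 0.8) -> List[Turn]:
    """Interleave the utterances of both channels by start time."""
    turns = group_utterances(left, "left", max_gap) + group_utterances(right, "right", max_gap)
    return sorted(turns, key=lambda t: (t.start, t.end))
```

Because ordering is decided per utterance rather than per word, a single short word with a slightly wrong timestamp can no longer split the other speaker's sentence. Overlapping speech and the right value for `max_gap` still need tuning on real data.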

Apr 14 '23 00:04 junchen6072

Will try predicting the speaker instead of splitting the channels.

Apr 14 '23 05:04 junchen6072

That doesn't work well. I probably still need to split the channels and improve the timestamp matching -.-

Apr 14 '23 05:04 junchen6072

I'm doing the same thing right now. Did you try this method?

Apr 28 '23 08:04 zyh3826

I ran a test for Chinese. WhisperX is not very good, because I find that pyannote/speaker-diarization-2.1 works better than pyannote/speaker-diarization-3.1, and WhisperX uses the latter.

Jan 20 '24 02:01 liyaodev
