faster-whisper

Timestamps do not match well when transcribing two mono tracks split from a stereo file and merging the results

Open junchen6072 opened this issue 1 year ago • 2 comments

It seems that the timestamps from Whisper are not very accurate, so transcribing the two mono tracks separately and then reassembling the original conversation based on timestamps isn't very reliable.

It's easy to work around when a word's duration is long (e.g. > 3 s), but there are some cases where the word's duration is < 0.5 s, especially when Whisper adds extra words.

Not sure if there's a good algorithm for this problem.
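One possible heuristic, as a minimal sketch rather than a known fix: interleave the word lists from the two channels by start time, but keep the current speaker's turn open unless the other channel's word starts clearly earlier. The names `Word`, `Turn`, `merge_channels`, and the `tolerance` value are hypothetical and not part of faster-whisper; the sketch only assumes each channel's transcript has been reduced to words with `start`, `end`, and text (for example from a transcription run with word-level timestamps enabled).

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Word:
    start: float  # seconds from the beginning of the audio
    end: float
    text: str


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def merge_channels(
    left: Sequence[Word],
    right: Sequence[Word],
    tolerance: float = 0.3,  # assumed timestamp jitter in seconds; tune per dataset
    names: Sequence[str] = ("left", "right"),
) -> List[Turn]:
    """Interleave two per-channel word lists into speaker turns.

    Words are consumed in start-time order, except that the speaker who
    owns the open turn keeps it when the other channel's next word is
    ahead by no more than `tolerance` seconds.
    """
    queues = [list(left), list(right)]
    pos = [0, 0]
    turns: List[Turn] = []
    current = None  # index of the channel that owns the open turn

    while pos[0] < len(queues[0]) or pos[1] < len(queues[1]):
        # Next unread word on each channel, or None if that channel is done.
        heads = [q[p] if p < len(q) else None for q, p in zip(queues, pos)]
        if heads[0] is None:
            pick = 1
        elif heads[1] is None:
            pick = 0
        else:
            pick = 0 if heads[0].start <= heads[1].start else 1
            # Hysteresis: do not switch speakers over a small timestamp gap.
            if current is not None and pick != current:
                if heads[current].start - heads[pick].start <= tolerance:
                    pick = current

        word = heads[pick]
        pos[pick] += 1
        if pick == current:
            turns[-1].end = max(turns[-1].end, word.end)
            turns[-1].text += " " + word.text
        else:
            turns.append(Turn(names[pick], word.start, word.end, word.text))
            current = pick
    return turns


if __name__ == "__main__":
    left = [Word(0.0, 0.4, "hello"), Word(0.5, 0.9, "there"), Word(3.0, 3.4, "fine")]
    right = [Word(0.8, 1.2, "hi"), Word(1.3, 1.8, "how"), Word(1.9, 2.2, "are"), Word(2.3, 2.6, "you")]
    for turn in merge_channels(left, right):
        print(f"{turn.speaker}: {turn.text}")
```

Because a held-back word is emitted after the current speaker's word, turn start times in the output are not guaranteed to be strictly increasing. A larger `tolerance` gives fewer, longer turns; setting it to 0 reduces the function to a plain merge by start time.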

junchen6072 avatar Apr 14 '23 00:04 junchen6072

I will try speaker prediction on the unsplit audio instead.

junchen6072 avatar Apr 14 '23 05:04 junchen6072

That doesn't work well. I probably still need to split the channels and improve the timestamp matching -.-

junchen6072 avatar Apr 14 '23 05:04 junchen6072

I'm doing the same thing right now. Did you try this method?

zyh3826 avatar Apr 28 '23 08:04 zyh3826

I tested this for Chinese. WhisperX does not perform very well there, because I find that pyannote/speaker-diarization-2.1 works better than pyannote/speaker-diarization-3.1, and WhisperX uses the latter.

liyaodev avatar Jan 20 '24 02:01 liyaodev