faster-whisper
Timestamps don't match well when transcribing two mono audio files split from stereo and assembling them back
It seems that the timestamps from Whisper are not very accurate, so transcribing the two mono audio files separately and then reassembling the original conversation based on timestamps isn't very reliable.
It's easy to hack around when a word's duration is long (e.g. > 3 s), but there are cases where a word's duration is < 0.5 s, especially since Whisper can add extra words.
Not sure if there's a good algorithm for this problem.
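One heuristic that might help, as a rough sketch rather than a known-good fix: instead of interleaving individual words, first group each channel's words into utterances by pause length, then order whole utterances by start time. Short words with jittery timestamps then stay attached to their neighbours. The sketch assumes each channel's words are already available as `(start, end, text)` tuples (for example from faster-whisper with `word_timestamps=True`); `Utterance`, `group_words`, `merge_channels`, and the `max_pause` threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (start_seconds, end_seconds, text) for one word; an assumed input shape.
WordTuple = Tuple[float, float, str]


@dataclass
class Utterance:
    speaker: str
    start: float
    end: float
    text: str


def group_words(words: List[WordTuple], speaker: str,
                max_pause: float = 0.6, sep: str = " ") -> List[Utterance]:
    """Group one channel's words into utterances split at long pauses."""
    utterances: List[Utterance] = []
    for start, end, text in sorted(words):
        text = text.strip()
        if not text:
            continue
        last = utterances[-1] if utterances else None
        if last is not None and start - last.end <= max_pause:
            # Pause is short: extend the current utterance.
            last.end = max(last.end, end)
            last.text += sep + text
        else:
            # Pause is long (or this is the first word): start a new utterance.
            utterances.append(Utterance(speaker, start, end, text))
    return utterances


def merge_channels(left: List[WordTuple], right: List[WordTuple],
                   max_pause: float = 0.6) -> List[Utterance]:
    """Order utterances from both channels by start time."""
    merged = (group_words(left, "left", max_pause)
              + group_words(right, "right", max_pause))
    return sorted(merged, key=lambda u: (u.start, u.end))


# Made-up timestamps, for illustration only.
left_words = [(0.0, 0.4, "hello"), (0.5, 0.9, "there"), (3.0, 3.3, "fine")]
right_words = [(0.85, 1.2, "hi"), (1.3, 1.6, "how"),
               (1.7, 2.0, "are"), (2.1, 2.4, "you")]

for u in merge_channels(left_words, right_words):
    print(f"[{u.start:.2f}-{u.end:.2f}] {u.speaker}: {u.text}")
```

The `max_pause` value needs tuning per recording, and this does not handle overlapping speech or hallucinated words; for Chinese, passing `sep=""` to `group_words` is probably more appropriate than joining with spaces.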
I'll try predicting the speaker instead of splitting the channels.
That doesn't work well. I probably still need to split and improve the timestamp matching -.-
I'm doing the same thing right now. Did you try this method?
I tested this for Chinese, and WhisperX is not very good. I find that pyannote/speaker-diarization-2.1 works better than pyannote/speaker-diarization-3.1, and WhisperX uses the latter.