whisperX
Still some incoherent timestamps in the srt file
@m-bain Thank you so much for your amazing work. There are still some incoherent timestamps in the word-level srt files (139 files out of 360 on my data). I'm about to write a Python script to parse all the srt files and fix the affected timestamps, but maybe there is a way to avoid them in the first place? They make it hard to convert the files to TextGrid (I use https://github.com/rctatman/SrtToTextgrid ). Besides that, WhisperX is working so well!!
What do you mean by incoherent timestamps? Could you be more specific, e.g. with an example?
He might be talking about the incomplete diarize option. Whisper itself is incredibly good at timestamps and is basically a complete package (except for a few languages), but the diarize option does not take cross-talking speakers into account. I myself just doubled the length of the audio files with ffmpeg ('-filter:a', 'atempo=0.5'), and the diarization is then usually accurate to within 1 second when the speaker changes, and can switch between cross-talking people accurately within 3 seconds. So short answers like 'yes' and 'no' end up under the speaker who asked the question instead of the person answering. That makes sense, but it could be fixed almost completely, because Whisper detects a new sentence by itself far better than Pyannote does: just treat every one- or two-word sentence as a completely new speaker, then manually search for that speaker and replace it with the name, and of course at the end divide the times in the output file by 2 with something like the 'Datetime' package.
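A minimal sketch of the "slow down, then divide the times by 2" workaround described above. It assumes standard `HH:MM:SS,mmm` SRT timestamps; the function names `slow_down` and `rescale_srt` are hypothetical, and plain integer milliseconds are used here instead of the 'Datetime' package.

```python
import re
import subprocess

# Matches SRT timestamps such as 00:01:14,632
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def slow_down(src, dst):
    """Halve the tempo (doubling the duration) with ffmpeg's atempo filter.

    Hypothetical helper: requires ffmpeg on PATH.
    """
    subprocess.run(
        ["ffmpeg", "-i", src, "-filter:a", "atempo=0.5", dst],
        check=True,
    )


def _scale(match, factor):
    """Convert one timestamp to milliseconds, scale it, and format it back."""
    h, m, s, ms = (int(g) for g in match.groups())
    total_ms = ((h * 60 + m) * 60 + s) * 1000 + ms
    scaled = round(total_ms * factor)
    h, rem = divmod(scaled, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def rescale_srt(text, factor=0.5):
    """Scale every timestamp in an SRT string; 0.5 undoes atempo=0.5."""
    return TIMESTAMP.sub(lambda m: _scale(m, factor), text)
```

Usage would be along the lines of `rescale_srt(open("slow.srt").read())` on the transcript produced from the slowed-down audio; the file name is only an example.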
@puresky07
Thank you both! The problem is overlapping timestamps: segment B starts before segment A ends. I thought it was caused by cross-talk too, which I would understand, but it is not. For example (from a .word.srt file):
62 00:00:37,316 --> 00:00:37,456 so
63 00:00:37,440 --> 00:00:37,500 so
64 00:00:38,745 --> 00:00:39,206 we
And the speaker doesn't repeat "so" at this particular moment (just a little pause).
Btw, running WhisperX at commit ba102fe (see issue #49), I had far fewer incoherent timestamps (only 13 files, against 139 with the current version).
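For reference, a rough sketch of the kind of post-processing script mentioned earlier in the thread: it parses a word-level SRT file, lists cues that start before the previous one ends, and clamps the earlier cue's end time. This is an assumption about one reasonable fix, not something WhisperX does itself; the helper names are hypothetical, and it assumes standard SRT blocks (index line, timestamp line, text) separated by blank lines and sorted by start time.

```python
import re

TIME = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_ms(stamp):
    """Convert an SRT timestamp such as 00:00:37,316 to milliseconds."""
    h, m, s, ms = map(int, TIME.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Parse SRT text into a list of dicts with index, start, end, text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = (to_ms(t) for t in lines[1].split("-->"))
        cues.append({
            "index": int(lines[0]),
            "start": start,
            "end": end,
            "text": "\n".join(lines[2:]),
        })
    return cues


def find_overlaps(cues):
    """Return (index_a, index_b) pairs where cue B starts before cue A ends."""
    return [
        (a["index"], b["index"])
        for a, b in zip(cues, cues[1:])
        if b["start"] < a["end"]
    ]


def clamp_overlaps(cues):
    """Return a copy where each overlapping cue ends when the next one starts.

    Assumes cues are sorted by start time; duplicated words are left in place.
    """
    fixed = [dict(c) for c in cues]
    for a, b in zip(fixed, fixed[1:]):
        if b["start"] < a["end"]:
            a["end"] = b["start"]
    return fixed
```

On the three cues quoted above, `find_overlaps` would flag the pair (62, 63), and clamping would move the end of cue 62 back to 00:00:37,440. Note that this only removes the overlap; it does not decide whether the second "so" should be dropped.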
The recent VAD filter push should hopefully handle this: https://github.com/m-bain/whisperX/commit/a582a594932be9e7584afff1f266e15f2f59c383