Still some incoherent timestamps in the srt file

Open puresky07 opened this issue 2 years ago • 3 comments

@m-bain Thank you so much for your amazing work. There are still some incoherent timestamps in the word-level srt files (this was the case for 139 files out of 360 on my data). I'm about to write a Python script to parse all the srt files and fix the affected timestamps, but maybe there is a way to avoid them in the first place? They make it hard to convert the files into TextGrid files... (I use https://github.com/rctatman/SrtToTextgrid ) Besides that, WhisperX is working so well!!

puresky07 avatar Feb 01 '23 16:02 puresky07

What do you mean by incoherent timestamps? Could you be more specific, e.g. with an example?

m-bain avatar Feb 01 '23 16:02 m-bain

He might be talking about the diarize option, which is not finished yet. Whisper itself is incredibly good at timestamps and is basically a complete package (except for a few languages), but the diarize option does not take cross-talking speakers into account. I doubled the length of my audio files using ffmpeg ('-filter:a', 'atempo=0.5'); with that, the diarization is usually accurate to within 1 second when the speaker switches, and can switch between cross-talking people accurately within 3 seconds.

As a result, short answers like 'yes' and 'no' end up under the speaker who asked the question instead of the person answering. That makes sense, but it could be fixed almost completely, because Whisper detects a new sentence by itself far better than Pyannote does. So you could just treat every one- or two-word sentence as a completely new speaker, then manually search for that speaker and replace it with the right name, and of course at the end divide the times in the output file by 2 with something like the 'datetime' package.

@puresky07
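The last step above (dividing the times by 2 after slowing the audio with atempo=0.5) could be sketched roughly as follows. This is only an illustration: `rescale_timestamps` is a hypothetical helper, not part of WhisperX, and it assumes the output is a standard SRT file with `HH:MM:SS,mmm` timestamps. It uses plain integer millisecond arithmetic rather than the datetime module.

```python
import re

# Matches SRT timestamps such as 00:01:14,632
TIMESTAMP = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def rescale_timestamps(srt_text, factor=0.5):
    """Multiply every SRT timestamp in srt_text by factor.

    Hypothetical helper: factor=0.5 maps times measured on audio that
    was slowed to half speed back onto the original recording.
    """

    def _scale(match):
        h, m, s, ms = map(int, match.groups())
        total_ms = ((h * 60 + m) * 60 + s) * 1000 + ms
        total_ms = round(total_ms * factor)  # nearest millisecond
        h, rest = divmod(total_ms, 3_600_000)
        m, rest = divmod(rest, 60_000)
        s, ms = divmod(rest, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    # Caveat: any subtitle text that itself looks like a timestamp
    # would also be rewritten by this substitution.
    return TIMESTAMP.sub(_scale, srt_text)


slowed = "1\n00:01:14,632 --> 00:01:14,912\nso\n"
restored = rescale_timestamps(slowed)
```

Working on the raw text keeps the entry numbers and words untouched; only the timestamps change.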

Dec1lent avatar Feb 01 '23 18:02 Dec1lent

Thank you both! The problem is overlapping timestamps: segment B starts before segment A ends. I thought it was caused by cross-talk too, which I would understand, but it is not. For example, in this excerpt from a .word.srt file:

62
00:00:37,316 --> 00:00:37,456
so

63
00:00:37,440 --> 00:00:37,500
so

64
00:00:38,745 --> 00:00:39,206
we

And the speaker doesn't repeat "so" at this particular moment (there is just a short pause).
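A minimal sketch of the kind of clean-up script mentioned earlier in the thread, using the three entries above as input. All function names here are hypothetical, and the repair strategy is an assumption: whenever an entry starts before the previous one ends, the previous entry's end time is pulled back to the next entry's start. It assumes the entries are already sorted by start time, and it does not remove the duplicated "so".

```python
import re
from datetime import timedelta

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")
ONE_MS = timedelta(milliseconds=1)


def parse_time(text):
    """Convert 'HH:MM:SS,mmm' into a timedelta."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(text.strip()).groups())
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)


def format_time(value):
    """Convert a timedelta back into 'HH:MM:SS,mmm'."""
    total_ms = value // ONE_MS  # exact integer milliseconds
    h, rest = divmod(total_ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def parse_srt(text):
    """Parse SRT text into a list of dicts (index, start, end, text)."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = (parse_time(part) for part in lines[1].split("-->"))
        entries.append({"index": int(lines[0]), "start": start,
                        "end": end, "text": "\n".join(lines[2:])})
    return entries


def fix_overlaps(entries):
    """Clamp each entry's end to the next entry's start; return the count."""
    fixed = 0
    for prev, cur in zip(entries, entries[1:]):
        if cur["start"] < prev["end"]:
            prev["end"] = cur["start"]
            fixed += 1
    return fixed


def to_srt(entries):
    """Serialize the entries back to SRT text."""
    return "\n\n".join(
        f"{e['index']}\n{format_time(e['start'])} --> "
        f"{format_time(e['end'])}\n{e['text']}" for e in entries) + "\n"


SAMPLE = """62
00:00:37,316 --> 00:00:37,456
so

63
00:00:37,440 --> 00:00:37,500
so

64
00:00:38,745 --> 00:00:39,206
we
"""

entries = parse_srt(SAMPLE)
n_fixed = fix_overlaps(entries)
```

The return value of `fix_overlaps` also makes it easy to count how many files in a batch are affected before deciding whether to rewrite them.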

puresky07 avatar Feb 02 '23 12:02 puresky07

Btw, running WhisperX at commit ba102fe (see issue #49), I had far fewer incoherent timestamps (only 13 files, against 139 with the current version).

puresky07 avatar Feb 06 '23 12:02 puresky07

The recent VAD filter push should hopefully handle this: https://github.com/m-bain/whisperX/commit/a582a594932be9e7584afff1f266e15f2f59c383

m-bain avatar Apr 01 '23 20:04 m-bain