How to generate an audio dataset in SRT format?
Hey, great repo! I am planning to do multilingual finetuning of Whisper large-v3 on 22 languages, spanning around 10k+ hours of speech corpus. The issue is that all of my transcriptions are in plain text format, and I am unsure how to port them to SRT format, where each small chunk of audio gets a few words of transcription (with the chunks split on silence and/or context boundaries).
Given my corpus, how do I precisely produce SRT-format timestamps? I have a hunch that I could use MMS/w2v-BERT models to produce highly accurate word-level timestamps, and then somehow merge those word timestamps into short chunkable sequences of text, i.e. SRT cues.
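For the second step (word timestamps to SRT cues), here is a minimal sketch of what I have in mind, assuming the aligner outputs `(word, start, end)` tuples in seconds. The gap threshold and max cue length are made-up values that would need tuning per language:

```python
def fmt_srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_gap=0.5, max_words=8):
    """Group (word, start, end) tuples into SRT cues.

    A new cue starts when the silence gap between consecutive words
    exceeds max_gap seconds, or the cue already holds max_words words.
    Both thresholds are arbitrary placeholders, not tuned values.
    """
    cues, current = [], []
    for w in words:
        if current and (w[1] - current[-1][2] > max_gap or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, 1):
        start, end = cue[0][1], cue[-1][2]
        text = " ".join(w[0] for w in cue)
        blocks.append(f"{i}\n{fmt_srt_time(start)} --> {fmt_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)
```

So e.g. `words_to_srt([("hello", 0.0, 0.4), ("world", 0.45, 0.9), ("next", 2.0, 2.5)])` would emit two cues, splitting at the 1.1 s silence gap. Does this seem like a reasonable approach, or is there an established pipeline for this?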
All help and guidance is welcome and would mean a lot.