How to generate an audio dataset in SRT format?
Hey, great repo! I am planning to do multilingual finetuning of Whisper large-v3 on 22 languages, spanning around 10k+ hours of speech corpus. The issue is that all of my transcriptions are in plain text format, and I am unsure how to port them to SRT format, where each small chunk of audio gets a few words of transcription (with the chunks split on silence and/or context boundaries).
Given my corpus, how do I precisely produce SRT-format timestamps? I have a hunch that I could use MMS/w2v-BERT models to produce highly accurate word-level timestamps, and then somehow merge those word timestamps into short chunkable sequences of text, i.e. SRT cues.
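For the second step (word timestamps to SRT cues), here is a minimal sketch of what I have in mind, assuming the aligner outputs `(word, start, end)` tuples in seconds. The gap threshold and max cue length are made-up values that would need tuning per language:

```python
def fmt_srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_gap=0.5, max_words=8):
    """Group (word, start, end) tuples into SRT cues.

    A new cue starts when the silence gap between consecutive words
    exceeds max_gap seconds, or the cue already holds max_words words.
    Both thresholds are arbitrary placeholders, not tuned values.
    """
    cues, current = [], []
    for w in words:
        if current and (w[1] - current[-1][2] > max_gap or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, 1):
        start, end = cue[0][1], cue[-1][2]
        text = " ".join(w[0] for w in cue)
        blocks.append(f"{i}\n{fmt_srt_time(start)} --> {fmt_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)
```

So e.g. `words_to_srt([("hello", 0.0, 0.4), ("world", 0.45, 0.9), ("next", 2.0, 2.5)])` would emit two cues, splitting at the 1.1 s silence gap. Does this seem like a reasonable approach, or is there an established pipeline for this?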
All help and guidance is welcome and would mean a lot.