WhisperLive
Timestamping transcriptions?
Would it be possible to add an option to emit the transcribed output as timestamp-prefixed lines, or with some other mark/metadata indicating when each word occurs in the source media?
This is the way I was thinking I could hack it, if there isn't any way to surface this from the lower-level implementation:
- Split the audio into chunks of `lineDuration` seconds (where `lineDuration` is the number of seconds to elapse between each line, like 5 or 10).
- Get the transcript for each of those spans of audio.
- To ensure no words are getting cut on the clip boundary, produce a transcript for the `gapSpan` seconds of audio on either side of the cut boundary (where `gapSpan` is some amount of time we expect the transcription to become stable within: I would guess something like four seconds would probably be fine).
  - If the transcript of the seam section conflicts in its middle with the transcript of the two sections concatenated, replace the words (in roughly balanced proportion) at the ends of the lines with the transcribed words from the seam.
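As a rough sketch of the bookkeeping behind the steps above (not anything WhisperLive provides), the chunk and seam windows could be computed like this. The function names are made up for illustration, and `transcribe` is a hypothetical callable that returns the text for a given `(start, end)` range in seconds; the seam-merging step itself is left out.

```python
import math
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def chunk_spans(total: float, line_duration: float) -> List[Span]:
    """Split [0, total) into consecutive chunks of line_duration seconds."""
    if line_duration <= 0:
        raise ValueError("line_duration must be positive")
    count = math.ceil(total / line_duration)
    # The last chunk is clamped so it never runs past the end of the audio.
    return [
        (i * line_duration, min((i + 1) * line_duration, total))
        for i in range(count)
    ]


def seam_spans(spans: List[Span], gap_span: float, total: float) -> List[Span]:
    """One window per interior cut boundary, gap_span seconds on either side."""
    return [
        (max(0, end - gap_span), min(total, end + gap_span))
        for _, end in spans[:-1]  # the final chunk has no boundary after it
    ]


def format_timestamp(seconds: float) -> str:
    """Render a time offset as HH:MM:SS (fractions of a second are dropped)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def timestamped_lines(
    total: float,
    line_duration: float,
    transcribe: Callable[[float, float], str],  # hypothetical, supplied by caller
) -> List[str]:
    """Prefix each chunk's transcript with the chunk's start time."""
    return [
        f"[{format_timestamp(start)}] {transcribe(start, end)}"
        for start, end in chunk_spans(total, line_duration)
    ]
```

For 12 seconds of audio with `lineDuration` of 5 and `gapSpan` of 4, this gives chunks at 0-5, 5-10 and 10-12, with seam windows at 1-9 and 6-12 to re-transcribe and compare against the ends of the neighbouring lines.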
I see now that #211 links to a fork with word-level timestamps: it looks like someone still needs to submit a pull request?