Finetuning with timestamps

Open sinisha opened this issue 2 years ago • 0 comments

I want to fine-tune Whisper with timestamps. There are some guidelines scattered around, but I am still unsure about several of the steps. Could someone comment on the following questions:

Should the transcriptions be in the following format: <|0.0|>This is my transcription<|2.0|>?
Should the timestamp values be integer multiples of 0.02 (or, more generally, how many decimal places can we use)?
Is it better to set the timestamps according to the actual speech content, i.e. to run VAD on the audio files?
Is it beneficial to use timestamps in the middle of a sentence, and what should the format be? Is this OK: <|0.0|>This is my transcription<|2.0|><|2.0|>Some additional text<|3.5|>?
Is it better to use timestamps for the whole training material or just for some parts of it?
Do we need to set any special parameters in the training config when using timestamps?
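
To make the format question concrete, here is a minimal sketch of how I would build the target strings. It assumes a 0.02 s step and two decimal places in the timestamp tokens, which is exactly what I am unsure about. The helper names `format_timestamp` and `build_target` are my own and do not come from any library.

```python
def format_timestamp(seconds, step=0.02):
    """Snap a time in seconds to the nearest multiple of `step` and wrap it
    as a timestamp token. The 0.02 s step and the two-decimal formatting are
    assumptions, not confirmed behaviour."""
    n_steps = round(seconds / step)
    return f"<|{n_steps * step:.2f}|>"


def build_target(segments):
    """Join (start, end, text) segments into one training target string,
    with an opening and a closing timestamp token around every segment."""
    return "".join(
        f"{format_timestamp(start)}{text}{format_timestamp(end)}"
        for start, end, text in segments
    )


if __name__ == "__main__":
    segments = [
        (0.0, 2.0, "This is my transcription"),
        (2.0, 3.5, "Some additional text"),
    ]
    print(build_target(segments))
```

With this sketch, a boundary inside an utterance produces two adjacent tokens (the end of one segment followed by the start of the next), matching the second example above.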

Jun 09 '23 09:06 sinisha