Finetuning with timestamps
I want to finetune Whisper with timestamps. There are some guidelines scattered around, but I am still not sure about all the steps. Could someone comment on the following points:
- Should the transcriptions be in the following format: <|0.0|>This is my transcription<|2.0|> ?
- Should the timestamp values be integer multiples of 0.02, and how many decimal places can we use?
- Is it better to set the timestamps according to the actual speech content, i.e. to run VAD on the audio files?
- Is it beneficial to put timestamps in the middle of a sentence, and if so, what should the format be? Is this OK: <|0.0|>This is my transcription<|2.0|><|2.0|>Some additional text<|3.5|> ?
- Is it better to use timestamps for the whole training material or only for some parts of it?
- Do we need to set any special parameters in the training config when using timestamps?
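
To make the format question concrete, here is a minimal sketch of how I would currently build the label strings. It rests on two assumptions I would like confirmed: that timestamps sit on a 0.02 s grid within a 30 s window, and that the tokens are written with two decimals (`<|2.00|>` rather than `<|2.0|>`), which is how the Whisper tokenizer's vocabulary appears to spell them. `snap` and `format_segments` are hypothetical helpers of my own, not part of Whisper or any library.

```python
# Hypothetical helpers for building timestamped training labels.
# Assumptions (not confirmed): 0.02 s timestamp grid, 30 s audio window,
# two-decimal token spelling such as <|2.00|>.
TIME_PRECISION = 0.02
MAX_TIME = 30.0


def snap(t):
    """Round t (seconds) to the nearest grid step, clamped to [0, MAX_TIME]."""
    max_steps = round(MAX_TIME / TIME_PRECISION)
    steps = max(0, min(round(t / TIME_PRECISION), max_steps))
    return steps * TIME_PRECISION


def format_segments(segments):
    """Join (start, end, text) segments into one timestamped label string."""
    parts = []
    for start, end, text in segments:
        parts.append(f"<|{snap(start):.2f}|>{text.strip()}<|{snap(end):.2f}|>")
    return "".join(parts)


label = format_segments([
    (0.0, 2.0, "This is my transcription"),
    (2.0, 3.5, "Some additional text"),
])
print(label)
```

With the two segments above this produces `<|0.00|>This is my transcription<|2.00|><|2.00|>Some additional text<|3.50|>`, i.e. the end timestamp of one segment is immediately followed by the start timestamp of the next. Whether that is the right target format for finetuning is exactly what I am asking.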