self-supervised-speech-recognition
audio segmentation
Hi... As recommended on GitHub, the best chunk size is 10 to 30 seconds. However, the Librispeech dataset is split into chunks of various sizes, starting from 2 seconds. My question is: what is the optimal chunk size? And is it okay to pre-train on audio of different lengths and fine-tune on chunks of a fixed length, or the opposite (fixed for pre-training and variable for fine-tuning)?
Further, when splitting the audio into chunks (e.g., at a fixed size of 3 s), some spoken words might be cut off. What would be a better approach for splitting the audio, given that relying on silences alone results in larger chunks? (A rough sketch of a combined approach is below.)
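For illustration only, here is a minimal sketch of one way to combine the two strategies: cut at silences first so chunks end on word boundaries, then cap anything still longer than 30 s with fixed-size slices. It assumes pydub is installed; the file names and every threshold (pause length, silence level, min/max chunk length) are placeholders that would need tuning on your data.

```python
# Sketch: silence-based splitting with a fixed-size fallback for long pieces.
# All thresholds are illustrative, not recommendations from the repo.
from pydub import AudioSegment
from pydub.silence import split_on_silence

MAX_LEN_MS = 30_000   # 30 s upper bound
MIN_LEN_MS = 2_000    # discard fragments shorter than 2 s

def chunk_audio(wav_path):
    audio = AudioSegment.from_wav(wav_path)

    # First pass: cut at pauses so chunks end on word boundaries.
    pieces = split_on_silence(
        audio,
        min_silence_len=300,              # pause (ms) treated as a boundary
        silence_thresh=audio.dBFS - 16,   # "silence" relative to average loudness
        keep_silence=100,                 # keep a little padding around speech
    )

    # Second pass: slice any piece still longer than MAX_LEN_MS at fixed
    # intervals (only here can a word still be cut mid-utterance).
    chunks = []
    for piece in pieces:
        if len(piece) <= MAX_LEN_MS:
            chunks.append(piece)
        else:
            for start in range(0, len(piece), MAX_LEN_MS):
                chunks.append(piece[start:start + MAX_LEN_MS])

    # Drop fragments too short to be useful for training.
    return [c for c in chunks if len(c) >= MIN_LEN_MS]

if __name__ == "__main__":
    for i, chunk in enumerate(chunk_audio("recording.wav")):
        chunk.export(f"chunk_{i:04d}.wav", format="wav")
```

With this kind of approach the fixed-size cut only applies to long stretches of uninterrupted speech, so chunks stay variable in length but word cuts become much rarer.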
Thanks in advance.
Hello, sorry I can't solve your question, but I want to ask: how were you able to git checkout c8a0...?