
audio segmentation

Open Maria-Habib opened this issue 3 years ago • 1 comment

Hi... As recommended on GitHub, the best chunk size is 10 to 30 seconds. However, the LibriSpeech dataset is split into chunks of various lengths, starting from about 2 seconds. My question is: what is the optimal chunk size? And is it okay to pre-train on audio of varying lengths and fine-tune on chunks of a fixed length, or the opposite (fixed lengths for pre-training and variable lengths for fine-tuning)?
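
For reference, here is a minimal sketch (my own, not from this repo; the `chunks/` directory and the 10–30 s bounds are just placeholders) of how the duration distribution of existing chunks could be inspected before deciding:

```python
# Sketch: report duration statistics for a directory of audio chunks.
import glob
import soundfile as sf

durations = []
for path in glob.glob("chunks/*.wav"):  # hypothetical chunk directory
    info = sf.info(path)
    durations.append(info.frames / info.samplerate)

if durations:
    print(f"{len(durations)} chunks, "
          f"min {min(durations):.1f}s, max {max(durations):.1f}s, "
          f"mean {sum(durations) / len(durations):.1f}s")
    print(f"{sum(d < 10 for d in durations)} chunks shorter than 10s, "
          f"{sum(d > 30 for d in durations)} longer than 30s")
```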

Further, when splitting the audio into chunks (e.g., at a fixed size of 3 seconds), some spoken words might be cut off. What would be a better approach for splitting the audio, given that relying on silences results in larger chunk sizes? A sketch of what I have in mind is below.
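
This is only a sketch of one possible compromise (not part of this repo; `librosa`/`soundfile` and the `top_db` threshold are my own assumptions): split on silence first, so word boundaries are respected where possible, and then hard-cap any segment that is still longer than the maximum length.

```python
# Sketch: silence-based splitting with a hard duration cap.
import librosa
import soundfile as sf

def split_with_cap(path, out_prefix, top_db=30, max_len=30.0, min_len=1.0):
    y, sr = librosa.load(path, sr=16000)                  # wav2vec 2.0 expects 16 kHz
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent regions
    max_samples = int(max_len * sr)
    chunk_id = 0
    for start, end in intervals:
        seg = y[start:end]
        # If a silence-delimited segment is still too long, cut it at max_len.
        for i in range(0, len(seg), max_samples):
            piece = seg[i:i + max_samples]
            if len(piece) / sr < min_len:                  # drop tiny fragments
                continue
            sf.write(f"{out_prefix}_{chunk_id:04d}.wav", piece, sr)
            chunk_id += 1
```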

Thanks in advance.

Maria-Habib avatar Oct 10 '21 18:10 Maria-Habib

Hello, I'm sorry I can't solve your question, but I wanted to ask: how did you git checkout c8a0....?

blessyyyu avatar Oct 17 '21 13:10 blessyyyu