
Scripts to reproduce w2v-BERT 2.0 pretraining?

Open StephennFernandes opened this issue 1 year ago • 2 comments

I have a number of private unlabelled speech corpora for Indian language families, so it's an obvious choice for me to continually pre-train the w2v-BERT 2.0 model on this extended dataset to get the best possible performance on downstream tasks.

Hence, if possible, could you let me know whether a pretraining script is available for w2v-BERT 2.0 that I could use to further pretrain the released checkpoint for improved performance?

Additionally, could you also suggest suitable hyperparameters for continual pretraining of the model?

StephennFernandes avatar Jan 28 '24 20:01 StephennFernandes

I've already checked that; I need to pretrain on my unsupervised dataset to get further performance gains on downstream tasks.

StephennFernandes avatar Jan 28 '24 20:01 StephennFernandes

The Wav2Vec-BERT 2.0 encoder has been trained on 4.5 million hours of unlabeled audio covering 143 languages, and the newer version was trained on more low-resource languages. See Section 3.2.1 of the Seamless paper for more on the pre-training of this model. The point is that continuing to train the encoder on your own data would likely make it forget its vast repertoire of speech patterns, so it's not recommended.

Even so, check the model card of Wav2Vec-BERT 2.0 for more information on fine-tuning it for your custom set of languages.
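As a rough illustration of that fine-tuning route (rather than the self-supervised pretraining asked about above), here is a minimal sketch assuming the Hugging Face `transformers` implementation and the public `facebook/w2v-bert-2.0` checkpoint; the `vocab.json` path is a hypothetical character vocabulary you would build from your own transcripts, and the training loop itself is omitted.

```python
# Minimal CTC fine-tuning sketch for w2v-BERT 2.0 (assumes transformers >= 4.37).
# "vocab.json" is a placeholder vocabulary file built from your own labeled data.
from transformers import (
    Wav2Vec2BertForCTC,
    Wav2Vec2BertProcessor,
    Wav2Vec2CTCTokenizer,
    SeamlessM4TFeatureExtractor,
)

# Tokenizer over your target languages' characters; feature extractor from the released checkpoint.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pretrained encoder and attach a randomly initialized CTC head sized to your vocabulary.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

# From here, wrap your audio/transcript pairs in a Dataset and train (e.g. with the Trainer API),
# keeping a small learning rate so the pretrained encoder is not disrupted too quickly.
```

Note that this only covers supervised fine-tuning with labeled audio/transcript pairs; it does not reproduce the self-supervised pretraining objective described in Section 3.2.1 of the paper.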

Awaisn25 avatar Feb 22 '24 05:02 Awaisn25