self-supervised-speech-recognition
Pretraining larger models?
"Please ensure that the architectures match.".format(filename) Exception: Cannot load model parameters from checkpoint /content/self-supervised-speech-recognition/wav2vec_small_960h.pt; please ensure that the architectures match.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Can I pretrain the different versions of the wav2vec with the same code?
Yes, you can, but you won't be able to leverage the pretrained model (training from scratch is computationally expensive). If you want a larger model, my recommendation is to use the pretrained large model from https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt.
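As a minimal sketch, downloading that checkpoint could look like this (the destination path is an assumption; save it wherever your scripts expect the checkpoint):

```python
import os
import urllib.request

# Checkpoint URL from the answer above.
CKPT_URL = "https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt"

# Destination path is an assumption; point --init_model at wherever you save it.
DEST = "checkpoints/wav2vec_vox_new.pt"

os.makedirs(os.path.dirname(DEST), exist_ok=True)
urllib.request.urlretrieve(CKPT_URL, DEST)
print("Saved large wav2vec 2.0 checkpoint to", DEST)
```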
For pre-training:
You need to point the --init_model arg to the large model.
Try decreasing the batch size to avoid OOM problems.
Replace these lines:
cmd.append("+optimization.update_freq='[" + str(int(64/NUM_GPU)) + "]'")
cmd.append("--config-name wav2vec2_base_librispeech")
with:
cmd.append("+optimization.update_freq='[" + str(int(128/NUM_GPU)) + "]'")
cmd.append("--config-name wav2vec2_large_librivox")
For fine-tuning:
Edit this line:
cmd.append("--config-name " + config_name)
Replace the config_name variable with the exact config you want from conf/finetuning (e.g. vox_100h, vox_10h, ...).
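For example, hardcoding the 100-hour large-model recipe would be a one-line change (a sketch; vox_100h is just one of the configs in conf/finetuning, pick whichever matches the amount of labeled data you have):

```python
# Fine-tune the large (vox) model with the 100h labeled-data recipe.
cmd.append("--config-name vox_100h")
```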
When I try to decrease the batch size, I end up with this error.
AssertionError: Sentences lengths should not exceed max_tokens=120000
As I've found out through testing, max_tokens should not be less than 16000 * the number of seconds of your longest wav file.
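A small sketch to check that constraint against your own data before picking a batch size (soundfile and the directory layout are assumptions; any way of reading clip lengths works):

```python
import glob
import soundfile as sf  # assumption: soundfile is available for reading wav headers

# Directory containing the training wav files (adjust to your dataset layout).
WAV_DIR = "data/wavs"

# Length, in samples, of the longest clip in the training set.
longest = max(sf.info(path).frames for path in glob.glob(WAV_DIR + "/*.wav"))

# At 16 kHz, frames == 16000 * duration_in_seconds, so max_tokens must be at
# least this value or the assertion above will fire.
print("Longest clip:", longest, "samples (", longest / 16000.0, "seconds )")
print("max_tokens must be >=", longest)
```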