
Pretraining larger models?

adithyaur99 opened this issue 4 years ago · 3 comments

"Please ensure that the architectures match.".format(filename) Exception: Cannot load model parameters from checkpoint /content/self-supervised-speech-recognition/wav2vec_small_960h.pt; please ensure that the architectures match.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Can I pretrain different versions of wav2vec with the same code?

adithyaur99 · Jan 19 '21

Yes, you can do that, but you won't be able to leverage the pretrained model (training from scratch is computationally expensive). If you want a larger model, my recommendation is to use the pretrained large model from https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt.

For pre-training: you need to point the --init_model arg to the large model.
Try to decrease the batch size to avoid OOM problems.
Replace these lines:

cmd.append("+optimization.update_freq='[" + str(int(64/NUM_GPU)) + "]'")
cmd.append("--config-name wav2vec2_base_librispeech")

with:

cmd.append("+optimization.update_freq='[" + str(int(128/NUM_GPU)) + "]'")
cmd.append("--config-name wav2vec2_large_librivox")

For fine-tuning: edit this line: cmd.append("--config-name " + config_name) and replace the config_name variable with the exact config you want from conf/finetuning (e.g. vox_100h, vox_10h, ...).
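For example, if you want the 100-hour config for the large model, that line would simply become the following (vox_100h is one of the configs under conf/finetuning; pick whichever matches your amount of labeled data):

# Example only: hard-code one finetuning config from conf/finetuning
cmd.append("--config-name vox_100h")  # alternatives: vox_10h, vox_1h, ...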

mailong25 · Jan 19 '21

When I try to decrease the batch size, I end up with this error.

AssertionError: Sentences lengths should not exceed max_tokens=120000

adithyaur99 · Jan 22 '21

As I've found out through testing, the max_tokens should not be less than 16000 * nr_of_seconds_of_your_largest_wav_file.
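In case it's useful, here is a small helper I'd sketch (not part of the repo) to compute that lower bound directly from a directory of 16 kHz wavs, using soundfile to read durations:

import glob
import soundfile as sf

def min_max_tokens(wav_dir, sample_rate=16000):
    """Smallest safe max_tokens: samples in the longest wav, i.e. 16000 * its duration in seconds."""
    longest_seconds = 0.0
    for path in glob.glob(wav_dir + "/**/*.wav", recursive=True):
        info = sf.info(path)
        longest_seconds = max(longest_seconds, info.frames / info.samplerate)
    return int(sample_rate * longest_seconds)

print(min_max_tokens("/path/to/your/wavs"))  # set max_tokens to at least this value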

TaridaGeorge · Mar 03 '21