fairseq
Can I use the wav2vec2 base model to train my own dataset which is of 8000 sample rate?
Kindly help me out.
Also, if yes, how much data would I need to annotate?
You probably need to change the striding in the feature extractor so that your representations encode approximately the same amount of audio as they would at 16 kHz. I am not sure what the best configuration is; you would have to experiment. Then you would have to pre-train the model on a large amount of unlabeled 8 kHz audio.
Another option, which we use more often, is to simply upsample your 8 kHz audio to 16 kHz (using something like sox) and then just use the official recipe for pretraining.
If you just want to fine-tune an existing model on 8 kHz audio, you should probably upsample it first. The more audio you annotate, the higher the final quality. If your domain differs a lot from the published models, though, the accuracy may not be great.
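To illustrate what the upsampling step does conceptually (in practice you would use sox or torchaudio rather than anything hand-rolled), here is a toy 2x linear-interpolation resampler; the function name and data are made up for the example:

```python
# Toy sketch of 8 kHz -> 16 kHz upsampling by linear interpolation.
# Real tools (sox, torchaudio) use proper band-limited interpolation;
# this only shows the idea: each 8 kHz sample becomes two 16 kHz samples.

def upsample_2x(samples):
    """Double the sample rate by inserting the midpoint between neighbours."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)
        out.append((a + b) / 2.0)  # interpolated sample halfway between a and b
    out.append(samples[-1])        # keep the final original sample
    return out

audio_8k = [0.0, 1.0, 0.0, -1.0]   # four samples at 8 kHz
audio_16k = upsample_2x(audio_8k)
print(audio_16k)  # [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0]
```

The duration of the clip stays the same; only the number of samples per second doubles, which is what lets a 16 kHz model consume it.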
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
It's affecting me :) I was able to train a wav2vec 2.0 large model and fine-tune a pre-trained model with CTC using fairseq-hydra-train on 8 kHz wavs simply by changing all occurrences of 16000 and 16_000 in the repo to 8000 and 8_000, but
> change the striding in feature extractor
could you provide more detail about which edits to make in which files? My current guess is:
fairseq/models/wav2vec/wav2vec.py:56 (reducing the first two strides):

```diff
-    default="[(512, 10, 5), (512, 8, 4), (512, 4, 2),...
+    default="[(512, 10, 4), (512, 8, 3), (512, 4, 2),...
     metadata={
         "help": "convolutional feature extraction layers [(dim, kernel_size, stride), ...]"
```

and/or change dim from 512 to 256.
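One way to sanity-check a guessed configuration is to compute the total stride of the conv stack and the resulting frame duration (total stride / sample rate). The layer lists below are illustrative: the leading entries match the snippet above, and the `(512, 4, 2)` tail is a hypothetical continuation of the elided part, not the verified fairseq default:

```python
# Frame duration of a conv feature extractor = (product of strides) / sample_rate.
# To keep the same duration per frame at 8 kHz, you want roughly half the
# total stride used at 16 kHz.
from math import prod

def total_stride(layers):
    """layers: list of (dim, kernel_size, stride) tuples."""
    return prod(stride for _dim, _kernel, stride in layers)

# Illustrative configs: first entries from the diff above, tail assumed.
layers_16k = [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2)]
layers_8k  = [(512, 10, 4), (512, 8, 3), (512, 4, 2), (512, 4, 2), (512, 4, 2)]

ms_16k = 1000 * total_stride(layers_16k) / 16_000
ms_8k  = 1000 * total_stride(layers_8k) / 8_000
print(total_stride(layers_16k), ms_16k)  # 160 -> 10.0 ms per frame
print(total_stride(layers_8k), ms_8k)    # 96 -> 12.0 ms per frame
```

Note that under these assumed layer lists the guessed strides give 12 ms frames rather than 10 ms; matching the 16 kHz frame duration exactly would require a total stride of 80 at 8 kHz.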
The better way is just to resample audio to 16 kHz using torchaudio
Won't that make the forward pass much less efficient than if you were to use an 8 kHz model?