
Can I use the wav2vec2 base model to train on my own dataset, which has a sample rate of 8000 Hz?

Open: nome2050 opened this issue Feb 11 '21 • 5 comments

Can I use the wav2vec2 base model to train on my own dataset, which has a sample rate of 8000 Hz?

Kindly help me out.

Also, if yes, how much data do I need to annotate?

nome2050 · Feb 11 '21

You probably need to change the striding in the feature extractor so that your representations encode approximately the same amount of audio as they would at 16 kHz. I am not sure what the best configuration is; you would have to experiment. Then you would have to pre-train the model on a large amount of unlabeled 8 kHz audio.
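As a rough sketch of the arithmetic involved (the 16 kHz layer spec below is the conv_feature_layers default from wav2vec2.py; the 8 kHz variant is only an illustrative guess, since the best configuration would have to be found experimentally, as noted above):

```python
# A minimal sketch of the bookkeeping behind this suggestion. The 16 kHz
# spec is the wav2vec 2.0 conv_feature_layers default; the 8 kHz variant
# (last conv stride dropped from 2 to 1) is a hypothetical guess, not a
# validated configuration.

def total_stride(layers):
    """Product of per-layer strides for a [(dim, kernel, stride), ...] spec."""
    product = 1
    for _dim, _kernel, stride in layers:
        product *= stride
    return product

# Default from wav2vec2.py: total stride 320 samples = 20 ms per frame at 16 kHz.
layers_16k = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2

# Hypothetical 8 kHz variant: total stride 160 samples = 20 ms per frame at 8 kHz.
layers_8k = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2), (512, 2, 1)]

for name, layers, rate in [("16 kHz default", layers_16k, 16_000),
                           ("8 kHz variant", layers_8k, 8_000)]:
    stride = total_stride(layers)
    print(f"{name}: stride {stride} samples = {1000 * stride / rate:.0f} ms/frame")
```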

Another option, which we use more often, is to simply upsample your 8 kHz audio to 16 kHz (using something like sox) and then just use the official recipe for pre-training.
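If you would rather do that upsampling from Python than from the sox command line, here is a minimal sketch using torchaudio's sox-effects binding (file paths are placeholders):

```python
# Minimal sketch: upsample an 8 kHz file to 16 kHz with torchaudio's sox
# effects binding (roughly sox's `rate` effect). Paths are placeholders.
import torchaudio

waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
    "in_8k.wav", effects=[["rate", "16000"]]
)
torchaudio.save("out_16k.wav", waveform, sample_rate)  # sample_rate is now 16000
```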

If you just want to fine-tune an existing model on 8 kHz audio, you should probably upsample it first. The more audio you annotate, the higher the final quality. If the domain differs a lot from the published models, though, the accuracy may not be great.

alexeib · Feb 12 '21

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] · Jun 28 '21

It's affecting me :) I was able to "Train a wav2vec 2.0 large model" and "Fine-tune a pre-trained model with CTC" using fairseq-hydra-train with 8 kHz wavs, simply by changing all occurrences of 16000 and 16_000 in the repo to 8000 and 8_000, but

> change the striding in feature extractor

Could you provide more detail about what edits to make, and in which files? My current guess is:

fairseq/models/wav2vec/wav2vec.py:56:
-default="[(512, 10, 5), (512, 8, 4), (512, 4, 2),...
+default="[(512, 10, 4), (512, 8, 3), (512, 4, 2),...
                     ^            ^
 metadata={
     "help": "convolutional feature extraction layers [(dim, kernel_size, stride), ...]"

and/or change dim from 512 to 256.

webbp · Feb 10 '22

The better way is just to resample audio to 16 kHz using torchaudio
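A minimal sketch of that approach, assuming 8 kHz input and placeholder file names:

```python
# Minimal sketch: resample 8 kHz audio to 16 kHz with torchaudio's native
# Resample transform before feeding it to a 16 kHz wav2vec 2.0 model.
import torchaudio
from torchaudio.transforms import Resample

waveform, sr = torchaudio.load("utterance_8k.wav")  # expect sr == 8000
resample = Resample(orig_freq=sr, new_freq=16_000)
torchaudio.save("utterance_16k.wav", resample(waveform), 16_000)
```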

egorsmkv · Jun 02 '22

> The better way is just to resample audio to 16 kHz using torchaudio

Won't that make the forward pass much less efficient than if you were to use an 8 kHz model?
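Back-of-envelope, and assuming a hypothetical native-8 kHz model that halves the feature-extractor stride to 160 (as in the variant sketched earlier in this thread), the transformer would see the same number of frames either way; the extra cost of upsampling would be confined to the convolutional front end processing twice as many samples:

```python
# Rough sketch of the trade-off: frame counts entering the transformer for
# 10 s of audio. Assumes a hypothetical native-8 kHz model with total
# feature-extractor stride 160 (vs. the 16 kHz default of 320).
seconds = 10
frames_upsampled = seconds * 16_000 // 320   # audio upsampled to 16 kHz
frames_native_8k = seconds * 8_000 // 160    # hypothetical 8 kHz model
print(frames_upsampled, frames_native_8k)    # -> 500 500
```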

mattbonnell · Feb 07 '24