Add training recipes for HuBERT model pre-training and ASR fine-tuning
🚀 The feature
Hidden-Unit BERT (HuBERT) is a self-supervised model for speech representations that is widely used in downstream tasks such as speech recognition, speaker diarization, and speaker identification. It can achieve an impressive word error rate when fine-tuned on as little as 10 minutes of supervised data.
To fine-tune the HuBERT model for a customized downstream task, users currently need to install fairseq and adapt their training pipeline to it. It would be great to add a training recipe to torchaudio that loads torchaudio's HuBERT model and simplifies the training process.
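As a rough illustration, the recipe could build on the model factory and pretrained pipeline torchaudio already ships. A minimal sketch, assuming `torchaudio.models.hubert_base` for a from-scratch model and the `HUBERT_BASE` pipeline for pretrained weights:

```python
import torch
import torchaudio

# Randomly initialized HuBERT Base, e.g. as the starting point for pre-training.
model = torchaudio.models.hubert_base()

# Pretrained weights for downstream fine-tuning or feature extraction.
bundle = torchaudio.pipelines.HUBERT_BASE
pretrained = bundle.get_model()

# extract_features returns a list of per-layer representations.
waveform = torch.randn(1, 16000)  # 1 second of dummy audio at 16 kHz
features, lengths = pretrained.extract_features(waveform)
```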
Motivation, pitch
- [x] Add preprocessing scripts (MFCC feature extraction, k-means model training, pseudo-label prediction); see the first sketch after this list.
- [x] Add a PyTorch-Lightning trainer for HuBERT Base model pre-training using MFCC features; see the trainer sketch below.
- [ ] Add a PyTorch-Lightning trainer for HuBERT Large model pre-training using HuBERT Base model representations.
- [ ] Add a PyTorch-Lightning trainer for HuBERT Large model fine-tuning on LibriSpeech ASR task.
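For the preprocessing item, a hedged sketch of what MFCC extraction, k-means training, and pseudo-label prediction could look like. The 13-coefficient MFCCs, scikit-learn's `MiniBatchKMeans`, and the file paths are illustrative assumptions, not the recipe's actual code (the HuBERT paper uses 100 clusters for its first, MFCC-based iteration):

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

def extract_mfcc(path: str) -> torch.Tensor:
    """Return (num_frames, n_mfcc) MFCC features for one mono utterance."""
    waveform, sample_rate = torchaudio.load(path)
    transform = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)
    # (channel, n_mfcc, time) -> (time, n_mfcc)
    return transform(waveform).squeeze(0).transpose(0, 1)

# Fit k-means on frames pooled across the corpus; the file list is a placeholder.
files = ["utt1.flac", "utt2.flac"]
frames = torch.cat([extract_mfcc(f) for f in files])
kmeans = MiniBatchKMeans(n_clusters=100).fit(frames.numpy())

# Pseudo-labels: one cluster id per frame, the prediction target for pre-training.
labels = {f: kmeans.predict(extract_mfcc(f).numpy()) for f in files}
```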
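And a skeletal PyTorch-Lightning module for the Base pre-training trainer. The plain frame-level cross-entropy over cluster ids is a simplification (actual HuBERT pre-training masks spans of frames and computes the loss over masked positions), and the optimizer settings are placeholders:

```python
import pytorch_lightning as pl
import torch
import torchaudio

class HuBERTPreTrainModule(pl.LightningModule):
    def __init__(self, num_classes: int = 100, lr: float = 5e-4):
        super().__init__()
        self.model = torchaudio.models.hubert_base()
        self.head = torch.nn.Linear(768, num_classes)  # 768 = Base hidden size
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # labels: frame-level k-means cluster ids (long tensor), assumed to be
        # aligned to the model's output frame rate.
        waveforms, labels = batch
        features, _ = self.model(waveforms)  # (batch, frame, 768)
        logits = self.head(features)
        # cross_entropy expects (batch, classes, frame) against (batch, frame).
        loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```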
Alternatives
No response
Additional context
No response