Table of Contents generated with DocToc
- Awesome-Speech-Pretraining
  - Papers
    - 2018
    - 2019
    - 2020
    - 2021
    - 2022
    - 2023
    - Speech + Text
    - SSL for Audio
    - SSL for TTS
    - SSL Model Distillation, Compression and Acceleration
  - Resources
  - Statistics
    - wav2vec 2.0
      - Pre-training
      - Fine-tuning
      - wav2vec-u
    - HuBERT
      - Pre-training
      - Fine-tuning
Awesome-Speech-Pretraining
Papers, Resources, and Statistics for Self-Supervised Learning and Pre-Training on Speech.
🌟 marks important papers.
Papers
2018
- 🌟 CPC: Representation Learning with Contrastive Predictive Coding - A Oord et al, arXiv 2018
2019
- APC: An Unsupervised Autoregressive Model for Speech Representation Learning - YA Chung et al, INTERSPEECH 2019
- 🌟 wav2vec: Unsupervised Pre-training for Speech Recognition - S Schneider et al, INTERSPEECH 2019
- 🌟 vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations - A Baevski et al, arXiv 2019, ICLR 2020
- MPC: Improving Transformer-based Speech Recognition Using Unsupervised Pre-training - D Jiang et al, arXiv 2019
- PASE: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - S Pascual et al, INTERSPEECH 2019
2020
- Bidir CPC: Learning robust and multilingual speech representations - K Kawakami et al, EMNLP 2020
- Multi-target APC: Improved speech representations with multi-target autoregressive predictive coding - YA Chung et al, ACL 2020
- Modified CPC: Unsupervised pretraining transfers well across languages - M Riviere et al, ICASSP 2020
- Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders - AT Liu et al, ICASSP 2020
- vq-wav2vec-FT: Effectiveness of self-supervised pre-training for ASR - A Baevski et al, ICASSP 2020
- DeCoAR: Deep contextualized acoustic representations for semi-supervised speech recognition - S Ling et al, ICASSP 2020
- Improved noisy student training for automatic speech recognition - DS Park et al, INTERSPEECH 2020
- 🌟 wav2vec 2.0: A framework for self-supervised learning of speech representations - A Baevski et al, NeurIPS 2020
- Multi-lingual wav2vec 2.0: Unsupervised cross-lingual representation learning for speech recognition - A Conneau et al, arXiv 2020
- Self-Training wav2vec 2.0: Self-training and Pre-training are Complementary for Speech Recognition - Q Xu et al, arXiv 2020, ICASSP 2021
- DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization, arXiv 2020, ICASSP 2021
- Pushing the limits of semi-supervised learning for automatic speech recognition - Y Zhang et al, arXiv 2020, NeurIPS Workshop 2020
2021
- UniSpeech: Unified speech representation learning with labeled and unlabeled data - C Wang et al, ICML 2021
- TERA: Self-supervised learning of transformer encoder representation for speech - AT Liu et al, TASLP 2021
- Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training - WN Hsu et al, INTERSPEECH 2021
- Zero-shot wav2vec 2.0: Simple and Effective Zero-shot Cross-lingual Phoneme Recognition - Q Xu et al, arXiv 2021
- 🌟 wav2vec-U: Unsupervised Speech Recognition - A Baevski et al, NeurIPS 2021
- 🌟 HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - WN Hsu et al, TASLP 2021
- 🌟 SUPERB: Speech processing Universal PERformance Benchmark - S Yang et al, INTERSPEECH 2021
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition - G Zheng et al, EMNLP 2021
- ILS-SSL: Self-Supervised Learning for speech recognition with Intermediate layer supervision - C Wang et al, arXiv 2021, ICASSP 2022
- WavLM: Large-scale self-supervised pre-training for full stack speech processing - S Chen et al, arXiv 2021, JSTSP 2022
- BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition - Y Zhang et al, arXiv 2021, JSTSP 2022
- SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing - J Ao et al, arXiv 2021, ACL 2022
2022
- 🌟 Data2vec: A general framework for self-supervised learning in speech, vision and language - A Baevski et al, ICML 2022
- BEST-RQ: Self-supervised Learning with Random-projection Quantizer for Speech Recognition - CC Chiu et al, ICML 2022
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities - HS Tsai et al, ACL 2022
- 🌟 wav2vec-U 2.0: Towards End-to-end Unsupervised Speech Recognition - AH Liu et al, SLT 2022
- c-siam: Contrastive Siamese Network for Semi-Supervised Speech Recognition - S Khorram et al, ICASSP 2022
- Speech2C: Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data - J Ao et al, INTERSPEECH 2022
- SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training - W Huang et al, ICLR 2022
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages - F Wu et al, arXiv 2022, ICASSP 2023
- HuBERT-AP: Speech Pre-training with Acoustic Piece - S Ren et al, INTERSPEECH 2022
- PBERT: Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training - C Wang et al, INTERSPEECH 2022
- data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language - A Baevski et al, arXiv 2022
- CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning - C Meng et al, arXiv 2022, INTERSPEECH 2023
- MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets - Z Ma et al, arXiv 2022, INTERSPEECH 2023
2023
- CTCBERT: Advancing Hidden-unit BERT with CTC Objectives - R Fan et al, ICASSP 2023
- data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup - VS Lodagala et al, ICASSP 2023
- MonoBERT & PolyBERT: Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation - Z Ma et al, INTERSPEECH 2023
- MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization - JW Yoon et al, INTERSPEECH 2023
Speech + Text
- A general multi-task learning framework to leverage text data for speech to text tasks - Y Tang et al, ICASSP 2021
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training - A Bapna et al, arXiv 2021
- mSLAM: Massively multilingual joint pre-training for speech and text - A Bapna et al, arXiv 2022
- Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding - W Wang et al, INTERSPEECH 2022
- STPT: Unified Speech-Text Pre-training for Speech Translation and Recognition - Y Tang et al, ACL 2022
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data - Y Kang et al, AAAI 2022
- Distill-L2S: Distilling a Pretrained Language Model to a Multilingual ASR Model - K Choi et al, INTERSPEECH 2022
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training - Z Zhang et al, EMNLP 2022
- TESSP: Text-Enhanced Self-Supervised Speech Pre-training - Z Yao et al, arXiv 2022
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data - Z Zhang et al, arXiv 2022
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text - X Yue et al, ICASSP 2023
SSL for Audio
- BYOL-A: BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation - D Niizumi et al, IJCNN 2021
- Audio-MAE: Masked Autoencoders that Listen - PY Huang et al, NeurIPS 2022
- MAE-AST: Masked Autoencoding Audio Spectrogram Transformer - A Baade et al, INTERSPEECH 2022
- BEATs: Audio Pre-Training with Acoustic Tokenizers - S Chen et al, ICML 2023
- ATST: Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks - X Li et al, arXiv 2023
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer - W Chen et al, arXiv 2024
SSL for TTS
- Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks - R Eloff et al, INTERSPEECH 2019
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages - H Zhang et al, INTERSPEECH 2020
- Towards Unsupervised Speech Synthesis - AH Liu et al, NAACL 2022
SSL Model Distillation, Compression and Acceleration
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT - H Chang et al, ICASSP 2022
- FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning - Y Lee et al, INTERSPEECH 2022
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT - R Wang et al, INTERSPEECH 2022
- Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models - T Ashihara et al, INTERSPEECH 2022
- Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition - Y Wang et al, arXiv 2022
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning - G Yang et al, ASRU 2023
Resources
- Speech processing Universal PERformance Benchmark (SUPERB)
- Self-Supervised Speech Pre-training and Representation Learning (S3PRL)
Statistics
Training configurations and compute budgets for selected speech pre-training models.
wav2vec 2.0
Pre-training
Size | Transformer | Samples per GPU | Batch Size | Train Time |
---|---|---|---|---|
BASE | 12 blocks, model dimension 768, FFN 3072, 8 heads | 1.4M (cropped) | 1.6 h | 400k updates, 64 V100 × 1.6 days |
LARGE | 24 blocks, model dimension 1024, FFN 4096, 16 heads | 1.2M (cropped) | 2.7 h | 250k updates, 128 V100 × 2.3 days (Librispeech); 600k updates, 128 V100 × 5.2 days (LibriVox) |
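The columns in the table above can be cross-checked against each other with a little arithmetic. The sketch below is only a sanity check, not part of any training recipe. It assumes 16 kHz input audio (the table does not state the sampling rate) and assumes the Batch Size column is the total audio duration summed over all GPUs. The helper names `total_batch_hours` and `gpu_days` are made up for illustration.

```python
# Sanity-check the wav2vec 2.0 pre-training figures listed in the table above.
SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz input audio (not stated in the table)


def total_batch_hours(samples_per_gpu: float, num_gpus: int,
                      sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Hours of audio in one global batch, summed over all GPUs."""
    seconds_per_gpu = samples_per_gpu / sample_rate_hz
    return seconds_per_gpu * num_gpus / 3600.0


def gpu_days(num_gpus: int, days: float) -> float:
    """Total compute in GPU-days (number of GPUs times wall-clock days)."""
    return num_gpus * days


# (samples per GPU, number of V100 GPUs, wall-clock days), copied from the table
RUNS = {
    "BASE": (1.4e6, 64, 1.6),
    "LARGE (Librispeech)": (1.2e6, 128, 2.3),
    "LARGE (LibriVox)": (1.2e6, 128, 5.2),
}

for name, (samples, gpus, days) in RUNS.items():
    print(f"{name}: {total_batch_hours(samples, gpus):.2f} h per batch, "
          f"{gpu_days(gpus, days):.1f} GPU-days")
```

Under these assumptions the per-batch durations come out at about 1.56 h for BASE and 2.67 h for LARGE, which round to the 1.6 h and 2.7 h listed in the table.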
Fine-tuning
wav2vec-u
Method | Feature Extractor | Batch Size | Train Time |
---|---|---|---|
wav2vec-U | wav2vec 2.0 LARGE | 160 unlabeled audio + 160 text samples | 150k steps, single V100 × 12 h |
wav2vec-U + self training | wav2vec 2.0 LARGE | / | 80k updates, 8 V100 (Librispeech); 13k updates, 4 V100 (TIMIT) |
HuBERT
Pre-training
Size | Feature Extractor | Batch Size | Stage | Train Time |
---|---|---|---|---|
BASE | wav2vec 2.0 BASE (95M) | 87.5 s | 1: MFCC, 250k steps; 2: 6th transformer layer, 400k steps | 9.5 h/100k steps, 32 GPUs (Librispeech-960) |
LARGE | wav2vec 2.0 LARGE (317M) | 56.25 s | 3: 9th transformer layer from BASE HuBERT, 400k steps | 9.5 h/100k steps, 128 GPUs (Libri-light-60k) |
X-LARGE | Conformer XXL (964M) | 22.5 s | 3: 9th transformer layer from BASE HuBERT, 400k steps | 9.5 h/100k steps, 256 GPUs (Libri-light-60k) |
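To compare the three HuBERT rows on a common footing, the figures above can be turned into a global batch duration and a rough GPU-hour budget. This is a back-of-the-envelope sketch under two assumptions that the table does not spell out: that the Batch Size column is seconds of audio per GPU, and that the listed rate of 9.5 h per 100k steps applies to every stage of every row. The names `total_batch_seconds`, `train_hours` and `gpu_hours` are illustrative only.

```python
# Rough compute estimate for the HuBERT rows in the table above.
HOURS_PER_100K_STEPS = 9.5  # rate as listed in the Train Time column


def total_batch_seconds(seconds_per_gpu: float, num_gpus: int) -> float:
    """Audio seconds per global batch (assumes Batch Size is per GPU)."""
    return seconds_per_gpu * num_gpus


def train_hours(total_steps: int,
                hours_per_100k: float = HOURS_PER_100K_STEPS) -> float:
    """Wall-clock hours, assuming the listed rate holds for all stages."""
    return total_steps / 100_000 * hours_per_100k


def gpu_hours(total_steps: int, num_gpus: int) -> float:
    """Total compute in GPU-hours."""
    return train_hours(total_steps) * num_gpus


# (seconds of audio per GPU, number of GPUs, total steps), copied from the table
CONFIGS = {
    "BASE": (87.5, 32, 250_000 + 400_000),  # stage 1 + stage 2
    "LARGE": (56.25, 128, 400_000),         # stage 3 only
    "X-LARGE": (22.5, 256, 400_000),        # stage 3 only
}

for name, (secs, gpus, steps) in CONFIGS.items():
    print(f"{name}: {total_batch_seconds(secs, gpus) / 60:.1f} min per batch, "
          f"{train_hours(steps):.2f} h wall-clock, "
          f"{gpu_hours(steps, gpus):.0f} GPU-hours")
```

Under these assumptions BASE comes to 61.75 h on 32 GPUs (1976 GPU-hours), while LARGE and X-LARGE each take 38 h of wall-clock time on 128 and 256 GPUs respectively.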