Awesome Singing Voice Synthesis and Singing Voice Conversion

A list of papers and projects on cutting-edge Speech Synthesis, Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Singing Voice Conversion (SVC), and related work of interest (such as Music Synthesis, Automatic Music Transcription, Automatic MOS Prediction, and SSL-based ASR).

Pull requests are welcome; you can also contact me via email ([email protected]) to add or update papers and works.

Paper List

Journals

IEEE/ACM TASLP, IEEE JSTSP, JSLHR, IEEE TPAMI

Conferences

NeurIPS, ICLR, ICML, IJCAI, AAAI, ACL, NAACL, EMNLP, ISMIR, ICASSP, INTERSPEECH, ACM MM, ICME

Workshops

ASRU, SLT

Singing Voice Conversion (Other Key Words: SVC, Singing Style Transfer)

Improving Adversarial Waveform Generation based Singing Voice Conversion with Harmonic Signals | ICASSP 2022 | 🎧Demo
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher | INTERSPEECH 2022 | ✔️Code | 🎧Demo
A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion | INTERSPEECH 2022 | 🎧Demo
Controllable and Interpretable Singing Voice Decomposition via Assem-VC | NeurIPS 2021 Workshop | 🎧Demo
DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion | ASRU 2021 | 🎧Demo
FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation | ICME 2021 | 🎧Demo
Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach | 2021 | ✔️Code | 🎧Demo
Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding | 2021 | 🎧Demo
Zero-shot Singing Voice Conversion | ISMIR 2020 | 🎧Demo
PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network | ICASSP 2020 | 🎧Demo
DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System | INTERSPEECH 2020 | 🎧Demo
Unsupervised Cross-Domain Singing Voice Conversion | INTERSPEECH 2020 | 🎧Demo
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data | APSIPA 2020 | ✔️Code | 🎧Demo
Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training | 2020 | 🎧Demo | Unofficial Code

Dataset

Singing Technique Conversion

Zero-shot Singing Technique Conversion | CMMR 2021

Voice Conversion (Other Key Words: VC, Voice Cloning, Voice Style Transfer)

End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions | 2022 | 🎧Demo
A Comparative Study of Self-supervised Speech Representation Based Voice Conversion | IEEE JSTSP 2022
Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme | ICLR 2022 | ✔️Code | 🎧Demo
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone | ICML 2022 | ✔️Code | 🎧Demo | 🎧Demo | 📝Blog
S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations | ICASSP 2022 | ✔️Code
A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion | ICASSP 2022 | ✔️Code | 🎧Demo
Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques | ICASSP 2022 | ✔️Code | 🎧Demo
NVC-Net: End-to-End Adversarial Voice Conversion | ICASSP 2022 | ✔️Code | 🎧Demo
Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion | ICASSP 2022 | 🎧Demo
Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features | ICASSP 2022 | 🎧Demo
Toward Degradation-Robust Voice Conversion | ICASSP 2022
DGC-vector: A new speaker embedding for zero-shot voice conversion | ICASSP 2022 | 🎧Demo
Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers | INTERSPEECH 2022 | 🎧Demo
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion | INTERSPEECH 2022 | 🎧Demo
Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling | IEEE/ACM TASLP 2021 | ✔️Code | 🎧Demo
Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations | NeurIPS 2021 | 🎧Demo | Unofficial Code
Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning | ICLR 2021
Global Rhythm Style Transfer Without Text Transcriptions | ICML 2021 | ✔️Code
AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization | ICASSP 2021 | ✔️Code | 🎧Demo
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion | INTERSPEECH 2021 Best Paper Award | ✔️Code | 🎧Demo
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations | INTERSPEECH 2021 | ✔️Code | 🎧Demo
Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder | INTERSPEECH 2021 | ✔️Code | 🎧Demo
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations | INTERSPEECH 2021 | 🎧Demo
On Prosody Modeling for ASR+TTS based Voice Conversion | ASRU 2021 | 🎧Demo
MediumVC: Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features | 2021 | ✔️Code | 🎧Demo
An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning | IEEE/ACM TASLP 2020
Unsupervised Speech Decomposition via Triple Information Bottleneck | ICML 2020 | ✔️Code
AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss | ICML 2019 | ✔️Code | 🎧Demo
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization | INTERSPEECH 2019 | ✔️Code

Dataset

CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit | 2019 | 🔽Apply&Download

Emotional Voice Conversion

Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion | INTERSPEECH 2022 | 🎧Demo
Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis | INTERSPEECH 2022 | 🎧Demo
Emotion Intensity and its Control for Emotional Voice Conversion | IEEE Transactions on Affective Computing | ✔️Code | 🎧Demo
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training | INTERSPEECH 2021 | ✔️Code | 🎧Demo
Textless Speech Emotion Conversion using Discrete and Decomposed Representations | 2021 | 🎧Demo
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion | INTERSPEECH 2020 | ✔️Code | 🎧Demo
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data | Odyssey 2020 | ✔️Code | 🎧Demo

Dataset

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset | ICASSP 2021 | 🔽Apply&Download | 🎧Demo

Singing Voice Synthesis (Other Key Words: SVS)

WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training | 2022 | 🎧Demo
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism | AAAI 2022 | ✔️Code | 🎧Demo
Learning the Beauty in Songs: Neural Singing Voice Beautifier | ACL 2022 | ✔️Code | 🎧Demo
Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis | INTERSPEECH 2022 | ✔️Code
SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy | INTERSPEECH 2022 | ✔️Code
WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses | INTERSPEECH 2022 | 🎧Demo
Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System | IEEE/ACM TASLP 2021 | ✔️Code
HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis | 2020 | 🎧Demo

Dataset

M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus | NeurIPS 2022 | 🔽Apply&Download | 🎧Demo
PopCS | AAAI 2022 | 🔽Apply&Download
Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis | INTERSPEECH 2022 | 🔽Apply&Download

High-Quality Speech Synthesis (Other Key Words: Text-to-Speech, TTS)

BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis | ICLR 2022 | ✔️Code | 🎧Demo
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | IJCAI 2022 | ✔️Code | 🎧Demo

Vocoder

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus | ACM MM 2021 | 🔽Apply&Download | ✔️Code | 🎧Demo
Towards achieving robust universal neural vocoding | INTERSPEECH 2019 | ✔️Code | 🎧Demo | Unofficial Code

Music Synthesis

Multi-instrument Music Synthesis with Spectrogram Diffusion | ISMIR 2022 | ✔️Code | 🎧Demo
Musika! Fast Infinite Waveform Music Generation | ISMIR 2022 | ✔️Code | 🎧Demo

Automatic Music Transcription

MT3: Multi-Task Multitrack Music Transcription | ICLR 2022 | ✔️Code
Omnizart: A General Toolbox for Automatic Music Transcription | Journal of Open Source Software (JOSS) 2021 | ✔️Code | 🎧Demo

Self-supervised/Unsupervised ASR

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | IEEE JSTSP 2022 | ✔️Code | ✔️Code
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training | ICASSP 2022 | ✔️Code | ✔️Code
Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition | ICASSP 2022 | ✔️Code | ✔️Code
Pseudo-Labeling for Massively Multilingual Speech Recognition | ICASSP 2022 | ✔️Code | ✔️Code
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | IEEE/ACM TASLP 2021 | ✔️Code | ✔️Code
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data | ICML 2021 | ✔️Code | ✔️Code | ✔️Code
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale | 2021 | ✔️Code | ✔️Code
Simple and Effective Zero-shot Cross-lingual Phoneme Recognition | 2021 | ✔️Code | ✔️Code
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | IEEE/ACM TASLP 2020 | ✔️Code
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | NeurIPS 2020 | ✔️Code | ✔️Code
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | ICLR 2020 | ✔️Code | ✔️Code
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders | ICASSP 2020 | ✔️Code
fairseq S2T: Fast Speech-to-Text Modeling with fairseq | AACL 2020 | ✔️Code | ✔️Code
Unsupervised Cross-lingual Representation Learning for Speech Recognition | 2020 | ✔️Code | ✔️Code
Representation Learning with Contrastive Predictive Coding | 2019 | ✔️Code

Automatic MOS Prediction

The VoiceMOS Challenge 2022 | INTERSPEECH 2022
Utilizing Self-supervised Representations for MOS Prediction | INTERSPEECH 2021 | ✔️Code

Speech Data Augmentation

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain | SLT 2021 | ✔️Code

Speech Insertion

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion | INTERSPEECH 2022 | 🎧Demo

Prosody-Aware

Text-Free Prosody-Aware Generative Spoken Language Modeling | ACL 2022 | ✔️Code | 🎧Demo

Adversarial Attack

Defending Your Voice: Adversarial Attack on Voice Conversion | SLT 2021 | ✔️Code | 🎧Demo
