TransferLearning-CLVC
Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion
Implementation of "Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion."
We provide our pretrained monolingual uni-directional acoustic model (in the ppg/ directory) and speaker encoder (in the spk_embedder/ directory) so that our multispeaker VC model can be reproduced. These pretrained models may not give the best possible results, but they are good enough for reproduction.
All the VC data come from the Voice Conversion Challenge 2020, and all the generated speech was submitted to the challenge for listening review, covering both the intra-lingual and cross-lingual VC tasks.
Audio samples of our best model can be found here. For more details, please refer to our paper.
- Python 3.6
- PyTorch 1.1
- librosa
- h5py
- scipy
- tensorboardX
- apex
- Clone this repository.
- Access data from VCC 2020. Inside the "vcc2020_training" folder there should be 14 speakers, and in the "vcc2020_evaluation" folder there should be 4 source speakers.
- Prepare training data for Waveglow vocoder.
python --mode 0 -vcc "path_to_vcc2020_training"
This generates an h5 file that holds the concatenated speech of each speaker.
- Prepare training data for the conversion model.
python --mode 1
This converts the speech into input features, d-vectors, and mel-spectrograms.
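Before running the preparation steps, it can help to confirm that the data folders contain the expected number of speakers. The following is a minimal sketch, not part of this repository: it assumes that each speaker has its own sub-directory directly inside "vcc2020_training" and "vcc2020_evaluation", and the function names `list_speaker_dirs` and `check_vcc_layout` are hypothetical.

```python
from pathlib import Path


def list_speaker_dirs(root):
    """Return the sorted names of the immediate sub-directories of `root`.

    Assumption: one sub-directory per speaker.
    """
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def check_vcc_layout(training_dir, evaluation_dir,
                     n_training=14, n_evaluation=4):
    """Raise ValueError if a folder does not hold the expected speaker count.

    The defaults (14 training speakers, 4 source speakers for evaluation)
    are the counts stated in this README.
    """
    checks = (
        ("training", training_dir, n_training),
        ("evaluation", evaluation_dir, n_evaluation),
    )
    for label, folder, expected in checks:
        speakers = list_speaker_dirs(folder)
        if len(speakers) != expected:
            raise ValueError(
                "{} folder {!r}: expected {} speakers, found {}".format(
                    label, str(folder), expected, len(speakers)
                )
            )
    return True
```

Calling `check_vcc_layout("path_to_vcc2020_training", "path_to_vcc2020_evaluation")` returns `True` when both counts match and raises a `ValueError` naming the offending folder otherwise.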
Training Waveglow vocoder
- (Optional) Modify the hyperparameters in config_24k.json.
- Run the training script
python -c config_24k.json
Training takes a few days. Please be patient.
Training the conversion model
Modify common/ to set your desired checkpoint directory and hyperparameters. Be aware that "n_symbols" can only be 72 or 514, depending on which input feature you want to use.
Run the training script
Ideally, training takes a few days. We stopped somewhere between the 30k and 50k checkpoints.
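Because "n_symbols" accepts only two values, a wrong setting is easy to catch before a multi-day training run starts. The snippet below is a sketch of such a guard, not code from this repository: the names `ALLOWED_N_SYMBOLS` and `validate_n_symbols` are hypothetical, and the mapping from each value to a specific input feature is deliberately left out because it is not stated here.

```python
# The two values this README lists as valid for "n_symbols".
ALLOWED_N_SYMBOLS = (72, 514)


def validate_n_symbols(n_symbols):
    """Return `n_symbols` unchanged if it is valid, else raise ValueError."""
    if n_symbols not in ALLOWED_N_SYMBOLS:
        raise ValueError(
            "n_symbols must be one of {}, got {!r}".format(
                ALLOWED_N_SYMBOLS, n_symbols
            )
        )
    return n_symbols
```

Calling `validate_n_symbols(...)` on the configured value right after the hyperparameters are loaded fails fast on a typo such as 512.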
- Run the testing script
python -vcc "path_to_vcc2020_evaluation" -ch "checkpoint_of_conversion_model" -m "ppg_model_you_used" -wg "waveglow_checkpoint" -o "vcc2020_evaluation/output_directory/"
The converted wav files are written to the output directory and named in the format "target_source_wavname.wav".
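To locate a particular conversion among the outputs, the naming format above can be reproduced with a small helper. This is an illustrative sketch only: the function `converted_wav_name` is hypothetical, it assumes that "wavname" means the source file name without its extension, and the speaker and utterance names in the usage line are examples.

```python
from pathlib import Path


def converted_wav_name(target, source, wav_path):
    """Build an output file name in the "target_source_wavname.wav" format.

    Assumption: "wavname" is the source file name without its extension.
    """
    stem = Path(wav_path).stem
    return "{}_{}_{}.wav".format(target, source, stem)


# Example with illustrative speaker and utterance names.
print(converted_wav_name("TEF1", "SEF1", "E30001.wav"))  # TEF1_SEF1_E30001.wav
```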
- guanlongzhao's fac-via-ppg
- NVIDIA's Waveglow and Tacotron2
- pytorch's audio