ParallelWaveGAN
voice adaptation
I have 50 recorded sentences from one speaker. Could I train a WaveGAN model with those 50 sentences so that the generated voice adapts to that speaker?
In my experiments, the adaptation of PWG with 100 utterances works well. Maybe 50 utterances will work as well.
Should I retrain PWG starting from the base model, or train a new PWG model from scratch with the 50 sentences?
Adaptation means retraining from the pretrained model.
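In practice this means the pretrained weights are loaded first and training then continues on the new speaker's data (the recipe's --pretrain option does this). A rough conceptual sketch, assuming the usual parallel_wavegan checkpoint layout (the filename and the ["model"]["generator"] key are assumptions, not taken from this thread):

```python
# Conceptual sketch only, not the actual ParallelWaveGAN trainer: adaptation
# loads the pretrained generator weights, then keeps training on the small
# target-speaker dataset.
import torch
from parallel_wavegan.models import ParallelWaveGANGenerator

generator = ParallelWaveGANGenerator()
state = torch.load("checkpoint-400000steps.pkl", map_location="cpu")  # pretrained PWG (assumed filename)
generator.load_state_dict(state["model"]["generator"])  # assumed checkpoint layout
# ...continue the usual GAN training loop on the 50-utterance dataset...
```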
Is Multi-band MelGAN more suitable than PWG for small data like 50 sentences?
PWG is easier in the case of the small dataset. See discussion #171.
OK, thank you. Do I need to first train a base model with MB-MelGAN + PWG discriminator, and then fine-tune from it?
I already have an MB-MelGAN base model.
See https://github.com/kan-bayashi/ParallelWaveGAN/issues/171#issuecomment-676765007.
I tried fine-tuning the base PWG model with 70 sentences, extending the steps from the default 400000 to 405000, but the result sounds like the original model. Any idea?
Maybe I still need to fine-tune the TTS model as well?
Please follow the steps:
- Make your own recipe by following https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#how-to-make-the-recipe-for-your-own-dateset
- Change the config for the adaptation as follows:
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 50000 # Number of training steps.
- Run the recipe with the --pretrain option
Hi @kan-bayashi, I have done everything following your guide, but the result is the same as the original model.
- I used the csmsc PWG conf to train the base model and a modified PWG conf for the fine-tuned model, per your guide:
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 50000 # Number of training steps.
- I used the --pretrain option like this: ./run.sh --pretrain "../../chinese_man/voc1/exp/train_nodev_csmsc_parallel_wavegan.v1/checkpoint-400000steps.pkl"
- After training finished, I got the fine-tuned model: checkpoint='/data/ParallelWaveGAN/egs/chinese_man_vc/voc1/exp/train_nodev_parallel_wavegan.v1_train_nodev_csmsc_parallel_wavegan.v1/checkpoint-50000steps.pkl'
- I got 2 wave files:
  - first wave, created by the base model: https://drive.google.com/file/d/1vtIMHOaOTzTZ72hlKTc9sKU6pEom_xAc/view?usp=sharing
  - second wave, created by the fine-tuned model: https://drive.google.com/file/d/1GhwP68UJ1DzyMbCfExHTDvYhRM7tFQD5/view?usp=sharing
- Here is my inference method:
import time

import torch
from scipy.io import wavfile
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import load_model

# Text2Mel model (Conformer FastSpeech2 trained on the chinese_man corpus)
text2speech = Text2Speech("/data/espnet/egs2/chinese_man/tts1/exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/config.yaml",
                          "/data/espnet/egs2/chinese_man/tts1/exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/train.loss.best.pth",
                          speed_control_alpha=1.0)
text = '在搜查过程中,美军和艾迪德派民兵有轻微交火'
text2speech.spc2wav = None  # Disable Griffin-Lim; the PWG vocoder is used instead

# PWG vocoder
vocoder = load_model(checkpoint='/data/ParallelWaveGAN/egs/chinese_man/voc1/exp/train_nodev_csmsc_parallel_wavegan.v1/checkpoint-400000steps.pkl',
                     ).to("cpu").eval()
vocoder.remove_weight_norm()

with torch.no_grad():
    start = time.time()
    wav, c, *_ = text2speech(text_preprocess(text))  # text_preprocess is my own normalization helper
    wav = vocoder.inference(c)  # synthesize the waveform from the predicted mel
    wav = wav.view(-1).cpu().numpy()

fs = 24000
wavfile.write('newname2.wav', fs, wav)
- And my fine-tuned model config.yml is:
allow_cache: true
batch_max_steps: 25500
batch_size: 6
config: conf/parallel_wavegan.v1.yaml
dev_dumpdir: dump/dev/norm
dev_feats_scp: null
dev_segments: null
dev_wav_scp: null
discriminator_grad_norm: 1
discriminator_optimizer_params:
  eps: 1.0e-06
  lr: 5.0e-05
  weight_decay: 0.0
discriminator_params:
  bias: true
  conv_channels: 64
  in_channels: 1
  kernel_size: 3
  layers: 10
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.2
  out_channels: 1
  use_weight_norm: true
discriminator_scheduler_params:
  gamma: 0.5
  step_size: 200000
discriminator_train_start_steps: 0
distributed: false
eval_interval_steps: 1000
fft_size: 2048
fmax: 7600
fmin: 80
format: hdf5
generator_grad_norm: 10
generator_optimizer_params:
  eps: 1.0e-06
  lr: 0.0001
  weight_decay: 0.0
generator_params:
  aux_channels: 80
  aux_context_window: 2
  dropout: 0.0
  gate_channels: 128
  in_channels: 1
  kernel_size: 3
  layers: 30
  out_channels: 1
  residual_channels: 64
  skip_channels: 64
  stacks: 3
  upsample_net: ConvInUpsampleNetwork
  upsample_params:
    upsample_scales:
    - 4
    - 5
    - 3
    - 5
  use_weight_norm: true
generator_scheduler_params:
  gamma: 0.5
  step_size: 200000
global_gain_scale: 1.0
hop_size: 300
lambda_adv: 4.0
log_interval_steps: 100
num_mels: 80
num_save_intermediate_results: 4
num_workers: 2
outdir: exp/train_nodev_parallel_wavegan.v1_train_nodev_csmsc_parallel_wavegan.v1
pin_memory: true
pretrain: ../../chinese_man/voc1/exp/train_nodev_csmsc_parallel_wavegan.v1/checkpoint-400000steps.pkl
rank: 0
remove_short_samples: true
resume: ''
sampling_rate: 24000
save_interval_steps: 1000
stft_loss_params:
  fft_sizes:
  - 1024
  - 2048
  - 512
  hop_sizes:
  - 120
  - 240
  - 50
  win_lengths:
  - 600
  - 1200
  - 240
  window: hann_window
train_dumpdir: dump/train_nodev/norm
train_feats_scp: null
train_max_steps: 50000
train_segments: null
train_wav_scp: null
trim_frame_size: 2048
trim_hop_size: 512
trim_silence: false
trim_threshold_in_db: 60
verbose: 1
version: 0.4.8
win_length: 1200
window: hann
What should I do next?
Thanks
I cannot listen to the second sample. Did you perform adaptation of Text2Mel model as well?
Sorry, I've now granted permission on the second sample; you can try opening it again.
"Did you perform adaptation of Text2Mel model as well?" -- do you mean the TTS model, as in this? No, I haven't done that yet:
text2speech = Text2Speech("/data/espnet/egs2/chinese_man/tts1/exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/config.yaml",
"/data/espnet/egs2/chinese_man/tts1/exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/train.loss.best.pth",
speed_control_alpha=1.0)
I use conformer_fastspeech2 for the TTS model, but I have no idea how to adapt it, because of these steps:
FastSpeech2 training
The procedure is almost the same as FastSpeech but we MUST use teacher forcing in decoding.
$ ./run.sh --stage 7 \
--tts_exp exp/tts_train_raw_phn_tacotron_g2p_en_no_space \
--inference_args "--use_teacher_forcing true" \
--test_sets "tr_no_dev dev eval1"
How do I generate the new data for the additional features (F0 and energy) and reuse the previous dictionary?
First, you need to clarify your problem. Please check the following points:
- Basically, the speaker characteristics are decided by the text2mel model. Therefore, if you want to change the speaker of the TTS model, you need to perform an adaptation of the text2mel model.
- Please check the quality of an analysis-synthesis sample (feed a natural mel to the vocoder) produced with the adapted vocoder; see the sketch after this list. If the quality is OK, the vocoder adaptation works well.
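A minimal sketch of that analysis-synthesis check, assuming the dump-directory layout from the config above (the utterance filename and the "feats" HDF5 key are assumptions); load_model and read_hdf5 come from parallel_wavegan.utils:

```python
# Analysis-synthesis check: vocode a natural (ground-truth) mel with the
# adapted vocoder. If this already sounds like the target speaker, the
# vocoder adaptation itself is working.
import torch
from scipy.io import wavfile
from parallel_wavegan.utils import load_model, read_hdf5

vocoder = load_model(
    "/data/ParallelWaveGAN/egs/chinese_man_vc/voc1/exp/"
    "train_nodev_parallel_wavegan.v1_train_nodev_csmsc_parallel_wavegan.v1/"
    "checkpoint-50000steps.pkl"  # adapted vocoder checkpoint (path from this thread)
).eval()
vocoder.remove_weight_norm()

# Normalized natural mel from the adaptation dev set (filename is hypothetical).
mel = torch.from_numpy(read_hdf5("dump/dev/norm/some_utt.h5", "feats"))
with torch.no_grad():
    wav = vocoder.inference(mel).view(-1).cpu().numpy()
wavfile.write("analysis_synthesis.wav", 24000, wav)
```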
The text2mel adaptation is a little bit complicated, but you can do it:
- Perform adaptation of the teacher model (in your case, Tacotron2). See https://github.com/espnet/espnet/blob/master/egs2/jvs/tts1/README.md
- Calculate duration, f0, and energy using the teacher model
- Perform adaptation of the FastSpeech2
Text2mel adaptation was exactly my question; I will try it. Thanks.
I got an error when following the jvs guide:
tts_train.py: error: unrecognized arguments: --init_param /data/espnet/egs2/chinese_man/tts1/exp/tts_train_raw_phn_pypinyin_g2p_phone/200epoch.pth
but I solved it by using the "--pretrain_path" param like this:
--pretrain_path /data/espnet/egs2/chinese_man/tts1/exp/tts_train_raw_phn_pypinyin_g2p_phone/200epoch.pth --pretrain_key null
Maybe you want to update the README?
No. The option has been updated; your ESPnet version is old.
I have done everything, and I get the correct result:
https://drive.google.com/file/d/1GhwP68UJ1DzyMbCfExHTDvYhRM7tFQD5/view?usp=sharing
Thank you very much.
How do I create a voice with style transfer? How do I choose a model? Any suggestions with ESPnet?
"how to create the voice of style transfer?"
What do you mean by "the voice of style transfer"?
The early papers: https://arxiv.org/pdf/1710.11385.pdf and code: https://github.com/inzva/Audio-Style-Transfer
In ESPnet, we have GST. You can try it: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#supported-models
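As a rough illustration only (not taken from this thread): with a GST-trained ESPnet2 model, a reference utterance supplies the style embedding at inference time. The paths below are hypothetical and the exact argument names and return format can differ between ESPnet versions:

```python
# Hedged sketch: style transfer with a GST-based ESPnet2 TTS model.
# The reference waveform provides the global style token embedding.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Hypothetical paths to a TTS model trained with GST enabled.
text2speech = Text2Speech("exp/tts_gst/config.yaml", "exp/tts_gst/train.loss.best.pth")

# Utterance whose style should be copied; it should match the model's sampling rate.
ref_wav, fs = sf.read("reference_style.wav")
wav, c, *_ = text2speech("在搜查过程中,美军和艾迪德派民兵有轻微交火", speech=ref_wav)
# c can then be fed to the (adapted) PWG vocoder as in the snippet above.
```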
OK, thanks a lot.
Hi @kan-bayashi, I tried the following:
- Made my own recipe by following https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#how-to-make-the-recipe-for-your-own-dateset
- Changed the config for the adaptation as follows:
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 50000 # Number of training steps.
Then I pretrained PWG v1 for 400k steps on my own recording of 50% of the LJ dataset and fine-tuned it for 200k more steps.
The problems I am facing are:
- All my losses are oscillating. Can you show me what the loss curves should look like for good results?
- The generated voice is the same as the input voice. When I try this vocoder on new audio, again the output is the same as the input.
What should I do?