
voice adaptation

Open wac81 opened this issue 4 years ago • 23 comments

I have 50 recorded sentences from one speaker. Could I train a WaveGAN model with those 50 sentences so that the synthesized voice adapts to that speaker?

wac81 avatar Nov 06 '20 08:11 wac81

In my experiments, the adaptation of PWG with 100 utterances works well. Maybe 50 utterances will work as well.

kan-bayashi avatar Nov 06 '20 12:11 kan-bayashi

Should I retrain PWG from the base model, or train a new PWG model from scratch with the 50 sentences?

wac81 avatar Nov 12 '20 02:11 wac81

Adaptation means retraining from the pretrained model.
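
To make this concrete, here is a minimal sketch of what initializing from a pretrained checkpoint looks like; paths are hypothetical, and in the recipes the --pretrain option (described below) does this for you:

# Minimal sketch, assuming hypothetical paths: build the generator from the
# base model's config and load the pretrained weights before continuing
# training on the small adaptation set.
import torch
import yaml
from parallel_wavegan.models import ParallelWaveGANGenerator

with open("conf/parallel_wavegan.v1.yaml") as f:
    config = yaml.safe_load(f)

generator = ParallelWaveGANGenerator(**config["generator_params"])

# Checkpoints saved by this repo store weights under state["model"]["generator"].
state = torch.load("exp/base/checkpoint-400000steps.pkl", map_location="cpu")
generator.load_state_dict(state["model"]["generator"])
# ...then run the usual training loop on the adaptation data.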

kan-bayashi avatar Nov 12 '20 02:11 kan-bayashi

Is Multi-band MelGAN more suitable than PWG for small data like 50 sentences?

wac81 avatar Nov 12 '20 02:11 wac81

PWG is easier in the case of the small dataset. See discussion #171.

kan-bayashi avatar Nov 12 '20 03:11 kan-bayashi

OK, thank you. Must I first train a model with MB-MelGAN + the PWG discriminator as the base model, and then fine-tune it?

I already have an MB-MelGAN base model.

wac81 avatar Nov 19 '20 09:11 wac81

See https://github.com/kan-bayashi/ParallelWaveGAN/issues/171#issuecomment-676765007.

kan-bayashi avatar Nov 20 '20 07:11 kan-bayashi

I tried fine-tuning PWG from the base PWG model with 70 sentences, extending the training steps from the default 400000 to 405000, but the result sounds like the original model. Any idea?

wac81 avatar Nov 25 '20 09:11 wac81

Maybe I still need to fine-tune the TTS model as well?

wac81 avatar Nov 25 '20 09:11 wac81

Please follow these steps:

  1. Make your own recipe by following https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#how-to-make-the-recipe-for-your-own-dateset
  2. Change the config for the adaptation as follows:

     discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
     train_max_steps: 50000             # Number of training steps.

  3. Run the recipe with the --pretrain option.

kan-bayashi avatar Nov 26 '20 10:11 kan-bayashi

Hi @kan-bayashi, I have done everything following your guide, but the result is the same as with the original model.

  1. Used the CSMSC PWG config to train the base model, and the modified PWG config from your guide for the fine-tuned model:

     discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
     train_max_steps: 50000             # Number of training steps.

  2. Used the --pretrain option like this: ./run.sh --pretrain "../../chinese_man/voc1/exp/train_nodev_csmsc_parallel_wavegan.v1/checkpoint-400000steps.pkl"
  3. Got the fine-tuned model after training finished: checkpoint='/data/ParallelWaveGAN/egs/chinese_man_vc/voc1/exp/train_nodev_parallel_wavegan.v1_train_nodev_csmsc_parallel_wavegan.v1/checkpoint-50000steps.pkl'
  4. Got two wave files:
     first wave, created by the base model: https://drive.google.com/file/d/1vtIMHOaOTzTZ72hlKTc9sKU6pEom_xAc/view?usp=sharing
     second wave, created by the fine-tuned model: https://drive.google.com/file/d/1GhwP68UJ1DzyMbCfExHTDvYhRM7tFQD5/view?usp=sharing
  5. Here is my inference method:
import time

import torch
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import load_model
from scipy.io import wavfile

# Text2Mel model (Conformer FastSpeech2)
text2speech = Text2Speech("/data/espnet/egs2/chinese_man/tts1/exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/config.yaml",
                          "/data/espnet/egs2/chinese_man/tts1/exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/train.loss.best.pth",
                          speed_control_alpha=1.0)
text2speech.spc2wav = None  # disable Griffin-Lim; the neural vocoder is used instead

text = '在搜查过程中,美军和艾迪德派民兵有轻微交火'

# PWG vocoder checkpoint
vocoder = load_model(checkpoint='/data/ParallelWaveGAN/egs/chinese_man/voc1/exp/train_nodev_csmsc_parallel_wavegan.v1/checkpoint-400000steps.pkl').to("cpu").eval()
vocoder.remove_weight_norm()

with torch.no_grad():
    start = time.time()
    # text_preprocess is my own text normalization helper
    wav, c, *_ = text2speech(text_preprocess(text))
    wav = vocoder.inference(c)  # synthesize the waveform from the predicted mel
wav = wav.view(-1).cpu().numpy()

fs = 24000
wavfile.write('newname2.wav', fs, wav)
  6. And my fine-tuned model's config.yml is:
allow_cache: true
batch_max_steps: 25500
batch_size: 6
config: conf/parallel_wavegan.v1.yaml
dev_dumpdir: dump/dev/norm
dev_feats_scp: null
dev_segments: null
dev_wav_scp: null
discriminator_grad_norm: 1
discriminator_optimizer_params:
  eps: 1.0e-06
  lr: 5.0e-05
  weight_decay: 0.0
discriminator_params:
  bias: true
  conv_channels: 64
  in_channels: 1
  kernel_size: 3
  layers: 10
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.2
  out_channels: 1
  use_weight_norm: true
discriminator_scheduler_params:
  gamma: 0.5
  step_size: 200000
discriminator_train_start_steps: 0
distributed: false
eval_interval_steps: 1000
fft_size: 2048
fmax: 7600
fmin: 80
format: hdf5
generator_grad_norm: 10
generator_optimizer_params:
  eps: 1.0e-06
  lr: 0.0001
  weight_decay: 0.0
generator_params:
  aux_channels: 80
  aux_context_window: 2
  dropout: 0.0
  gate_channels: 128
  in_channels: 1
  kernel_size: 3
  layers: 30
  out_channels: 1
  residual_channels: 64
  skip_channels: 64
  stacks: 3
  upsample_net: ConvInUpsampleNetwork
  upsample_params:
    upsample_scales:
    - 4
    - 5
    - 3
    - 5
  use_weight_norm: true
generator_scheduler_params:
  gamma: 0.5
  step_size: 200000
global_gain_scale: 1.0
hop_size: 300
lambda_adv: 4.0
log_interval_steps: 100
num_mels: 80
num_save_intermediate_results: 4
num_workers: 2
outdir: exp/train_nodev_parallel_wavegan.v1_train_nodev_csmsc_parallel_wavegan.v1
pin_memory: true
pretrain: ../../chinese_man/voc1/exp/train_nodev_csmsc_parallel_wavegan.v1/checkpoint-400000steps.pkl
rank: 0
remove_short_samples: true
resume: ''
sampling_rate: 24000
save_interval_steps: 1000
stft_loss_params:
  fft_sizes:
  - 1024
  - 2048
  - 512
  hop_sizes:
  - 120
  - 240
  - 50
  win_lengths:
  - 600
  - 1200
  - 240
  window: hann_window
train_dumpdir: dump/train_nodev/norm
train_feats_scp: null
train_max_steps: 50000
train_segments: null
train_wav_scp: null
trim_frame_size: 2048
trim_hop_size: 512
trim_silence: false
trim_threshold_in_db: 60
verbose: 1
version: 0.4.8
win_length: 1200
window: hann

What should I do next?

Thanks

wac81 avatar Nov 28 '20 14:11 wac81

I cannot listen to the second sample. Did you perform adaptation of Text2Mel model as well?

kan-bayashi avatar Nov 28 '20 14:11 kan-bayashi

Sorry, I've now given the second sample permission; you can try opening it again.

By "adaptation of the Text2Mel model", do you mean the TTS model, as in the following? No, I haven't done that yet:

text2speech = Text2Speech("/data/espnet/egs2/chinese_man/tts1/exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/config.yaml", 
                          "/data/espnet/egs2/chinese_man/tts1/exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/train.loss.best.pth",
                          speed_control_alpha=1.0)

I use conformer_fastspeech2 for the TTS model, but I have no idea how to adapt it, because of these steps in the docs:

FastSpeech2 training
The procedure is almost the same as FastSpeech but we MUST use teacher forcing in decoding.

$ ./run.sh --stage 7 \
    --tts_exp exp/tts_train_raw_phn_tacotron_g2p_en_no_space \
    --inference_args "--use_teacher_forcing true" \
    --test_sets "tr_no_dev dev eval1"

How do I generate the new data for the additional features (F0 and energy) and reuse the previous dictionary?

wac81 avatar Nov 28 '20 15:11 wac81

First, you need to clarify your problem. Please check the following points:

  • Basically, the speaker characteristics are determined by the text2mel model. Therefore, if you want to change the speaker of the TTS model, you need to perform an adaptation of the text2mel model.
  • Please check the quality of an analysis-synthesis sample (use a natural mel as the vocoder input) with the adapted vocoder. If the quality is OK, the vocoder adaptation works well. A sketch of this check follows this list.
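
For reference, a minimal sketch of such an analysis-synthesis check, assuming the feature settings from the config posted above (24 kHz, FFT 2048, hop 300, 80 log-mels) and hypothetical file paths; the mel extraction and stats file should match what the recipe actually produced:

# Minimal sketch of an analysis-synthesis check, assuming hypothetical paths
# and the feature settings from the config above.
import librosa
import numpy as np
import torch
from parallel_wavegan.utils import load_model, read_hdf5

# 1. Extract a log-mel spectrogram from a natural recording of the target speaker.
wav, fs = librosa.load("natural_sample.wav", sr=24000)
mel = librosa.feature.melspectrogram(
    y=wav, sr=fs, n_fft=2048, hop_length=300, win_length=1200,
    window="hann", n_mels=80, fmin=80, fmax=7600, power=1.0)
logmel = np.log10(np.maximum(mel, 1e-10)).T  # (frames, 80)

# 2. Normalize with the statistics computed during feature extraction
#    (stats path is hypothetical).
mean = read_hdf5("dump/train_nodev/stats.h5", "mean")
scale = read_hdf5("dump/train_nodev/stats.h5", "scale")
logmel = (logmel - mean) / scale

# 3. Synthesize with the adapted vocoder. If the result sounds like the
#    target speaker, the vocoder adaptation itself works.
vocoder = load_model("exp/finetune/checkpoint-50000steps.pkl").to("cpu").eval()
vocoder.remove_weight_norm()
with torch.no_grad():
    y = vocoder.inference(torch.from_numpy(logmel).float())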

For the text2mel adaptation, it is a little complicated, but you can do it:

  1. Perform adaptation of the teacher model (in your case, Tacotron2). See https://github.com/espnet/espnet/blob/master/egs2/jvs/tts1/README.md
  2. Calculate duration, F0, and energy using the teacher model (see the duration sketch after this list).
  3. Perform adaptation of FastSpeech2.
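
As an illustration of step 2 (the recipes automate this during teacher-forced decoding), durations can be derived from the teacher model's attention weights by assigning each output frame to its most-attended input token; F0 and energy are extracted from the natural speech itself. A hedged sketch:

# Illustrative sketch only: derive per-token durations from a teacher
# model's attention matrix att_ws of shape (output_frames, input_tokens).
import torch

def durations_from_attention(att_ws: torch.Tensor) -> torch.Tensor:
    assigned = att_ws.argmax(dim=-1)           # most-attended token per frame
    n_tokens = att_ws.size(-1)
    return torch.bincount(assigned, minlength=n_tokens)  # frames per token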

kan-bayashi avatar Nov 29 '20 01:11 kan-bayashi

Text2mel adaptation was exactly my question. I will try it, thanks.

wac81 avatar Nov 30 '20 06:11 wac81

I got an error following the jvs guide: tts_train.py: error: unrecognized arguments: --init_param /data/espnet/egs2/chinese_man/tts1/exp/tts_train_raw_phn_pypinyin_g2p_phone/200epoch.pth. But I solved it by using the "--pretrain_path" option instead, like this: --pretrain_path /data/espnet/egs2/chinese_man/tts1/exp/tts_train_raw_phn_pypinyin_g2p_phone/200epoch.pth --pretrain_key null

Maybe you want to update the README?

wac81 avatar Nov 30 '20 07:11 wac81

No. The option was updated; your version is old.

kan-bayashi avatar Nov 30 '20 08:11 kan-bayashi

I have done everything, and I got the correct result: https://drive.google.com/file/d/1GhwP68UJ1DzyMbCfExHTDvYhRM7tFQD5/view?usp=sharing

Thank you a lot.

How do I create a voice with style transfer? How do I choose a model? Any suggestions with ESPnet?

wac81 avatar Dec 02 '20 08:12 wac81

How do I create a voice with style transfer?

What do you mean by voice style transfer?

kan-bayashi avatar Dec 04 '20 12:12 kan-bayashi

Early papers such as https://arxiv.org/pdf/1710.11385.pdf, with code at https://github.com/inzva/Audio-Style-Transfer

wac81 avatar Dec 07 '20 13:12 wac81

In ESPnet, we have GST. You can try it: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#supported-models. See the sketch below.
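
A hedged sketch of GST-conditioned synthesis with espnet2; the paths are hypothetical, the checkpoint is assumed to come from a GST-enabled TTS model, and the reference utterance is assumed to be supplied via the speech argument:

# Hedged sketch: style transfer via GST in espnet2. Paths are hypothetical
# and the checkpoint must come from a GST-enabled TTS model.
import librosa
import torch
from espnet2.bin.tts_inference import Text2Speech

text2speech = Text2Speech(
    "exp/tts_gst/config.yaml",
    "exp/tts_gst/train.loss.best.pth",
)

# Reference utterance whose speaking style should be transferred.
ref, _ = librosa.load("style_reference.wav", sr=24000)
ref = torch.from_numpy(ref).float()

with torch.no_grad():
    # For GST models, the reference speech conditions the style embedding.
    wav, c, *_ = text2speech("your input text", speech=ref)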

kan-bayashi avatar Dec 10 '20 11:12 kan-bayashi

OK, thanks a lot.

wac81 avatar Dec 15 '20 16:12 wac81

Hi @kan-bayashi, I tried the following. I made my own recipe following https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#how-to-make-the-recipe-for-your-own-dateset and changed the config for the adaptation as follows:

discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 50000             # Number of training steps.

Then I pretrained PWG v1 for 400k steps on my own recording of 50% of the LJ dataset, and fine-tuned it for 200k more steps.

The problems I am facing are:

  1. All my losses are oscillating. Can you show me what the loss curves should look like for a good result?
  2. The generated voice is the same as the input voice. When I try this vocoder on new audio, again the output is the same as the input.

What should I do?

mishra011 avatar Jun 02 '21 06:06 mishra011