
WIP: Support for Wavenet vocoder

Open r9y9 opened this issue 6 years ago • 7 comments

  • [x] Add a script to generate training data for the WaveNet vocoder
  • [ ] Train DeepVoice3 for the WaveNet vocoder
  • [ ] Train the WaveNet vocoder
  • [x] Add an option to synthesis.py to use the WaveNet vocoder
  • [ ] Improve quality

ref #11, https://github.com/r9y9/wavenet_vocoder/issues/1

r9y9 avatar Jan 06 '18 12:01 r9y9

I'm just wondering, what kind of data should I pass to generate_aligned_predictions.py to produce aligned mel-predictions for WaveNet? Should these audio files be preprocessed somehow (as well as mel-spectrograms)?

nikitos9000 avatar Mar 03 '18 16:03 nikitos9000

This is very much WIP and may change in the future, but for now I use the following command:

python generate_aligned_predictions.py \
    ./checkpoints_deepvoice3_wavenet/checkpoint_step000770000.pth \
    ~/Dropbox/sp/wavenet_vocoder/data/ljspeech/ \
    --preset=presets/deepvoice3_ljspeech_wavenet.json \
    ~/Dropbox/sp/wavenet_vocoder/data/ljspeech_deepvoice3

You need to pass:

  • A model checkpoint of DeepVoice3 (or similar)
  • Mel-spectrograms used to generate the aligned predictions (inside ~/Dropbox/sp/wavenet_vocoder/data/ljspeech/ in my case). Raw audio is not used to generate predictions, but it is used to make sure the time resolutions match. https://github.com/r9y9/deepvoice3_pytorch/blob/096ed401fb826798cb3672dbaa40df0a85d758e3/generate_aligned_predictions.py#L102-L103
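The time-resolution check mentioned in the second bullet can be sketched as below. This is only an illustration of the idea, not the script's actual code; `hop_size`, the function name, and the shapes are assumptions:

```python
import numpy as np

# Illustrative sketch: the number of mel frames times the hop size must
# account for the raw audio length, otherwise the WaveNet conditioning
# features and the target samples drift apart during training.
hop_size = 256  # assumed STFT hop size in samples

def check_alignment(mel: np.ndarray, audio: np.ndarray) -> int:
    """Return the audio length trimmed to exactly cover the mel frames."""
    n_frames = mel.shape[0]          # mel has shape (T_frames, n_mels)
    expected = n_frames * hop_size   # samples the mel sequence accounts for
    assert len(audio) >= expected, "audio is shorter than the mel coverage"
    return expected

mel = np.zeros((100, 80))          # 100 frames of 80-dim mel features
audio = np.zeros(100 * 256 + 13)   # raw waveform, slightly longer
trimmed = check_alignment(mel, audio)  # -> 25600 samples
```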

r9y9 avatar Mar 03 '18 16:03 r9y9

Okay, still quite alpha, but it seems to have started working.

DeepVoice3_wavenet_quite_alpha_770k_for_deepvoice3_6k_for_wavenet.zip

EDIT: Trained WaveNet for 60k steps, starting from pre-trained model https://github.com/r9y9/wavenet_vocoder/issues/19#issuecomment-366506397

r9y9 avatar Mar 03 '18 17:03 r9y9

@r9y9 Yes, thanks. I ran generate_aligned_predictions.py on the deepvoice3 ljspeech data, not on the wavenet data, so I ran into some problems there. Now it's clear.

BTW, do you need any help with the DeepVoice3 + WaveNet experiment? I reproduced your steps, but for now it doesn't sound as good as the Baidu or Google demos (while WaveNet itself sounds very good on mels). So I'm wondering what the reason is and what we should try in order to improve it. Do you have any ideas?

nikitos9000 avatar Mar 07 '18 09:03 nikitos9000

@nsmetanin Yes, I'd be happy if you could help. I haven't gotten results as good as the Google demos either. Currently I'm getting very coarse mel-spectrogram predictions from DeepVoice3, but I think we need sufficiently precise mel-spectrograms, otherwise we may end up with noisy speech. I want to try outputs_per_step=1 as mentioned in the Tacotron 2 paper, but I have an issue with that configuration (https://github.com/r9y9/deepvoice3_pytorch/issues/24). Attention-based encoder/decoder models are tricky to train...
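For context, a shape-only sketch of why outputs_per_step (often called r) coarsens the predictions: the decoder emits r mel frames in a single step, so there is no autoregressive feedback within each group of r frames. The variable names here are illustrative, not the repo's actual API:

```python
import numpy as np

# With outputs_per_step = r, one decoder step produces r stacked mel
# frames; the output is then unfolded into the full frame sequence.
B, steps, r, n_mels = 2, 50, 4, 80
decoder_out = np.zeros((B, steps, r * n_mels))   # one step -> r frames at once
mel = decoder_out.reshape(B, steps * r, n_mels)  # unfolded mel sequence
# r=1 gives per-frame feedback (finer predictions) but a 4x longer
# decoder loop, which is what makes that configuration harder to train.
```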

I am also planning to try increasing kernel_size and the encoder/decoder channels of DeepVoice3 to make the model more expressive.

r9y9 avatar Mar 08 '18 14:03 r9y9

Also, there are parameters that must match between the DeepVoice3 output and the WaveNet input, like the preemphasis coefficient, rescaling, and others. Those papers don't clearly state what should be used, so I just want to try some combinations.

For example, if you train WaveNet with rescaling=True and feed it predictions from a DeepVoice3 that was trained with rescaling=False, it will sound awful. Disabling preemphasis makes DeepVoice3 itself sound much worse, so that could be a problem too. I want to try enabling preemphasis for the mels of both DV3 and WaveNet, and train WaveNet to produce raw audio from mels computed with preemphasis.
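As a reference for the preemphasis parameter being discussed, here is a minimal sketch of the standard filter pair (y[n] = x[n] - coef * x[n-1] and its exact inverse). The coefficient 0.97 is a common default, but the point above is that whatever value is used must be identical in the DV3 and WaveNet preprocessing configs; the function names here are illustrative:

```python
import numpy as np
from scipy.signal import lfilter

def preemphasis(x, coef=0.97):
    # FIR high-pass: boosts high frequencies before feature extraction
    return lfilter([1.0, -coef], [1.0], x)

def inv_preemphasis(x, coef=0.97):
    # exact IIR inverse, applied when reconstructing the waveform
    return lfilter([1.0], [1.0, -coef], x)

x = np.random.RandomState(0).randn(1000)
roundtrip = inv_preemphasis(preemphasis(x))
# roundtrip recovers x up to float error; mixing different coef values
# between the two stages is exactly the kind of mismatch described above.
```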

nikitos9000 avatar Mar 13 '18 19:03 nikitos9000

Sorry, I can't quite figure out what generate_aligned_predictions.py does. Can you clarify a bit? Do I need to train WaveNet on the original mels generated by the wavenet preprocessing, or can I use the mels generated by the deepvoice preprocessing? If I need to use wavenet's preprocessing, which parameters should I copy so they match deepvoice's? P.S. I'm trying to train both models on my own dataset (not English). P.P.S. Sorry for the silly questions :D

Misterion777 avatar May 07 '20 08:05 Misterion777