ParallelWaveGAN
Large dataset with multiple speakers: bad quality at 400k steps
My dataset is 184 GB in the dump folder, and I have approximately 150 speakers with mixed songs and speech. I used the default recipe settings. After the model finished training at 400k steps, the generator loss was still around 2.
The adversarial loss keeps increasing, and the generated samples are of poor quality (nowhere near the pretrained models). I was wondering if I should change the settings to give the model higher capacity; for example, should I use the v3 configuration instead?
By the way, my goal is to train a universal vocoder for Japanese that works for both male and female speakers, and for both speech and singing. The currently released pretrained models trained on JSUT only work for female speakers and do not work for songs.
Please paste your config and TensorBoard log.
@kan-bayashi I used the exact config as the template you provided here: https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/template_multi_spk/voc1/conf/parallel_wavegan.v1.yaml
Tensorboard logs:
It looks like the model has not converged yet. Should I keep training it for more steps? How do I resume training in that case?
You can use the `--resume` option, e.g., `run.sh --stage 2 --resume /path/to/checkpoint`.
The training curve seems normal. The generator loss will converge around 2.0.
You can see some figures in https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/README.md.
The `parallel_wavegan.v3` config uses the PWG generator + MelGAN discriminator, but there is not much difference in terms of quality.
I've never tried mixed (speech + singing) data, so let me check some points:
- Is it working well if you use only speech or singing data?
- Did you cut the unnecessary silence? It is better to cut it as much as possible.
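A minimal, dependency-free sketch of such silence trimming (an energy-based approach with hypothetical threshold and frame settings; in practice `librosa.effects.trim` or sox silence removal does the same job more robustly):

```python
import numpy as np

def trim_silence(wav, threshold_db=-40.0, frame_len=1024, hop=256):
    """Remove leading and trailing silence below an energy threshold.

    Computes frame-wise RMS energy in dB relative to the loudest frame
    and keeps only the span between the first and last "voiced" frame.
    """
    n_frames = max(1, 1 + (len(wav) - frame_len) // hop)
    rms = np.array([
        np.sqrt(np.mean(wav[i * hop:i * hop + frame_len] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    db = 20.0 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    voiced = np.where(db > threshold_db)[0]
    if len(voiced) == 0:
        return wav[:0]  # nothing above the threshold
    start = voiced[0] * hop
    end = min(len(wav), voiced[-1] * hop + frame_len)
    return wav[start:end]
```

Applied offline to each training utterance, this removes at minimum the leading and trailing silence; trimming long internal pauses would need the same logic applied per-segment.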
I didn’t try separating the speech and singing data. I trained MelGAN on the speech data and it works for some speakers, but for others the quality is a little unnatural; that’s why I’m trying PWGAN.
I didn’t trim the silence either. Should I remove all silent parts, even those between speech?
@kan-bayashi Here are some samples: https://drive.google.com/drive/u/0/folders/1OSjYflGt6TUOiPbalsabn7Gsh6zvII14, can you provide more suggestions on how to deal with this problem? Should I train on singing and speech data separately?
> Should I remove all silent parts, even those between speech?
If you can, it is better to remove it. At least the beginning and the end of the audio.
> Can you provide more suggestions on how to deal with this problem? Should I train on singing and speech data separately?
Hmm, not so good. Please provide details of the training data:
- Ratio of singing and speech
- Ratio of speaker gender
To clarify the problem, I list some ideas:
- Train singing model and speech model separately
- Train gender-dependent model
- Increase the kernel_size of the generator (e.g., kernel_size=5), which slightly improves performance
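The kernel-size suggestion is a one-line config change; a fragment assuming the layout of `parallel_wavegan.v1.yaml`, where the generator kernel size defaults to 3:

```yaml
generator_params:
  kernel_size: 5   # default is 3; a larger kernel widens the receptive field
```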
It has weird "ripple-like" sounds in the generated songs, especially for notes with a long duration. I was wondering if issue #42 was solved because my problem right now is very similar to that one for singing.
As for speech, there was no specific pattern I would call wrong, but it's nowhere close to the pre-trained model quality. The pretrained model works for unseen speakers even better than my model where the speakers are seen (like JVS speakers).
I have no experience with singing data, but I'm thinking about the necessity of a large receptive field for such long-duration parts.
@kan-bayashi For speech data, I'm using JVS + JSUT + around 10 hours of proprietary speech data, so in total it amounts to about 40 hours of speech. For singing data, I'm using various singers available online, about 10 in total, each with around 20 to 30 songs. I'm also using a 10-hour monophonic singing dataset (no lyrics), so about 15 hours in total for singing. The singing:speech ratio is about 3:8. Because of the use of JSUT, the gender ratio is unbalanced; I would say male:female is around 30%:70%.
> I have no experience with singing data, but I'm thinking about the necessity of a large receptive field for such long-duration parts.
By receptive field, do you mean the receptive field of the discriminator? Should I use the v3 configuration instead, since it looks like it has a larger receptive field than v1, or should I simply change the kernel size and stay with v1? I will train on singing data separately and see what happens. I have already trained a MelGAN with speech data alone and it works well: not as good as ParallelWaveGAN, but not as bad as what I have now.
Thank you for your info. I'm concerned about the balance of speakers and data types (speech or singing). In that case, JSUT will dominate the batch, which seems to affect the quality. Maybe you need to introduce a batch sampler to control the ratio within each batch.
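A minimal sketch of such ratio-controlled sampling (function and parameter names are hypothetical; in PyTorch the equivalent is `torch.utils.data.WeightedRandomSampler` with per-sample weights `ratio[domain] / count[domain]`):

```python
import random
from collections import Counter

def make_balanced_sampler(domains, ratio=None, seed=0):
    """Yield dataset indices so each domain (e.g. "speech"/"singing")
    appears at a target ratio regardless of corpus size.

    domains[i] is the domain label of utterance i; ratio maps
    domain -> target probability (defaults to uniform over domains).
    """
    counts = Counter(domains)
    if ratio is None:
        ratio = {d: 1.0 / len(counts) for d in counts}
    # Per-sample weight: domain target probability spread over its members.
    weights = [ratio[d] / counts[d] for d in domains]
    rng = random.Random(seed)
    indices = list(range(len(domains)))
    while True:  # infinite stream; consume with itertools.islice
        yield from rng.choices(indices, weights=weights, k=len(domains))
```

With a 40 h speech / 15 h singing split, a uniform target ratio would roughly triple how often singing utterances are seen per epoch, preventing JSUT from dominating the batches.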
> By receptive field, do you mean the receptive field of the discriminator? Should I use the v3 configuration instead, since it looks like it has a larger receptive field than v1, or should I simply change the kernel size and stay with v1?
Since the convergence speed of `v3` is too slow, I have not done many experiments with the `v3` config. But it is worth trying if you can afford a long training run.
Increasing the generator kernel size brings a slight improvement in every case.
I am also experimenting with voice + singing corpora. One of the issues is that the spectral loss is bad at tracking pitch (I have a preprint I can share where I demonstrate this). I am experimenting with adapting the loss function to better model pitch.
I am also interested in this discussion; you are welcome to email me at lastname at gmail dot com.
@turian Thank you for your info. Then, an F0-conditioned discriminator may work, like #220.
@kan-bayashi I'm glad to hear you would like to implement that; I'm looking forward to good news from you. For now, training separately on singing seems slightly better, but the trained singing vocoder still has that "bump" even with filter_size = 5. I believe @turian's suggestion might work: adding an F0 loss may help penalize the noises mentioned in #42, because these noises disrupt the actual F0 of the output.
@kan-bayashi @turian I managed to remove the "bump" noise by adding an F0 reconstruction loss; the output sounds much smoother than before. I'm now adding speech data and using the "pretrain" option to fine-tune the vocoder to work for both singing and speech.
That is great. How did you calculate F0 reconstruction loss?
@kan-bayashi I used this: https://github.com/maxrmorrison/torchcrepe — downsample the synthesized wave to 16 kHz (the pretrained model only accepts 16 kHz), use the last CNN layer as the F0 features, and take the MSE loss. It had an immediate effect on the generated songs, improving the results within 5k steps.
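The general pattern can be sketched as follows. This is not the author's exact code: `extract` is a placeholder standing in for the pretrained pitch model (the thread uses the last CNN layer of torchcrepe), and the naive stride-based decimation is only a stand-in for a proper polyphase resampler:

```python
import numpy as np

def f0_feature_loss(wav_fake, wav_real, extract, decimate=3):
    """MSE between pitch-related features of generated and real audio.

    extract: maps a 1-D waveform (at the pitch model's expected rate,
             16 kHz for torchcrepe) to a feature array.
    decimate: naive integer downsampling factor, e.g. 3 for 48 -> 16 kHz.
    """
    fake = wav_fake[::decimate]
    real = wav_real[::decimate]
    return float(np.mean((extract(fake) - extract(real)) ** 2))
```

In training this term would be added to the existing multi-resolution STFT + adversarial losses, with gradients flowing through the generated waveform (so the feature extractor itself stays frozen).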
Great work! Can you share the model trained on the 184 GB dataset?