Large dataset with multiple speakers: bad quality at 400k steps

Open ghost opened this issue 3 years ago • 18 comments

My dataset is 184 GB in the dump folder, with approximately 150 speakers and a mix of songs and speech, and I used the default recipe settings. After training finished at 400k steps, the generator loss was still around 2.

The adversarial loss keeps increasing, and the generated samples are poor in quality (nowhere near the quality of the pretrained models). I was wondering whether I should change the settings to increase the model capacity; for example, should I use the v3 configuration instead?

By the way, my goal is to train a universal vocoder for Japanese that works for both male and female speakers and for both speech and singing. The currently released pretrained model trained on JSUT only works for female speakers and does not work for songs.

ghost avatar Nov 05 '20 08:11 ghost

Please paste your config and TensorBoard log.

kan-bayashi avatar Nov 05 '20 08:11 kan-bayashi

Please paste your config and TensorBoard log.

@kan-bayashi I used the exact config as the template you provided here: https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/template_multi_spk/voc1/conf/parallel_wavegan.v1.yaml

TensorBoard logs: train, eval

It looks like the model has not converged yet. Should I keep training for more steps? How do I resume training if that's the case?

ghost avatar Nov 05 '20 09:11 ghost

You can use the --resume option, e.g., run.sh --stage 2 --resume /path/to/checkpoint. The training curve seems normal; the generator loss will converge around 2.0. You can see some figures in https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/README.md. The parallel_wavegan.v3 config uses the PWG generator + MelGAN discriminator, but there is not much difference in terms of quality.
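
For reference, a minimal resume invocation run from the recipe directory. The experiment directory and checkpoint file name below are placeholders following this repo's usual checkpoint-<N>steps.pkl naming; substitute your own paths:

```bash
cd egs/template_multi_spk/voc1
# Resume training (stage 2) from an existing checkpoint.
# The path below is a placeholder; point it at your own exp directory.
./run.sh --stage 2 --resume exp/train_parallel_wavegan.v1/checkpoint-400000steps.pkl
```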

I've never tried mixed (speech + singing) data, so let me check some points:

  • Does it work well if you use only speech data or only singing data?
  • Did you cut out the unnecessary silence? It is better to cut it as much as possible.

kan-bayashi avatar Nov 05 '20 11:11 kan-bayashi

You can use the --resume option, e.g., run.sh --stage 2 --resume /path/to/checkpoint. The training curve seems normal; the generator loss will converge around 2.0. You can see some figures in https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/README.md. The parallel_wavegan.v3 config uses the PWG generator + MelGAN discriminator, but there is not much difference in terms of quality.

I've never tried mixed (speech + singing) data, so let me check some points:

  • Does it work well if you use only speech data or only singing data?
  • Did you cut out the unnecessary silence? It is better to cut it as much as possible.

I didn't try separating the speech and singing data. I trained MelGAN on the speech data and it works for some speakers, but for others the quality is a little unnatural; that's why I'm trying PWGAN.

I didn't trim the silence either. Should I remove all silent parts, even the pauses within speech?

ghost avatar Nov 05 '20 21:11 ghost

@kan-bayashi Here are some samples: https://drive.google.com/drive/u/0/folders/1OSjYflGt6TUOiPbalsabn7Gsh6zvII14. Can you provide more suggestions on how to deal with this problem? Should I train on singing and speech data separately?

ghost avatar Nov 06 '20 04:11 ghost

Should I remove all silent parts, even the pauses within speech?

If you can, it is better to remove it, at least at the beginning and the end of the audio.
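
A minimal trimming sketch with librosa; the 24 kHz sampling rate and the top_db threshold are assumptions to tune for your corpus:

```python
import librosa
import soundfile as sf

# Load at the recipe's sampling rate (24 kHz here is an assumption).
wav, sr = librosa.load("utterance.wav", sr=24000)

# Trim leading and trailing silence; top_db is a tunable energy threshold
# (lower values trim more aggressively).
trimmed, _ = librosa.effects.trim(wav, top_db=30)

sf.write("utterance_trimmed.wav", trimmed, sr)
```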

Can you provide more suggestions on how to deal with this problem? Should I train on singing and speech data separately?

Hmm, not so good. Please provide the details of the training data:

  • Ratio of singing to speech
  • Gender ratio of the speakers

To clarify the problem, I list some ideas:

  • Train singing model and speech model separately
  • Train gender-dependent model
  • Increase the kernel_size of the generator (e.g., kernel_size=5), which may slightly improve performance

kan-bayashi avatar Nov 06 '20 04:11 kan-bayashi

The generated songs have weird "ripple-like" sounds, especially on notes with a long duration. I was wondering whether issue #42 was solved, because my current problem with singing is very similar to that one.

As for speech, there was no specific pattern I would call wrong, but the quality is nowhere close to the pretrained model's. The pretrained model works better for unseen speakers than my model does for seen speakers (like the JVS speakers).

ghost avatar Nov 06 '20 04:11 ghost

I have no experience with singing data, but I'm thinking about the necessity of a large receptive field for such long-duration parts.

kan-bayashi avatar Nov 06 '20 05:11 kan-bayashi

Should I remove all silent parts, even the pauses within speech?

If you can, it is better to remove it, at least at the beginning and the end of the audio.

Can you provide more suggestions on how to deal with this problem? Should I train on singing and speech data separately?

Hmm, not so good. Please provide the details of the training data:

  • Ratio of singing to speech
  • Gender ratio of the speakers

To clarify the problem, I list some ideas:

  • Train singing model and speech model separately
  • Train gender-dependent model
  • Increase the kernel_size of the generator (e.g., kernel_size=5), which may slightly improve performance

@kan-bayashi For the speech data, I'm using JVS + JSUT + around 10 hours of proprietary speech data, so it amounts to about 40 hours of speech in total. For the singing data, I'm using various singers available online, about 10 in total, each with around 20 to 30 songs. I'm also using a monophonic singing dataset (no lyrics) of about 10 hours, so about 15 hours of singing in total. The singing:speech ratio is about 3:8. Because of the use of JSUT, the gender ratio is unbalanced; I would say male:female is around 30%:70%.

ghost avatar Nov 06 '20 05:11 ghost

I have no experience with singing data, but I'm thinking about the necessity of a large receptive field for such long-duration parts.

By receptive field, do you mean the receptive field of the discriminator? Should I use the v3 configuration instead, since it looks like it has a larger receptive field than v1, or should I simply change the kernel size and stay with v1? I will train on the singing data separately and see what happens. I have already trained a MelGAN on speech data alone and it works well: not as good as ParallelWaveGAN, but not as bad as what I have now.

ghost avatar Nov 06 '20 05:11 ghost

Thank you for the info. I'm considering the balance of speakers and data types (speech or singing). In that case, JSUT will be dominant in each batch, which seems to affect the quality. Maybe you need to introduce a batch sampler to control the ratio within each batch.
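
A minimal sketch of such balancing with PyTorch's WeightedRandomSampler, assuming you can tag each training item with its source corpus during data prep (corpus_tags below is a hypothetical annotation, not something the recipe produces):

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, corpus_tags, batch_size):
    """Oversample under-represented corpora so one corpus (e.g., JSUT)
    does not dominate every batch.

    corpus_tags: the source-corpus label of each dataset item,
    e.g. ["jsut", "jvs", "singing", ...].
    """
    counts = Counter(corpus_tags)
    # Inverse-frequency weights: items from rarer corpora are drawn more often.
    weights = torch.tensor([1.0 / counts[tag] for tag in corpus_tags])
    sampler = WeightedRandomSampler(weights, num_samples=len(corpus_tags), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```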

By receptive field, do you mean the receptive field of the discriminator? Should I use the v3 configuration instead, since it looks like it has a larger receptive field than v1, or should I simply change the kernel size and stay with v1?

Since the convergence speed of v3 is too slow, I have not done many experiments with the v3 config, but it is worth trying if you can afford a long training run. Increasing the generator kernel size brings a slight improvement in every case.
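
For reference, increasing the kernel size is a one-line change in the generator section of the recipe config. A sketch, assuming the v1 template's defaults (the values other than kernel_size are recalled from parallel_wavegan.v1.yaml and should be checked against your copy):

```yaml
generator_params:
    in_channels: 1       # unchanged v1 defaults
    out_channels: 1
    kernel_size: 5       # increased from the v1 default of 3
    layers: 30
    stacks: 3
```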

kan-bayashi avatar Nov 06 '20 05:11 kan-bayashi

I am also experimenting with voice + singing corpora. One of the issues is that the spectral loss is bad at tracking pitch (I have a publication in preprint that I can share where I demonstrate this). I am experimenting with adapting the loss function to better model pitch.

I am also interested in this discussion; you are welcome to email me at lastname at gmail dot com.

turian avatar Nov 08 '20 12:11 turian

@turian Thank you for the info. In that case, an F0-conditioned discriminator like #220 should work.

kan-bayashi avatar Nov 09 '20 01:11 kan-bayashi

@turian Thank you for the info. In that case, an F0-conditioned discriminator like #220 should work.

@kan-bayashi I'm glad to hear you would like to implement that, and I'm looking forward to good news from you. For now, training separately on singing seems a little better, but the trained singing vocoder still has that "bump" even with kernel_size = 5. I believe @turian's suggestion might work: adding an F0 loss may help penalize the noises mentioned in #42, because those noises disrupt the actual F0 of the output.

ghost avatar Nov 10 '20 07:11 ghost

@kan-bayashi @turian I managed to solve the "bump" noise by adding an F0 reconstruction loss; the output sounds much smoother than before. I'm now adding the speech data back and using the "pretrain" option to fine-tune the vocoder so it works for both singing and speech.

ghost avatar Nov 12 '20 06:11 ghost

That is great. How did you calculate F0 reconstruction loss?

kan-bayashi avatar Nov 12 '20 06:11 kan-bayashi

That is great. How did you calculate F0 reconstruction loss?

@kan-bayashi I used this: https://github.com/maxrmorrison/torchcrepe. I downsample the synthesized wave to 16 kHz (the pretrained model only takes 16 kHz), use the last CNN layer as the F0 features, and take the MSE loss. It had an immediate effect on the generated songs and improved the results within 5k steps.
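
A minimal sketch of such a loss, assuming torchcrepe.embed() returns differentiable CREPE activations (if your version runs it under no_grad, you would need to call the underlying model directly); the 24 kHz source rate and the hop length are assumptions:

```python
import torch
import torch.nn.functional as F
import torchaudio
import torchcrepe

CREPE_SR = 16000  # the pretrained CREPE model only accepts 16 kHz audio

def f0_reconstruction_loss(y_hat, y, source_sr=24000, hop_length=160):
    """MSE between CREPE features of generated and ground-truth waveforms.

    y_hat, y: (1, samples) waveforms at source_sr.
    """
    # CREPE expects 16 kHz input, so resample both waveforms first.
    y_hat_16k = torchaudio.functional.resample(y_hat, source_sr, CREPE_SR)
    y_16k = torchaudio.functional.resample(y, source_sr, CREPE_SR)

    # Use CREPE's internal activations as pitch features.
    feat_hat = torchcrepe.embed(y_hat_16k, CREPE_SR, hop_length)
    with torch.no_grad():  # the reference features need no gradient
        feat_ref = torchcrepe.embed(y_16k, CREPE_SR, hop_length)

    return F.mse_loss(feat_hat, feat_ref)
```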

ghost avatar Nov 12 '20 07:11 ghost

That is great. How did you calculate F0 reconstruction loss?

@kan-bayashi I used this: https://github.com/maxrmorrison/torchcrepe. I downsample the synthesized wave to 16 kHz (the pretrained model only takes 16 kHz), use the last CNN layer as the F0 features, and take the MSE loss. It had an immediate effect on the generated songs and improved the results within 5k steps.

Great work! Can you share the model trained on the 184 GB dataset?

980202006 avatar Sep 01 '21 05:09 980202006