
Noisy speech samples of (Multi-band) MelGAN on (small) multispeaker dataset

unilight opened this issue 4 years ago • 16 comments

Hi, first of all, big thanks to all the people who helped develop the MelGAN model! The inference speed is super fast!

I modified the configs from the VCTK recipe to train a parallel_wavegan.v1 and a multi_band_melgan.v2 on the VCC2018 dataset, which contains 12 speakers × 81 utterances = 972 training utterances. The analysis-synthesis samples are as follows:

PWG (400k steps): https://drive.google.com/drive/folders/1buveb7V_nz7reWNCQsy2loxXjynVItkV?usp=sharing
MelGAN (1000k steps): https://drive.google.com/drive/folders/1X5hrryxRL_txNtyE48Xw1gpzN7T24cB3?usp=sharing

I found that MelGAN generates much noisier speech, while PWG is pretty stable. I listened to the official VCTK samples and noticed a similar trend: MelGAN is a little worse than PWG (though less noisy there, perhaps due to the larger dataset?). Is this a known issue? Any suggestions on how to improve it?

unilight avatar Jun 15 '20 10:06 unilight
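
(For context, a minimal sketch of how such runs are typically launched with this repo's egs recipes; the recipe directory, stage numbers, and tags below are assumptions for illustration, not taken from the comment above.)

```bash
# Sketch of launching the two trainings from a VCTK-style recipe directory.
cd egs/vctk/voc1

# Parallel WaveGAN v1 with a customized config.
./run.sh --conf conf/parallel_wavegan.v1.yaml --stage 0 --tag vcc2018_pwg.v1

# Multi-band MelGAN v2 with a customized config.
./run.sh --conf conf/multi_band_melgan.v2.yaml --stage 0 --tag vcc2018_mb_melgan.v2
```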

Could you paste your TensorBoard log?

kan-bayashi avatar Jun 15 '20 12:06 kan-bayashi

[TensorBoard screenshot: training curves for PWG and MelGAN]

I am not sure what the correct curve of MelGAN should look like, but the curve of PWG looks correct.

unilight avatar Jun 15 '20 16:06 unilight

It seems to be fine, but the discriminator loss is a little small. I'm not sure whether the problem is the amount of training data. (Also, I'm not sure which is better in general, PWG or MB-MelGAN, in terms of quality.) Did you try a different lambda_adv?

kan-bayashi avatar Jun 17 '20 12:06 kan-bayashi
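
(lambda_adv is a top-level key in this repo's recipe YAML configs. A sketch of lowering it, assuming the 4.0 default of parallel_wavegan.v1.yaml; adjust the key and values to your own config.)

```bash
# Check the current adversarial loss weight.
grep lambda_adv conf/parallel_wavegan.v1.yaml   # e.g. lambda_adv: 4.0

# Lower it in a copy of the config rather than editing in place.
cp conf/parallel_wavegan.v1.yaml conf/parallel_wavegan.v1.lambda2.yaml
sed -i 's/^lambda_adv: .*/lambda_adv: 2.0/' conf/parallel_wavegan.v1.lambda2.yaml
```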

It seems to be fine, but the discriminator loss is a little small. I'm not sure whether the problem is the amount of training data. (Also, I'm not sure which is better in general, PWG or MB-MelGAN, in terms of quality.) Did you try a different lambda_adv?

Hi @kan-bayashi, I have a task similar to @unilight's (multi-speaker, few samples). The clarity is good but the similarity is poor. Now I'm trying to train Parallel WaveGAN on only one speaker, but the loss is barely decreasing. Which parameters could I adjust to make the training continue to converge?

Approximetal avatar Jul 09 '20 10:07 Approximetal

the loss is barely decreasing

Which loss value do you mean? The discriminator loss of PWG stays at around the same value, about 0.5.

kan-bayashi avatar Jul 09 '20 13:07 kan-bayashi

the loss is barely decreasing

Which loss value do you mean? The discriminator loss of PWG stays at around the same value, about 0.5.

The figure is my TensorBoard log. The generator loss drops at 1M steps because I changed lambda_adv to 2, but that doesn't seem to help.

Approximetal avatar Jul 11 '20 07:07 Approximetal

@kan-bayashi I've tried several times, adjusting the learning rate and lambda_adv, but the model always starts overfitting after 1M steps. Any idea how to avoid this? Thanks.

Approximetal avatar Jul 13 '20 03:07 Approximetal

I think in the case of PWG, 1M iterations are enough. Why don't you try adaptation from a good single-speaker model? In my experiments, it works well with only 50k iterations.

kan-bayashi avatar Jul 13 '20 04:07 kan-bayashi
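
(A sketch of the adaptation workflow suggested here, assuming the recipe's run.sh accepts a --pretrain option pointing at an existing checkpoint, following this repo's egs conventions; the checkpoint path and tag are placeholders.)

```bash
# Fine-tune from an existing single-speaker checkpoint instead of training
# from scratch; stage 2 skips data preparation and feature extraction.
./run.sh --stage 2 \
    --conf conf/parallel_wavegan.v1.yaml \
    --tag target_speaker_adapt \
    --pretrain /path/to/pretrained/checkpoint-400000steps.pkl
```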

I think in the case of PWG, 1M iterations are enough. Why don't you try adaptation from a good single-speaker model? In my experiments, it works well with only 50k iterations.

Thanks for replying, @kan-bayashi. I'm now training single-speaker models; some speakers are clear, while others have louder noise. As the results in single_speaker_inference_1000k.zip show, the quality is not good enough. I'm not sure what the main reason is: insufficient training, too little training data (only 70 sentences per speaker), or the quality of the training data.

Approximetal avatar Jul 13 '20 04:07 Approximetal

I'm now training single-speaker models

Did you use a pretrained model? 70 utterances are not enough to train from scratch. I think it is better to consider using a pretrained scheme.

kan-bayashi avatar Jul 13 '20 09:07 kan-bayashi

I'm now training single-speaker models

Did you use a pretrained model? 70 utterances are not enough to train from scratch. I think it is better to consider using a pretrained scheme.

The mel-spectrogram parameters (Chinese, 16 kHz) don't fit the pretrained model, so I had to train from scratch. I previously trained a multi-speaker, multi-language model for 840k steps and then fine-tuned it on a single speaker; the results are uploaded above (the mel spectrograms were generated by a voice conversion model). Since yesterday, I have been following this advice:

Why don't you try adaptation from a good single-speaker model? In my experiments, it works well with only 50k iterations.

It's at 80k iterations now, and the quality hasn't reached the level of the 1000k models. I'll continue training to see when it achieves the best performance.

Approximetal avatar Jul 14 '20 03:07 Approximetal

@kan-bayashi When you trained PWGAN up to 50k steps from a pretrained model, when did you turn on the discriminator? After some number of steps, or from the start? When I fine-tune a female voice from the LJSpeech model, it almost always sounds good after 20k steps, but male voices sound bad.

ZDisket avatar Aug 19 '20 05:08 ZDisket

In my case, I use the discriminator from the first iteration. Both male and female adaptation from a pretrained female model work well, but it depends on the speakers' similarity. If you have a good male-speaker dataset, it is better to consider creating a single male-speaker model.

kan-bayashi avatar Aug 19 '20 13:08 kan-bayashi
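
(In these configs, the warm-up before adversarial training is controlled by discriminator_train_start_steps; a sketch of enabling the discriminator from the first iteration as described above, assuming the key and its 100000 default as in parallel_wavegan.v1.yaml.)

```bash
# Default: the discriminator is switched on only after a warm-up period.
grep discriminator_train_start_steps conf/parallel_wavegan.v1.yaml
# -> discriminator_train_start_steps: 100000

# Start adversarial training from the first iteration instead.
sed -i 's/^discriminator_train_start_steps: .*/discriminator_train_start_steps: 0/' \
    conf/parallel_wavegan.v1.yaml
```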

@kan-bayashi VCTK has some male speakers; can we fine-tune a single male speaker from a multi-speaker model?

ZDisket avatar Aug 19 '20 13:08 ZDisket

In my adaptation experiments using a female dataset, multi-speaker-based adaptation was worse than single-female-speaker-based adaptation. But if you do not have a male model, it is worth trying.

kan-bayashi avatar Aug 19 '20 13:08 kan-bayashi

I also find that MB-MelGAN generates noisy speech, especially in the high frequencies. I wonder whether the discriminator cannot downsample properly because of the subband processing.

hyysam avatar Sep 16 '20 09:09 hyysam