
Breakpoint problem

Open maozhiqiang opened this issue 4 years ago • 22 comments

Hi @kan-bayashi, when I use this project to train a multi-band MelGAN vocoder, the generated audio always contains breakpoints. Modifying the kernel size of the first layer or the kernel size of the upsampling layers does not eliminate this phenomenon. The audio produced by the released model has a similar problem. For details, please see the marked regions in the attachments: Screenshot from 2020-09-15 11-53-52, Screenshot from 2020-09-15 11-54-47, LJ050-0033_gen.zip. Is there any way to eliminate it? Thank you!

maozhiqiang avatar Sep 15 '20 05:09 maozhiqiang

Sorry for the late reply. Unfortunately, I have no clear idea how to solve this problem. Some comments:

  • I recently fixed a PQMF problem, which may affect the quality.
  • How about increasing the stacks to expand the receptive field?
  • Do the breakpoints happen at the same position (e.g., a specific time or a specific phoneme)?

kan-bayashi avatar Sep 22 '20 09:09 kan-bayashi

Thank you for your reply.

  • The new PQMF did not resolve this problem.
  • The breakpoints appear randomly.
  • I will try increasing the stacks.

maozhiqiang avatar Sep 22 '20 10:09 maozhiqiang

Hi @kan-bayashi! Increasing the stacks did not resolve this problem. I suspect that the deconvolution kernel is responsible.

maozhiqiang avatar Sep 28 '20 00:09 maozhiqiang

Thank you for sharing your experiments. In #216, @LLianJJun suggested a better config. It is worth trying.

kan-bayashi avatar Sep 28 '20 00:09 kan-bayashi

@kan-bayashi Thanks! I will try this!

maozhiqiang avatar Sep 28 '20 00:09 maozhiqiang

I also see this issue in PWG with singing data. Has anybody solved this problem?

zpcoftts avatar Oct 14 '20 06:10 zpcoftts

@zpcoftts I tried changing the convolution kernel size, increasing the number of stacked layers, modifying the discriminator function, removing weight normalization, etc., but none of these solved the problem.

maozhiqiang avatar Oct 14 '20 07:10 maozhiqiang

@maozhiqiang Have you tried increasing "batch_max_steps"?

LLianJJun avatar Nov 09 '20 05:11 LLianJJun

@LLianJJun Not yet. Does it affect the sound quality? My config is as follows: sample_rate=16000, batch_max_steps=8000
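For reference, the segment length implied by these two settings can be worked out directly. This is only a rough sketch: `hop_size=200` is a hypothetical value that is not given in this thread, the helper names are made up for the example, and it assumes training segments are cut on mel-frame boundaries.

```python
def segment_seconds(batch_max_steps: int, sample_rate: int) -> float:
    """Duration in seconds of one training segment of raw audio."""
    return batch_max_steps / sample_rate


def frames_per_segment(batch_max_steps: int, hop_size: int) -> int:
    """Number of mel frames covered by one segment.

    Assumes the segment length is a whole number of hops.
    """
    if batch_max_steps % hop_size != 0:
        raise ValueError("batch_max_steps should be a multiple of hop_size")
    return batch_max_steps // hop_size


# Values from the comment above; hop_size=200 is a hypothetical choice.
print(segment_seconds(8000, 16000))   # 0.5
print(frames_per_segment(8000, 200))  # 40
```

So each training example covers about half a second of audio, which is the quantity that increasing batch_max_steps would change.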

maozhiqiang avatar Nov 09 '20 06:11 maozhiqiang

@maozhiqiang I'm not sure. However, I see breakpoints in continuous sections of voiced sound, so I suspect the cause is the receptive field or the speech segment size. I will share the results after the experiment. bbb111

LLianJJun avatar Nov 09 '20 06:11 LLianJJun

@LLianJJun thanks!

I changed the receptive field by setting stacks=5, but the problem remains.
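To give a rough sense of how much the stacks setting changes the receptive field, here is a minimal sketch. It assumes each residual stack is a chain of dilated 1-D convolutions with kernel size 3 whose dilation grows as kernel_size**j (a common MelGAN-style layout; this is an assumption and not confirmed in this thread). The function name is made up, and the result is counted in time steps at the resolution of that stack, not in output samples.

```python
def dilated_stack_receptive_field(stacks: int, kernel_size: int = 3) -> int:
    """Receptive field of `stacks` chained dilated convolutions.

    Assumption: layer j uses dilation kernel_size ** j, and each layer
    widens the receptive field by (kernel_size - 1) * dilation steps.
    """
    receptive_field = 1
    for j in range(stacks):
        dilation = kernel_size ** j
        receptive_field += (kernel_size - 1) * dilation
    return receptive_field


print(dilated_stack_receptive_field(3))  # 27
print(dilated_stack_receptive_field(5))  # 243
```

Under these assumptions, going from 3 to 5 stacks widens the per-block receptive field ninefold, so if the breakpoints persist, the receptive field alone is probably not the whole story.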

maozhiqiang avatar Nov 09 '20 06:11 maozhiqiang

I also hit this issue, but it does not appear in audio from the pretrained model; it only appears after the discriminator network is introduced.

OnceJune avatar Mar 05 '21 06:03 OnceJune

hi @LLianJJun. Have you solved this problem?

Alexey322 avatar Jun 07 '21 12:06 Alexey322

@maozhiqiang @LLianJJun @OnceJune @kan-bayashi @Alexey322 Hi all, have you solved this problem? This phenomenon also appears with my dataset. I have tried the following methods, but none of them solved it:

  1. Increasing the frame_length and frame_shift settings for the multi-resolution STFT loss
  2. Using bigger generators and bigger discriminators
  3. Fine-tuning the vocoder on force-aligned mels from the Taco2 model

Any suggestions for me? Many thanks.
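As an illustration of the first item, the sketch below converts STFT-loss resolutions to milliseconds and scales them up. The resolution values and the 24 kHz sample rate are illustrative assumptions, not the settings used by anyone in this thread, and frame_length / frame_shift are taken here to correspond to win_length / hop_size. All names are hypothetical.

```python
from typing import NamedTuple, Tuple


class StftResolution(NamedTuple):
    fft_size: int
    hop_size: int
    win_length: int


def to_milliseconds(res: StftResolution, sample_rate: int) -> Tuple[float, float]:
    """Return (frame length, frame shift) in milliseconds."""
    return (1000 * res.win_length / sample_rate,
            1000 * res.hop_size / sample_rate)


def scale_resolution(res: StftResolution, factor: float) -> StftResolution:
    """Scale window and hop by `factor`; FFT size is the next power of two."""
    win_length = int(round(res.win_length * factor))
    hop_size = int(round(res.hop_size * factor))
    fft_size = 1
    while fft_size < win_length:
        fft_size *= 2
    return StftResolution(fft_size, hop_size, win_length)


# Illustrative resolutions (assumed values).
resolutions = [
    StftResolution(1024, 120, 600),
    StftResolution(2048, 240, 1200),
    StftResolution(512, 50, 240),
]
for res in resolutions:
    print(res, to_milliseconds(res, 24000), scale_resolution(res, 2.0))
```

Longer windows trade time resolution for frequency resolution, so a larger frame_length on its own would not necessarily make short discontinuities more visible to the loss.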

GuangChen2016 avatar Mar 22 '22 02:03 GuangChen2016

@GuangChen2016 Hi, I'm now using HiFi-GAN trained for 200w+ (over 2 million) steps and then fine-tuned with GTA, and it has no breakpoints inside phonemes.

OnceJune avatar Mar 22 '22 06:03 OnceJune

@OnceJune Thanks for your reply. Yeah, HiFi-GAN is much better, with almost no breakpoints inside phonemes. However, it's much slower than MelGAN. What does your HiFi-GAN config look like, such as upsample_scales for the generator and the hop size? And which training script did you use? Did you use the config_v1.json and training scripts here, or did you modify anything? Thanks again.

GuangChen2016 avatar Mar 22 '22 06:03 GuangChen2016

@GuangChen2016 HiFi-GAN v1 has good audio quality, but it is large and slow. I used v2 with hop size 256, and the inference speed is good enough for me. You can also make HiFi-GAN multi-band.
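One thing worth checking when changing the hop size or going multi-band is that the generator's total upsampling factor still matches the hop size. The following is a minimal sketch of that check. The scale lists are illustrative assumptions, not the exact configs used in this thread, and the function names are made up.

```python
from math import prod
from typing import Sequence


def total_upsampling(upsample_scales: Sequence[int], num_bands: int = 1) -> int:
    """Output samples produced per mel frame.

    For a multi-band generator, each band is upsampled by the product of
    the scales, and band synthesis multiplies the rate by `num_bands`.
    """
    return prod(upsample_scales) * num_bands


def matches_hop_size(upsample_scales: Sequence[int], hop_size: int,
                     num_bands: int = 1) -> bool:
    """True if one mel frame maps to exactly `hop_size` waveform samples."""
    return total_upsampling(upsample_scales, num_bands) == hop_size


# Full-band example with hop size 256 (assumed scales).
print(matches_hop_size([8, 8, 2, 2], 256))             # True
# Four-band example with hop size 300 (assumed scales).
print(matches_hop_size([5, 5, 3], 300, num_bands=4))   # True
```

A multi-band generator needs a smaller upsampling product for the same hop size, which is where most of the speed-up comes from.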

OnceJune avatar Mar 22 '22 06:03 OnceJune

@OnceJune Yeah, HiFi-GAN v1 has good audio quality and no breakpoints, but when I moved to HiFi-GAN v2, breakpoints appear sometimes. Which repo do you use to train your HiFi-GAN v2 models? This repo or the official one? By the way, did you modify anything or add an additional loss to improve the results for HiFi-GAN v2?

GuangChen2016 avatar Mar 22 '22 07:03 GuangChen2016

@GuangChen2016 The official one. I didn't modify any layers or add any loss.

OnceJune avatar Mar 22 '22 07:03 OnceJune

@OnceJune Thank you very much. I also used the official one. Could you send me some samples from HiFi-GAN v2?

GuangChen2016 avatar Mar 22 '22 07:03 GuangChen2016

@GuangChen2016 Sorry, I'm using a commercial dataset. How many steps did you train HiFi-GAN for? HiFi-GAN might need 150w+ (over 1.5 million) steps to reach stable quality.

OnceJune avatar Mar 22 '22 07:03 OnceJune

@OnceJune Many thanks. I have trained the model for 200w (2 million) steps, but I haven't fine-tuned HiFi-GAN v2 with GTA yet. Could you describe its quality, and also its robustness, compared with LPCNet and MelGAN-STFT?

GuangChen2016 avatar Mar 22 '22 07:03 GuangChen2016