ParallelWaveGAN Breakpoint problem

Hi @kan-bayashi, when I use this project to train the multi-band MelGAN vocoder, there are always breakpoints in the generated audio. Changing the kernel size of the first layer or the kernel size of the upsampling layers does not eliminate this phenomenon. The audio produced by the released model has a similar problem. For details, please refer to the marked regions in the attachments: Screenshot from 2020-09-15 11-53-52, LJ050-0033_gen.zip. Is there any way to eliminate it? Thank you!

Sep 15 '20 05:09 maozhiqiang

Sorry for the late reply. Unfortunately, I have no clear idea how to solve this problem. Here are some comments:

I recently fixed a PQMF problem, which may affect the quality.
How about increasing the number of stacks to expand the receptive field?
Do the breakpoints occur at the same position (e.g., the same time, a specific phoneme, etc.)?
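
To make the receptive-field suggestion concrete, here is a minimal sketch of how the receptive field grows with the number of stacks. It assumes stride-1 1D convolutions and a hypothetical layout in which stack `i` uses a single dilated convolution with dilation `kernel_size ** i`; the function names are illustrative and not part of this repository.

```python
def receptive_field(layers):
    """Receptive field (in samples) of stacked stride-1 1D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer widens
    the receptive field by (kernel_size - 1) * dilation samples.
    """
    return 1 + sum((k - 1) * d for k, d in layers)


def residual_stack_layers(stacks, kernel_size=3):
    # Assumed layout (hypothetical): stack i has one dilated convolution with
    # dilation kernel_size ** i. 1x1 convolutions add nothing and are omitted.
    return [(kernel_size, kernel_size ** i) for i in range(stacks)]


for stacks in (3, 4, 5):
    print(stacks, receptive_field(residual_stack_layers(stacks)))
```

Under these assumptions each extra stack roughly triples the receptive field of one residual block, measured at the time resolution of the layer where the block sits.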

Sep 22 '20 09:09 kan-bayashi

Thank you for your

The new PQMF did not resolve this problem.
The breakpoints appear randomly.
I will try increasing the stacks.
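
One way to check where breakpoints fall is to look for short dips in frame energy. The following is only a rough sketch in pure Python: the thresholds `drop_db` and `max_width`, the frame sizes, and the function names are assumptions chosen for illustration, and a real breakpoint may not always show up as an energy dip.

```python
import math


def frame_energy_db(samples, frame_length=256, hop=128, floor_db=-100.0):
    """Mean power of each frame in dB, clamped from below at floor_db."""
    energies = []
    for start in range(0, len(samples) - frame_length + 1, hop):
        frame = samples[start:start + frame_length]
        power = sum(x * x for x in frame) / frame_length
        level = 10.0 * math.log10(power) if power > 0 else floor_db
        energies.append(max(level, floor_db))
    return energies


def find_dips(energies_db, drop_db=20.0, max_width=4):
    """Return (first, last) frame ranges where the energy falls at least
    drop_db below the preceding frame and recovers within max_width frames."""
    dips = []
    n = len(energies_db)
    i = 1
    while i < n - 1:
        before = energies_db[i - 1]
        if before - energies_db[i] >= drop_db:
            j = i
            while j < n and before - energies_db[j] >= drop_db:
                j += 1
            # j is the first recovered frame, or n if the energy never recovers
            if j < n and j - i <= max_width:
                dips.append((i, j - 1))
            i = j
        else:
            i += 1
    return dips


# Synthetic example: a 440 Hz tone at 16 kHz with a 256-sample gap.
sample_rate = 16000
signal = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(4096)]
for t in range(2048, 2304):
    signal[t] = 0.0

dips = find_dips(frame_energy_db(signal))
for first, last in dips:
    print("dip starts near", first * 128 / sample_rate, "s")
```

Running this over several generated utterances and comparing the dip times against the phoneme alignment would show whether the positions are really random.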

Sep 22 '20 10:09 maozhiqiang

Hi @kan-bayashi! Increasing the stacks did not resolve this problem. I suspect that the deconvolution kernel is responsible.

Sep 28 '20 00:09 maozhiqiang

Thank you for sharing your experiments. In #216, @LLianJJun suggested a better config. It is worth trying.

Sep 28 '20 00:09 kan-bayashi

@kan-bayashi Thanks! I will try this!

Sep 28 '20 00:09 maozhiqiang

I also see this issue in PWG when using singing data. Has anybody solved this problem?

Oct 14 '20 06:10 zpcoftts

@zpcoftts I tried changing the convolution kernel size, increasing the number of stack layers, modifying the discriminator, removing weight normalization, etc., but none of these solved the problem.

Oct 14 '20 07:10 maozhiqiang

@maozhiqiang Have you tried increasing "batch_max_steps"?

Nov 09 '20 05:11 LLianJJun

@LLianJJun Not yet. Does it affect the sound quality? My config is as follows: sample_rate=16000, batch_max_steps=8000.
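
For reference, the length of one training segment under this config can be worked out as below. This is a small sketch: `hop_size=200` is an assumption (the hop size is not stated in this thread), and the divisibility check is a requirement of this sketch only.

```python
def segment_info(batch_max_steps, sample_rate, hop_size):
    """Return (duration in seconds, number of mel frames) of one training segment."""
    if batch_max_steps % hop_size != 0:
        raise ValueError("this sketch expects batch_max_steps to be a multiple of hop_size")
    return batch_max_steps / sample_rate, batch_max_steps // hop_size


sample_rate = 16000
hop_size = 200  # assumption: not given in the thread

for steps in (8000, 16000, 24000):
    seconds, frames = segment_info(steps, sample_rate, hop_size)
    print(f"batch_max_steps={steps}: {seconds:.2f} s, {frames} mel frames")
```

With batch_max_steps=8000 at 16 kHz, each training segment covers half a second of audio, so increasing it lets the model see longer continuous voiced sections during training.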

Nov 09 '20 06:11 maozhiqiang

@maozhiqiang I'm not sure. However, I have a breakpoint in a continuous section of the voiced component, so I suspect the cause is the receptive field or the speech segment size. I will share the results after the experiment. bbb111

Nov 09 '20 06:11 LLianJJun

@LLianJJun thanks!

I changed the receptive field by setting stacks=5, but the problem remains.

Nov 09 '20 06:11 maozhiqiang

I also encountered this issue, but it does not appear in audio from the pretrained model; it only appears after the discriminator network is introduced.

Mar 05 '21 06:03 OnceJune

@maozhiqiang I'm not sure. However, I have a breakpoint in a continuous section of the voiced component, so I suspect the cause is the receptive field or the speech segment size. I will share the results after the experiment.

Hi @LLianJJun, have you solved this problem?

Jun 07 '21 12:06 Alexey322

@maozhiqiang @LLianJJun @OnceJune @kan-bayashi @Alexey322 Hi all, have you solved this problem? This phenomenon also appears with my dataset. I have tried the following methods, but none of them solved the problem well:

increasing the frame_length and frame_shift settings for the multi-resolution STFT loss
using bigger generators and bigger discriminators
fine-tuning the vocoder on forced-aligned mels from a Taco2 model

Any suggestions for me? Many thanks.
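
To illustrate what changing frame_length and frame_shift does to each STFT resolution, here is a small arithmetic sketch. It assumes that frame_length corresponds to the window length and frame_shift to the hop size; the three (fft_size, hop_size, win_length) triples are illustrative values, not the settings used in the experiments above.

```python
def stft_resolution(fft_size, hop_size, win_length, sample_rate):
    """Time and frequency resolution of one STFT configuration."""
    return {
        "window_ms": 1000.0 * win_length / sample_rate,  # analysis window duration
        "hop_ms": 1000.0 * hop_size / sample_rate,       # shift between frames
        "bin_hz": sample_rate / fft_size,                # width of one frequency bin
    }


sample_rate = 16000
# Illustrative (fft_size, hop_size, win_length) triples; assumed values.
resolutions = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]

for fft_size, hop_size, win_length in resolutions:
    print(fft_size, stft_resolution(fft_size, hop_size, win_length, sample_rate))
```

Longer windows give finer frequency bins but coarser time resolution, so a very short dropout may be smoothed over in the resolutions with long windows.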

Mar 22 '22 02:03 GuangChen2016

@GuangChen2016 Hi, I'm now using HiFi-GAN trained for 2M+ (200w+) steps and then fine-tuned with GTA (ground-truth-aligned) mels, and it has no breakpoints inside phonemes.

Mar 22 '22 06:03 OnceJune

@OnceJune Thanks for your reply. Yes, HiFi-GAN is much better and has almost no breakpoints inside phonemes. However, it is much slower than MelGAN. What does your HiFi-GAN config look like, for example the generator's upsample_scales and the hop size? And which training script did you use? Did you use config_v1.json and the training scripts here, or did you modify anything? Thanks again.

Mar 22 '22 06:03 GuangChen2016

@GuangChen2016 HiFi-GAN v1 has good audio quality, but it is large and slow. I used v2 with hop size 256, and the inference speed is good enough for me. You can also make HiFi-GAN multi-band.

Mar 22 '22 06:03 OnceJune

@OnceJune Yes, HiFi-GAN v1 has good audio quality and no breakpoints, but when I moved to HiFi-GAN v2, breakpoints sometimes appear. Which repo did you use to train your HiFi-GAN v2 models, this repo or the official one? By the way, did you modify anything or add an additional loss to improve the results for HiFi-GAN v2?

Mar 22 '22 07:03 GuangChen2016

@GuangChen2016 The official one. I didn't modify any layers or add any loss.

Mar 22 '22 07:03 OnceJune

@OnceJune Thank you very much. I also used the official one. Could you send me some samples from HiFi-GAN v2?

Mar 22 '22 07:03 GuangChen2016

@GuangChen2016 Sorry, I'm using a commercial dataset. How many steps did you train HiFi-GAN for? HiFi-GAN might need 1.5M+ (150w+) steps to reach stable quality.

Mar 22 '22 07:03 OnceJune

@OnceJune Many thanks. I have trained the model for 2M (200W) steps, but I haven't fine-tuned HiFi-GAN v2 with GTA yet. Could you describe its quality compared with LPCNet and Melgan-stft, and also its robustness?

Mar 22 '22 07:03 GuangChen2016
