deepvoice3_pytorch

The result obtained by eval_model or synthesis is much worse than what is obtained during training

Open Eleanor456 opened this issue 5 years ago • 8 comments

When I generated audio from the checkpoint at 32000 steps, the output was pure noise, and the alignment plots are always empty, as shown below. How can I get output close to the normal-sounding audio produced during training?

step000034000_text1_multispeaker10_alignment

Eleanor456 avatar May 30 '20 18:05 Eleanor456

What datasets and presets are you using?

marianbasti avatar Jun 01 '20 07:06 marianbasti

What datasets and presets are you using?

A Chinese dataset with 61 speakers, and a preset that I modified from deepvoice3_vctk.json.

Eleanor456 avatar Jun 01 '20 07:06 Eleanor456

Which frontend did you select? I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

marianbasti avatar Jun 01 '20 07:06 marianbasti

Which frontend did you select? I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

I converted the transcripts to pinyin, so I selected the en frontend. I think the bad result may be because the model has not been trained for enough epochs.

Eleanor456 avatar Jun 01 '20 07:06 Eleanor456

It shouldn't be so noisy. This is what I get with 40000 steps on a 13-speaker dataset. step000040000_text3_multispeaker10_alignment

es frontend, so no phonetic dictionary

marianbasti avatar Jun 01 '20 08:06 marianbasti

It shouldn't be so noisy. This is what I get with 40000 steps on a 13-speaker dataset. step000040000_text3_multispeaker10_alignment

es frontend, so no phonetic dictionary

This is the result after training for 61000 steps with a batch size of 64. image

It is slightly better than before, so I plan to continue training and observe the result.

Eleanor456 avatar Jun 01 '20 08:06 Eleanor456

Please let me know how well it goes with that batch size.

marianbasti avatar Jun 01 '20 10:06 marianbasti

Same problem here. I am using the MAGICDATA dataset with 1016 speakers. Training to 1,500,000~2,000,000 steps gave good results during the training process, but inference with these two models produced bad speech. @Eleanor456 Is your model good now?

JohnHerry avatar Apr 13 '21 08:04 JohnHerry