deepvoice3_pytorch: the result obtained by eval_model or synthesis is much worse than the result obtained during training

Open Eleanor456 opened this issue 5 years ago • 8 comments

When I generated audio from the checkpoint at 32000 steps, the output was pure noise, and the alignment plots are always empty, as shown below. How can I get results close to the normal-sounding audio obtained during training?

[Image: step000034000_text1_multispeaker10_alignment]

May 30 '20 18:05 Eleanor456

What datasets and presets are you using?

Jun 01 '20 07:06 marianbasti

What datasets and presets are you using?

A Chinese dataset with 61 speakers, and a preset that I modified based on deepvoice3_vctk.json.

Jun 01 '20 07:06 Eleanor456

Which frontend did you select? I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

Jun 01 '20 07:06 marianbasti

Which frontend did you select? I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

I converted the transcripts to pinyin, so I selected the en frontend. I think the bad result may be because the model has not trained for enough epochs.
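For readers following along, here is a minimal sketch of what converting a transcript to pinyin could look like. It is not the poster's code: the lookup table is a hypothetical toy lexicon, the `utterance_id|transcript` line layout is an assumption, and a real pipeline would more likely rely on a full dictionary or a library such as pypinyin.

```python
# Hypothetical toy lexicon (assumption): a real pipeline would use a full
# dictionary or a library such as pypinyin. Tone digits follow the syllable.
TOY_LEXICON = {
    "你": "ni3",
    "好": "hao3",
    "世": "shi4",
    "界": "jie4",
}


def to_pinyin(text, lexicon=TOY_LEXICON):
    """Map each known character to its pinyin syllable, space separated.

    Characters missing from the lexicon are kept unchanged so that gaps
    stay visible in the output instead of being silently dropped.
    """
    return " ".join(lexicon.get(ch, ch) for ch in text if not ch.isspace())


def convert_line(line):
    """Convert one 'utterance_id|transcript' line (assumed layout)."""
    utt_id, transcript = line.split("|", 1)
    return f"{utt_id}|{to_pinyin(transcript)}"


if __name__ == "__main__":
    print(convert_line("utt0001|你好世界"))  # utt0001|ni3 hao3 shi4 jie4
```

Because the output is plain ASCII with tone digits, it can be fed to an English-style text frontend, which is presumably why the en frontend was chosen here.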

Jun 01 '20 07:06 Eleanor456

Shouldn't be so noisy. This is what I get with 40000 steps on a 13-speaker dataset: [Image: step000040000_text3_multispeaker10_alignment]

es frontend, so no phonetics dictionary

Jun 01 '20 08:06 marianbasti

Shouldn't be so noisy. This is what I get with 40000 steps on a 13-speaker dataset.

es frontend, so no phonetics dictionary

This is the result after training for 61000 steps with batch size of 64.

It is slightly better than before, so I plan to continue training and observe the result.

Jun 01 '20 08:06 Eleanor456

Please let me know how well it goes with that batch size

Jun 01 '20 10:06 marianbasti

Same problem here. I am using the MAGICDATA dataset with 1016 speakers. Training to 1,500,000~2,000,000 steps gave good results during the training process, but inference with these two models produces bad speech. @Eleanor456 Is your model good now?

Apr 13 '21 08:04 JohnHerry
