style-token_tacotron2

the trained model generates different wavs with the same text and reference audio

Open MorganCZY opened this issue 4 years ago • 15 comments

When doing tests, I found that each time I ran synthesize.py (with the same text and reference audio), I got different results (namely different synthesized wavs). After looking through the code, I didn't find any random operations during synthesis. Could you give me an explanation?

MorganCZY avatar Sep 17 '20 09:09 MorganCZY

Please specify the reference audio's path in 'tacotron_style_reference_audio' in hparams.py, then synthesize. Feel free to raise more questions.

cnlinxi avatar Sep 17 '20 10:09 cnlinxi

Yes, I have specified the reference audio path in hparams.py.


MorganCZY avatar Sep 17 '20 11:09 MorganCZY

In hparams.py:

tacotron_style_alignment=None,

you can manually specify style token alignment weights instead of getting them from reference audio.

Do you mean this?

cnlinxi avatar Sep 17 '20 11:09 cnlinxi
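For illustration, a hedged sketch of what that hparams.py setting might look like; the token count and weight values below are assumptions for illustration, not taken from this repo:

```python
# hparams.py (illustrative sketch, assuming the model was trained with 10 style tokens)
# Explicit weights over the global style tokens, used instead of deriving an
# alignment from a reference wav.
tacotron_style_alignment=[0.4, 0.0, 0.2, 0.0, 0.1, 0.0, 0.3, 0.0, 0.0, 0.0],
```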

[screenshot of hparams settings] Here are my hparams settings. I specify a reference audio path, which is fed to the GST module (namely the reference encoder). For a trained model, the weights of the encoder, decoder, attention, and GST are all fixed. So I can't understand why I get different wavs with the same text and the same reference audio as input, given that there seem to be no random operations in the code.

MorganCZY avatar Sep 17 '20 12:09 MorganCZY

@MorganCZY In the original Tacotron-2, dropout is kept turned on during inference, and the same is true here. So every time you generate a wav, the audio will be different.

cnlinxi avatar Sep 17 '20 12:09 cnlinxi
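To see why that makes repeated runs differ, here is a minimal TF 1.x sketch (illustrative, not code from this repo) showing that tf.layers.dropout with training=True draws a fresh random mask on every execution:

```python
import tensorflow as tf  # TF 1.x, as used by this repo

x = tf.ones([1, 8])
# training=True keeps dropout active, just like the prenet at inference
y = tf.layers.dropout(x, rate=0.5, training=True)

with tf.Session() as sess:
    print(sess.run(y))  # a different mask each time the op runs,
    print(sess.run(y))  # so identical inputs still yield different outputs
```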

I'd like to ask about this as well: if I want the same tacotron_style_reference_audio to produce the same audio every time, what should I do?

CathyW77 avatar Sep 17 '20 12:09 CathyW77

@CathyW77 At synthesis time, turning off the dropout in the prenet should be enough. In the Prenet class in tacotron/models/modules.py, there is:

x = tf.layers.dropout(dense, rate=self.drop_rate, training=True, name='dropout_{}'.format(i + 1) + self.scope)

Set the training argument of tf.layers.dropout() to False at synthesis time.

cnlinxi avatar Sep 17 '20 13:09 cnlinxi
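A minimal sketch of that change (paraphrasing the Prenet line quoted above; self.is_training is assumed to be the flag the class already carries, as referenced in the next comment):

```python
# tacotron/models/modules.py, inside the Prenet loop (sketch, not a verified patch)
# Gate the dropout on the training flag instead of hard-coding training=True,
# so no units are dropped at synthesis time.
x = tf.layers.dropout(dense, rate=self.drop_rate,
                      training=self.is_training,  # False during synthesis
                      name='dropout_{}'.format(i + 1) + self.scope)
```

Note that the comments below report this change degrades synthesis, which is consistent with the original Tacotron-2 keeping prenet dropout on at inference.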

It's indeed the only random operation in the synthesis process, after searching the whole repo. But neither setting training=False nor training=self.is_training in the prenet generates correct wavs.

MorganCZY avatar Sep 18 '20 03:09 MorganCZY
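Since turning the prenet dropout off seems to break synthesis here, a possible workaround (not suggested in this thread) is to leave dropout on but fix TensorFlow's graph-level seed before the synthesis graph is built, so the dropout masks are reproducible across runs:

```python
import tensorflow as tf  # TF 1.x

# Call this once before the Tacotron graph is constructed (e.g. at the top of
# synthesize.py; the exact entry point is an assumption). Dropout ops then
# derive deterministic op-level seeds, so the same text and reference audio
# should produce the same wav on every run.
tf.set_random_seed(1234)
```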

@MorganCZY What do you mean by correct wav? It can't generate audio at all?

cnlinxi avatar Sep 18 '20 03:09 cnlinxi

samples.zip
true.wav ---> training=True
self_is_training.wav ---> training=self.is_training
false.wav ---> training=False

MorganCZY avatar Sep 18 '20 04:09 MorganCZY

@MorganCZY This has completely failed. Can you show a sample of your training corpus and the alignments during training?

cnlinxi avatar Sep 18 '20 04:09 cnlinxi

I trained this model on THCHS-30. alignment.zip Here are the latest three alignment graphs, corresponding to 60k, 65k, and 70k steps.

MorganCZY avatar Sep 18 '20 06:09 MorganCZY

@cnlinxi After I set it to False, all the generated audio is broken and doesn't produce a single word; changing it back to True makes generation normal again.

CathyW77 avatar Sep 18 '20 06:09 CathyW77

@CathyW77 Huh, why would that be? That's strange. Though admittedly I have never tried turning off this dropout.

cnlinxi avatar Sep 23 '20 04:09 cnlinxi

I trained this model on THCHS-30. alignment.zip Here are the latest three alignment graphs, corresponding to 60k, 65k, and 70k steps.

@MorganCZY

This is a bit strange. I'm sorry, I don't know what happened. The alignment is good; you could check your synthesis code.

cnlinxi avatar Sep 23 '20 05:09 cnlinxi