style-token_tacotron2

the trained model generates different wavs with the same text and reference audio

Open MorganCZY opened this issue 4 years ago • 15 comments

When doing tests, I found that each time I ran synthesize.py (with the same text and reference audio), I got different results (namely different synthesized wavs). After looking through the code, I didn't find any random operations during synthesis. Could you give me an explanation?

MorganCZY avatar Sep 17 '20 09:09 MorganCZY

Please specify the reference audio's path in 'tacotron_style_reference_audio' in hparams.py, then synthesize. Feel free to raise more questions.

cnlinxi avatar Sep 17 '20 10:09 cnlinxi

Yes, I have specified the reference audio path in hparams.py.


MorganCZY avatar Sep 17 '20 11:09 MorganCZY

In hparams.py:

tacotron_style_alignment=None,

you can manually specify style token alignment weights instead of getting them from reference audio.

Do you mean this?

cnlinxi avatar Sep 17 '20 11:09 cnlinxi
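For illustration, a hedged sketch of what that hparams.py setting might look like; the token count and weight values below are assumptions for illustration, not taken from this repo:

```python
# hparams.py (illustrative sketch, assuming the model was trained with 10 style tokens)
# Explicit weights over the global style tokens, used instead of deriving an
# alignment from a reference wav.
tacotron_style_alignment=[0.4, 0.0, 0.2, 0.0, 0.1, 0.0, 0.3, 0.0, 0.0, 0.0],
```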

[screenshot of hparams settings] Here are my hparams settings. I specify a reference audio path, which is fed to the GST module (namely the reference encoder). For a trained model, the weights of the encoder, decoder, attention, and GST are all fixed. So I can't understand why I get different wavs with the same text and the same reference audio as input, given that there seem to be no random operations in the code.

MorganCZY avatar Sep 17 '20 12:09 MorganCZY

@MorganCZY In the original Tacotron-2, dropout is kept turned on during inference, and the same is true here. So every time you generate a wav, the audio will be different.

cnlinxi avatar Sep 17 '20 12:09 cnlinxi
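To see why that makes repeated runs differ, here is a minimal TF 1.x sketch (illustrative, not code from this repo) showing that tf.layers.dropout with training=True draws a fresh random mask on every execution:

```python
import tensorflow as tf  # TF 1.x, as used by this repo

x = tf.ones([1, 8])
# training=True keeps dropout active, just like the prenet at inference
y = tf.layers.dropout(x, rate=0.5, training=True)

with tf.Session() as sess:
    print(sess.run(y))  # a different mask each time the op runs,
    print(sess.run(y))  # so identical inputs still yield different outputs
```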

I'd like to ask about this as well: if I want the same tacotron_style_reference_audio to produce the same audio every time, what should I do?

CathyW77 avatar Sep 17 '20 12:09 CathyW77

@CathyW77 At synthesis time, turning off the dropout in the prenet should be enough. In the Prenet class in tacotron/models/modules.py, there is:

x = tf.layers.dropout(dense, rate=self.drop_rate, training=True, name='dropout_{}'.format(i + 1) + self.scope)

Set the training argument of tf.layers.dropout() to False at synthesis time.

cnlinxi avatar Sep 17 '20 13:09 cnlinxi
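A minimal sketch of that change (paraphrasing the Prenet line quoted above; self.is_training is assumed to be the flag the class already carries, as referenced in the next comment):

```python
# tacotron/models/modules.py, inside the Prenet loop (sketch, not a verified patch)
# Gate the dropout on the training flag instead of hard-coding training=True,
# so no units are dropped at synthesis time.
x = tf.layers.dropout(dense, rate=self.drop_rate,
                      training=self.is_training,  # False during synthesis
                      name='dropout_{}'.format(i + 1) + self.scope)
```

Note that the comments below report this change degrades synthesis, which is consistent with the original Tacotron-2 keeping prenet dropout on at inference.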

It's indeed the only random operation in the synthesis process, after searching the whole repo. But neither setting training=False nor training=self.is_training in the prenet generates correct wavs.

MorganCZY avatar Sep 18 '20 03:09 MorganCZY
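Since turning the prenet dropout off seems to break synthesis here, a possible workaround (not suggested in this thread) is to leave dropout on but fix TensorFlow's graph-level seed before the synthesis graph is built, so the dropout masks are reproducible across runs:

```python
import tensorflow as tf  # TF 1.x

# Call this once before the Tacotron graph is constructed (e.g. at the top of
# synthesize.py; the exact entry point is an assumption). Dropout ops then
# derive deterministic op-level seeds, so the same text and reference audio
# should produce the same wav on every run.
tf.set_random_seed(1234)
```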

@MorganCZY What do you mean by correct wav? It can't generate audio at all?

cnlinxi avatar Sep 18 '20 03:09 cnlinxi

samples.zip
true.wav ---> training=True
self_is_training.wav ---> training=self.is_training
false.wav ---> training=False

MorganCZY avatar Sep 18 '20 04:09 MorganCZY

@MorganCZY This has completely failed. Can you show a sample of your training corpus and the alignments during training?

cnlinxi avatar Sep 18 '20 04:09 cnlinxi

I trained this model on THCHS-30. alignment.zip Here are the latest three alignment graphs, corresponding to 60k, 65k, and 70k steps.

MorganCZY avatar Sep 18 '20 06:09 MorganCZY

@cnlinxi After I set it to False, all the generated audio is broken and doesn't produce a single word; changing it back to True makes generation normal again.

CathyW77 avatar Sep 18 '20 06:09 CathyW77

@CathyW77 Huh, why would that be? That's strange. Though admittedly I have never tried turning off this dropout.

cnlinxi avatar Sep 23 '20 04:09 cnlinxi

I trained this model on THCHS-30. alignment.zip Here are the latest three alignment graphs, corresponding to 60k, 65k, and 70k steps.

@MorganCZY

This is a bit strange. I'm sorry, I don't know what happened. The alignment is good; you could check your synthesis code.

cnlinxi avatar Sep 23 '20 05:09 cnlinxi