gst-tacotron
Train as a Tacotron1 script problem
Thanks for your great work, but I found that if I set the hyperparameter `use_gst=False` and run training, the behavior differs from my understanding of Tacotron1. Here is the relevant part of tacotron.py:
```python
if reference_mel is not None:
  # Reference encoder
  refnet_outputs = reference_encoder(
    reference_mel,
    filters=hp.reference_filters,
    kernel_size=(3, 3),
    strides=(2, 2),
    encoder_cell=GRUCell(hp.reference_depth),
    is_training=is_training)                                   # [N, 128]
  self.refnet_outputs = refnet_outputs

  if hp.use_gst:
    # Style attention
    style_attention = MultiheadAttention(
      tf.expand_dims(refnet_outputs, axis=1),                  # [N, 1, 128]
      tf.tanh(tf.tile(tf.expand_dims(gst_tokens, axis=0), [batch_size, 1, 1])),  # [N, hp.num_gst, 256/hp.num_heads]
      num_heads=hp.num_heads,
      num_units=hp.style_att_dim,
      attention_type=hp.style_att_type)
    style_embeddings = style_attention.multi_head_attention()  # [N, 1, 256]
  else:
    style_embeddings = tf.expand_dims(refnet_outputs, axis=1)  # [N, 1, 128]
else:
  print("Use random weight for GST.")
  random_weights = tf.random_uniform([hp.num_heads, hp.num_gst], maxval=1.0, dtype=tf.float32)
  random_weights = tf.nn.softmax(random_weights, name="random_weights")
  style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
  style_embeddings = tf.reshape(style_embeddings, [1, 1] + [hp.num_heads * gst_tokens.get_shape().as_list()[1]])
```
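For reference, the shape flow of that random-weight branch can be sketched in NumPy (the dimensions below are hypothetical, chosen to match the shape comments in the snippet):

```python
import numpy as np

# Hypothetical hyperparameters matching the shape comments above.
num_heads = 4
num_gst = 10
token_dim = 256 // num_heads  # each token is 256/num_heads wide

# gst_tokens: [num_gst, token_dim]
gst_tokens = np.random.randn(num_gst, token_dim)

# Random attention weights over the tokens, one softmax row per head.
random_weights = np.random.uniform(size=(num_heads, num_gst))
random_weights = np.exp(random_weights) / np.exp(random_weights).sum(axis=-1, keepdims=True)

# [num_heads, num_gst] x [num_gst, token_dim] -> [num_heads, token_dim]
style_embeddings = random_weights @ np.tanh(gst_tokens)

# Concatenate the heads into a single style embedding.
style_embeddings = style_embeddings.reshape(1, 1, num_heads * token_dim)
print(style_embeddings.shape)  # (1, 1, 256)
```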
The original Tacotron1 shouldn't be trained with the reference encoder part, right? However, your code passes the non-GST data through a reference_encoder as well, which seems strange. Maybe we can swap the two if conditions to make it correct:
```python
if hp.use_gst:
  ***
  if reference_mel is not None:
    ***
```
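Spelled out, the reordering proposed above might look like this (a simplified sketch, not the repo's actual code; `reference_branch` and `random_branch` are hypothetical placeholders for the two bodies elided above):

```python
def choose_style_embeddings(use_gst, reference_mel, reference_branch, random_branch):
    """Proposed ordering: check use_gst first, so that non-GST
    training never runs the reference encoder."""
    if use_gst:
        if reference_mel is not None:
            # GST mode with a reference: reference encoder + style attention.
            return reference_branch(reference_mel)
        # GST mode without a reference: random weights over the tokens.
        return random_branch()
    # use_gst=False: plain Tacotron1, no style embedding at all.
    return None
```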
Thanks!
@dazenhom Hi, thanks for your notes. In this repo, `use_gst=False` doesn't mean the Tacotron1 model. Google also has another paper that uses a reference encoder to do style and multi-speaker synthesis. You can find it at https://arxiv.org/abs/1803.09047.
@syang1993 Thanks for your reply; I mistook your work for Tacotron1. I shall find another Tacotron1 implementation to run my test. Thanks anyway.
I have tried `use_gst=False`, but it seems to behave the same as Tacotron1: although refnet_outputs changes, the generated audio hardly changes with different reference audio.
@hyzhan In my experience, that may be because of your data. If you use some expressive speakers as your training data and then run inference, the speech can differ (change with the reference audio). Otherwise, as you mentioned, the output will hardly change.