gst-tacotron
Train as a Tacotron1 script problem
Thanks for your great work, but I found that if I set the hyperparameter `use_gst=False` and run training, the behavior differs from my understanding of Tacotron1. Here is the relevant part of tacotron.py:
```python
if reference_mel is not None:
  # Reference encoder
  refnet_outputs = reference_encoder(
    reference_mel,
    filters=hp.reference_filters,
    kernel_size=(3, 3),
    strides=(2, 2),
    encoder_cell=GRUCell(hp.reference_depth),
    is_training=is_training)                                   # [N, 128]
  self.refnet_outputs = refnet_outputs

  if hp.use_gst:
    # Style attention
    style_attention = MultiheadAttention(
      tf.expand_dims(refnet_outputs, axis=1),                  # [N, 1, 128]
      tf.tanh(tf.tile(tf.expand_dims(gst_tokens, axis=0), [batch_size, 1, 1])),  # [N, hp.num_gst, 256/hp.num_heads]
      num_heads=hp.num_heads,
      num_units=hp.style_att_dim,
      attention_type=hp.style_att_type)
    style_embeddings = style_attention.multi_head_attention()  # [N, 1, 256]
  else:
    style_embeddings = tf.expand_dims(refnet_outputs, axis=1)  # [N, 1, 128]
else:
  print("Use random weight for GST.")
  random_weights = tf.random_uniform([hp.num_heads, hp.num_gst], maxval=1.0, dtype=tf.float32)
  random_weights = tf.nn.softmax(random_weights, name="random_weights")
  style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
  style_embeddings = tf.reshape(style_embeddings, [1, 1] + [hp.num_heads * gst_tokens.get_shape().as_list()[1]])
```
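For reference, the shape flow of that random-weight branch can be sketched in NumPy (the dimensions below are hypothetical, chosen to match the shape comments in the snippet):

```python
import numpy as np

# Hypothetical hyperparameters matching the shape comments above.
num_heads = 4
num_gst = 10
token_dim = 256 // num_heads  # each token is 256/num_heads wide

# gst_tokens: [num_gst, token_dim]
gst_tokens = np.random.randn(num_gst, token_dim)

# Random attention weights over the tokens, one softmax row per head.
random_weights = np.random.uniform(size=(num_heads, num_gst))
random_weights = np.exp(random_weights) / np.exp(random_weights).sum(axis=-1, keepdims=True)

# [num_heads, num_gst] x [num_gst, token_dim] -> [num_heads, token_dim]
style_embeddings = random_weights @ np.tanh(gst_tokens)

# Concatenate the heads into a single style embedding.
style_embeddings = style_embeddings.reshape(1, 1, num_heads * token_dim)
print(style_embeddings.shape)  # (1, 1, 256)
```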
The original Tacotron1 shouldn't be trained with the reference encoder part, right? However, your code passes the non-GST data through a reference_encoder as well, which seems strange. Maybe we can swap the two if conditions to make it correct:
```python
if hp.use_gst:
  ***
  if reference_mel is not None:
    ***
```
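Spelled out, the reordering proposed above might look like this (a simplified sketch, not the repo's actual code; `reference_branch` and `random_branch` are hypothetical placeholders for the two bodies elided above):

```python
def choose_style_embeddings(use_gst, reference_mel, reference_branch, random_branch):
    """Proposed ordering: check use_gst first, so that non-GST
    training never runs the reference encoder."""
    if use_gst:
        if reference_mel is not None:
            # GST mode with a reference: reference encoder + style attention.
            return reference_branch(reference_mel)
        # GST mode without a reference: random weights over the tokens.
        return random_branch()
    # use_gst=False: plain Tacotron1, no style embedding at all.
    return None
```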
Thanks!
@dazenhom Hi, thanks for your notes. In this repo, `use_gst=False` doesn't mean the Tacotron1 model. Google also has another paper that uses a reference encoder to do style and multi-speaker synthesis. You can find it at https://arxiv.org/abs/1803.09047.
@syang1993 Thanks for your reply; I mistook your work for Tacotron1. I shall find another Tacotron1 implementation to run my test. Thanks anyway.
I have tried `use_gst=False`, but it seems to behave the same as Tacotron1: although refnet_outputs changes, the generated audio hardly changes with different reference audio.
@hyzhan In my experience, that may be because of your data. If you use some expressive speakers as your training data and then run inference, the speech can differ (change with the reference audio). Otherwise, as you mentioned, the output will hardly change.