vae_tacotron
Need some clarification regarding Reference encoder architecture
@yanggeng1995 In the paper (Section 2.2), the author states that the output of the GRU layer is passed through two separate fully connected layers, but in this implementation the last GRU *state* is what gets passed to the two FC layers:
```python
def ReferenceEncoder(inputs, input_lengths, filters, kernel_size, strides, is_training, scope='reference_encoder'):
    with tf.variable_scope(scope):
        reference_output = tf.expand_dims(inputs, axis=-1)
        # CNN stack
        for i, channel in enumerate(filters):
            reference_output = conv2d(reference_output, channel, kernel_size,
                                      strides, tf.nn.relu, is_training, 'conv2d_{}'.format(i))
        shape = shape_list(reference_output)
        reference_output = tf.reshape(reference_output, shape[:-2] + [shape[2] * shape[3]])
        # GRU
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
            cell=GRUCell(128),
            inputs=reference_output,
            sequence_length=input_lengths,
            dtype=tf.float32)
        return encoder_state
```
As you can see, `encoder_state` is returned instead of `encoder_outputs`.
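For a GRU this distinction is smaller than it looks: the output a GRU cell emits at each step *is* its new hidden state, so the final state equals the output at the last (valid) time step. The following is a minimal pure-NumPy sketch of a GRU recurrence (toy weights, not the `tf.nn.dynamic_rnn` API) illustrating that the two quantities coincide for full-length sequences:

```python
import numpy as np

def gru_step(x, h, Wz, Wr, Wh):
    """One GRU step; the emitted output IS the new hidden state."""
    z = 1 / (1 + np.exp(-(np.concatenate([x, h]) @ Wz)))  # update gate
    r = 1 / (1 + np.exp(-(np.concatenate([x, h]) @ Wr)))  # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)    # candidate state
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
D, H, T = 4, 3, 5                                   # input size, hidden size, steps
Wz, Wr, Wh = (rng.normal(size=(D + H, H)) for _ in range(3))
xs = rng.normal(size=(T, D))

h = np.zeros(H)
outputs = []
for x in xs:
    h = gru_step(x, h, Wz, Wr, Wh)
    outputs.append(h)
outputs = np.array(outputs)

# The final state equals the output row at the last time step.
assert np.allclose(outputs[-1], h)
```

So with `sequence_length` supplied to `dynamic_rnn`, returning `encoder_state` is equivalent to picking the row of `encoder_outputs` at each example's last valid step; the difference only matters for which extra projection is applied afterwards.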
On the other hand, the author mentions in the same section that they used the same reference encoder as in GST-Tacotron. I went through the best-known GST-Tacotron implementation on GitHub, i.e. https://github.com/syang1993/gst-tacotron, and there the reference encoder returns (a dense projection of) the last step of `encoder_outputs`, and it works fine:
```python
def reference_encoder(inputs, filters, kernel_size, strides, encoder_cell, is_training, scope='ref_encoder'):
    with tf.variable_scope(scope):
        ref_outputs = tf.expand_dims(inputs, axis=-1)
        # CNN stack
        for i, channel in enumerate(filters):
            ref_outputs = conv2d(ref_outputs, channel, kernel_size, strides, tf.nn.relu, is_training, 'conv2d_%d' % i)
        shapes = shape_list(ref_outputs)
        ref_outputs = tf.reshape(
            ref_outputs,
            shapes[:-2] + [shapes[2] * shapes[3]])
        # RNN
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
            encoder_cell,
            ref_outputs,
            dtype=tf.float32)
        reference_state = tf.layers.dense(encoder_outputs[:, -1, :], 128, activation=tf.nn.tanh)  # [N, 128]
        return reference_state
```
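One caveat about `encoder_outputs[:, -1, :]`: this `dynamic_rnn` call is not given a `sequence_length`, so index `-1` reads the output at the last *padded* step, and the GRU keeps updating through any padding. If you do pass per-example lengths, you would instead gather each example's output at its own last valid step, which is exactly what `encoder_state` gives you. A small NumPy sketch of that gather (the helper name `last_valid_output` is mine, not from either repo):

```python
import numpy as np

def last_valid_output(outputs, lengths):
    """Pick each sequence's output at its last valid (unpadded) time step.

    outputs: [N, T, H] RNN outputs; lengths: [N] valid lengths per example.
    """
    batch = np.arange(outputs.shape[0])
    return outputs[batch, lengths - 1]  # [N, H]

outputs = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)  # [N=2, T=4, H=3]
lengths = np.array([2, 4])
picked = last_valid_output(outputs, lengths)
assert np.allclose(picked[0], outputs[0, 1])  # first example ends at t=1
assert np.allclose(picked[1], outputs[1, 3])  # second runs the full 4 steps
```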
Interestingly, the GST-Tacotron paper itself says to use the last GRU state as the reference embedding.
Please take note and clarify whether to take `encoder_outputs` or `encoder_state` as the output of `reference_encoder`.
Thanks
Sorry, I am on the Spring Festival holiday, so I am a little late in seeing this. I use `encoder_state` as the output of `reference_encoder`, but the paper does not specify whether to use the state or the output; this needs to be verified by experiment. As for the paper's statement (Section 2.2) that the output of the GRU layer is passed through two separate fully connected layers, you can find that implementation here:
https://github.com/yanggeng1995/vae_tacotron/blob/b0288f1caa776a98195dd94d1e8ea7ca6ec05f57/models/modules.py#L5-L20
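For readers unfamiliar with the VAE side: those two separate FC layers are the standard mean and log-variance projections of a variational autoencoder, combined via the reparameterization trick. A hedged NumPy sketch of the idea (variable names and sizes are illustrative, not the linked repo's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, Z = 2, 128, 16                      # batch, reference-embedding, latent sizes
ref_embedding = rng.normal(size=(N, H))   # stand-in for the reference encoder output

# Two separate FC layers applied to the same input:
# one produces the mean, the other the log-variance of the latent distribution.
W_mu, b_mu = rng.normal(size=(H, Z)) * 0.01, np.zeros(Z)
W_lv, b_lv = rng.normal(size=(H, Z)) * 0.01, np.zeros(Z)
mu = ref_embedding @ W_mu + b_mu
log_var = ref_embedding @ W_lv + b_lv

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
# so the sampling stays differentiable with respect to mu and log_var.
eps = rng.normal(size=(N, Z))
z = mu + np.exp(0.5 * log_var) * eps

assert z.shape == (N, Z)
```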