vq-vae

About the main training details.

Open mazzzystar opened this issue 7 years ago • 0 comments

Thanks for your implementation! My understanding of VQ-VAE is not very clear, so I read through your code to see how it is implemented. Can the process be divided mainly into the following steps?

* Encode the quantized_raw_audio with several 1-D convolutions to get the encoder output z_e_x.
* Pass z_e_x through VQ to get z_q_x.
* Encode the quantized_raw_audio again, this time with a single 1-D convolution, to get quant_raw_out.

Then concatenate z_q_x with speaker_embedding, add quant_raw_out to that combination of z_q_x + speaker_embedding, and feed the result into the WaveNet decoder to reconstruct the audio.

I'm curious about the embedding process of:

conv = conv1d(inputs, filters=H, size=size, rate=rate, bn=True, padding="causal", scope="conv") # (B, T, H)
# conditions
speaker_emb = tf.tile(tf.expand_dims(speaker_emb, 1), [1, t, 1]) # (B, t, L)
cond = conv1d(inputs=tf.concat((speaker_emb, z_q), -1), filters=H, padding="causal", bn=True) # (B, t, H)

# Merge
cond = tf.expand_dims(cond, -2) # (B, t, 1, H)
# print(cond, conv)
conv = tf.reshape(conv, (B, t, -1, H)) # (B, t, ?, H)

conv += cond # (B, t, ?, H)
conv = tf.reshape(conv, (B, T, H))  # (B, T, H)

How can you be sure that conv, with shape (B, T, H), can be reshaped into (B, t, -1, H)?
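
To make the question concrete, here is a minimal NumPy sketch of what I think the merge step requires. It is not taken from the repo: the sizes, the factor `k`, and the stand-in arrays are my own assumptions for illustration. My guess is that the reshape only works when T is an exact multiple of t, for example if the encoder downsamples the audio by a fixed factor.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not taken from the repo).
B, t, H = 2, 5, 4   # batch size, latent timesteps, hidden units
k = 3               # assumed number of audio samples per latent timestep
T = t * k           # audio timesteps; the reshape below needs T % t == 0

conv = np.zeros((B, T, H))  # stands in for the causal conv output
cond = np.arange(B * t * H, dtype=float).reshape(B, t, H)  # stands in for the conditioning

merged = conv.reshape(B, t, -1, H)     # (B, t, k, H)
merged = merged + cond[:, :, None, :]  # cond broadcasts over the k axis
out = merged.reshape(B, T, H)          # back to (B, T, H)

# Every audio step inside the first latent step received the same cond vector.
same_block = np.array_equal(out[:, :k], np.repeat(cond[:, :1], k, axis=1))

# If T is not a multiple of t, the reshape cannot be done.
try:
    np.zeros((B, T + 1, H)).reshape(B, t, -1, H)
    reshape_failed = False
except ValueError:
    reshape_failed = True
```

If that reading is right, each block of `k` consecutive audio steps shares one conditioning vector, and the reshape is only safe when the input length is fixed to a multiple of the encoder's total downsampling factor.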

mazzzystar · Oct 07 '18 10:10