vq-vae
About the main training details.
Thanks for your implementation!
As my understanding of VQ-VAE is not entirely clear, I tried to work through your code. Can the process be mainly divided into the following steps?
* Encode the quantized_raw_audio with several 1-D convolutions to get the encoder output z_e_x.
* Pass z_e_x through the VQ layer to get z_q_x (a small sketch of these first two steps follows this list).
* Encode the quantized_raw_audio again with a single 1-D convolution to get quant_raw_out. Then concatenate z_q_x with the speaker_embedding, add the result to quant_raw_out, and feed the sum into the WaveNet decoder to reconstruct the waveform.
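To check my own understanding of steps 1-2, here is a minimal, self-contained numpy sketch. It is not taken from this repo: the "encoder" output is a random placeholder, and `hop`, `K` and `D` are made-up numbers; only the nearest-codebook lookup reflects the general VQ-VAE quantization mechanism.

```python
# Minimal numpy sketch (my own illustration, not the repo's code) of steps 1-2:
# getting the encoder output z_e_x and vector-quantizing it into z_q_x.
import numpy as np

B, T, D, K = 2, 12, 4, 8       # batch, raw-audio length, latent dim, codebook size
hop = 4                        # assumed total downsampling factor of the encoder
t = T // hop                   # latent (downsampled) length

quantized_raw_audio = np.random.randn(B, T, 1)

# "Encoder": placeholder for the repo's stack of 1-D convolutions, shape (B, t, D)
z_e_x = np.random.randn(B, t, D)

# Vector quantization: replace each z_e_x vector with its nearest codebook entry
codebook = np.random.randn(K, D)
dists = ((z_e_x[..., None, :] - codebook) ** 2).sum(-1)   # (B, t, K)
indices = dists.argmin(-1)                                 # (B, t)
z_q_x = codebook[indices]                                  # (B, t, D)
print(z_q_x.shape)                                         # (2, 3, 4)
```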
I'm also curious about the following conditioning/embedding code:
```python
conv = conv1d(inputs, filters=H, size=size, rate=rate, bn=True, padding="causal", scope="conv")  # (B, T, H)
# conditions
speaker_emb = tf.tile(tf.expand_dims(speaker_emb, 1), [1, t, 1])  # (B, t, L)
cond = conv1d(inputs=tf.concat((speaker_emb, z_q), -1), filters=H, padding="causal", bn=True)  # (B, t, H)
# Merge
cond = tf.expand_dims(cond, -2)  # (B, t, 1, H)
# print(cond, conv)
conv = tf.reshape(conv, (B, t, -1, H))  # (B, t, ?, H)
conv += cond  # (B, t, ?, H)
conv = tf.reshape(conv, (B, T, H))  # (B, T, H)
```
How can you be sure that conv, with shape (B, T, H), can be reshaped into (B, t, -1, H)? Doesn't that require T to be an integer multiple of t?
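For what it's worth, here is a tiny numpy illustration (again my own, not the repo's code) of why the reshape only works when T is an integer multiple of t, and of how each of the t condition frames then gets broadcast over the T // t audio samples it covers:

```python
# My own broadcast/reshape illustration of the conditioning merge above.
import numpy as np

B, T, H = 2, 12, 4      # raw-audio length T
t = 3                   # latent (condition) length; here T // t == 4

conv = np.random.randn(B, T, H)          # one feature vector per audio sample
cond = np.random.randn(B, t, H)          # one condition vector per latent frame

conv4 = conv.reshape(B, t, T // t, H)    # (B, t, 4, H) -- valid only because T % t == 0
conv4 = conv4 + cond[:, :, None, :]      # broadcast each condition frame over its 4 samples
out = conv4.reshape(B, T, H)             # back to (B, T, H)
print(out.shape)                         # (2, 12, 4)

# If T were not divisible by t (e.g. T = 13), reshape would raise a ValueError.
```

So I assume the model has to guarantee that T equals t times the encoder's total downsampling factor; is that right?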