
Differences in Architecture Between Code and Paper

Open taubaaron opened this issue 3 years ago • 1 comment

Hey, firstly - thank you very much for sharing your work, it really is interesting.

I have a few questions regarding the implementation of the paper "AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss":

  1. Section 3.1, "Problem Formulation", explains (and Figure 1 shows) that the output of the speaker encoder (whose input is a target-speaker utterance) is fed directly into the decoder, i.e. after the bottleneck. In the code implementation, on the other hand, the speaker encoder's output appears to be concatenated with the mel-spectrogram and fed into the content encoder, rather than injected only after the bottleneck.

  2. Figure 1 also shows that, during training, the "style" is taken from the same speaker but from a different file/segment. Is that implemented in the code as well? It didn't seem like it, but I might be missing something.

  3. In Table 1 (page 8), you present classification results on the output of the content encoder. Is there a way I can reproduce those results? (Could you share that part of the code as well?)

Thanks very much, Aaron

taubaaron avatar Oct 21 '21 06:10 taubaaron

  1. The speaker embedding is also concatenated with the encoder output before being fed into the decoder, so it is injected in both places.
  2. Yes, the speaker embedding is extracted from the same speaker, but most likely from different utterances.
  3. Just train a classifier on the encoder output.
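To illustrate point 1, here is a minimal numpy sketch of the shape bookkeeping (the dimensions and variable names are assumptions for illustration, not the repository's actual values): the speaker embedding is tiled across time and concatenated twice, once with the mel-spectrogram at the content-encoder input and once with the content codes before the decoder.

```python
import numpy as np

# Assumed illustrative dimensions (not the repository's exact config)
batch, n_frames, n_mels, dim_emb, dim_neck = 2, 128, 80, 256, 32

mel = np.random.randn(batch, n_frames, n_mels)   # source utterance
spk_emb = np.random.randn(batch, dim_emb)        # speaker embedding

# Tile the per-utterance embedding so it can be attached to every frame
emb_tiled = np.repeat(spk_emb[:, None, :], n_frames, axis=1)

# (1) concatenate with the mel-spectrogram -> content-encoder input
enc_in = np.concatenate([mel, emb_tiled], axis=-1)

# ... content encoder + bottleneck would produce content codes here ...
codes = np.random.randn(batch, n_frames, dim_neck)  # stand-in for codes

# (2) concatenate the (target) speaker embedding again -> decoder input
dec_in = np.concatenate([codes, emb_tiled], axis=-1)
```

At conversion time, the embedding concatenated in step (2) would come from the target speaker, while step (1) uses the source speaker's embedding.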

auspicious3000 avatar Oct 21 '21 14:10 auspicious3000
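The probe suggested in answer 3 can be sketched as follows. This is a hypothetical stand-in, not the paper's code: it uses a nearest-centroid classifier on fake "content codes" in place of a trained network, purely to show the shape of the experiment. If the bottleneck has removed speaker information, a classifier trained on the codes should score near chance (1/number of speakers).

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, per_spk, dim = 4, 50, 32

# Stand-in content codes: drawn from one shared distribution, mimicking
# codes that carry no speaker information after the bottleneck.
codes = rng.standard_normal((n_speakers * per_spk, dim))
labels = np.repeat(np.arange(n_speakers), per_spk)

# Nearest-class-centroid probe as a minimal stand-in for a classifier
centroids = np.stack([codes[labels == s].mean(0) for s in range(n_speakers)])
dists = ((codes[:, None, :] - centroids[None]) ** 2).sum(-1)
pred = np.argmin(dists, axis=1)
acc = (pred == labels).mean()
# Accuracy near chance (here 0.25) would indicate speaker info was removed.
```

In the actual experiment, `codes` would be the content-encoder outputs for held-out utterances and the probe would be a trained classifier evaluated on a separate test split.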