autovc
Why does the original speaker embedding need to be concatenated with the original speaker's spectrogram?
Theoretically, the original speaker embedding information is already contained in the spectrogram, and the network should automatically squeeze that information out of the spectrogram after convergence. So why is the original speaker embedding still needed as a separate input?
So that the encoder does not need to learn that information from the spectrogram.
@auspicious3000 I don't quite understand what you mean. Is the "encoder" you mentioned the content encoder or the speaker encoder? Could you please explain in more detail? I can't find the answer in the paper. Actually, I'm wondering what would happen if the original speaker embedding were discarded from the content encoder's input...
The content encoder. Without the speaker embedding, it is harder for the encoder to extract that information from the spectrogram. Since you already have that information, just give it to the encoder so that it does not need to learn it from the spectrogram.
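Here is a minimal sketch of the concatenation step being discussed (all shapes and variable names are illustrative, not the repo's exact code): the time-invariant speaker embedding is tiled along the time axis and stacked with the mel frames, so the content encoder receives the identity directly instead of having to infer it.

```python
import torch

batch, n_mels, T = 2, 80, 128    # mel spectrogram: (batch, n_mels, frames) -- illustrative sizes
dim_emb = 256                    # speaker embedding size -- illustrative

mel = torch.randn(batch, n_mels, T)
emb_org = torch.randn(batch, dim_emb)

# Tile the time-invariant speaker embedding along the time axis,
# so every frame is paired with the same identity vector.
emb_tiled = emb_org.unsqueeze(-1).expand(-1, -1, T)   # (batch, dim_emb, T)

# Concatenate along the channel axis: the content encoder's input is
# now (batch, n_mels + dim_emb, T) instead of (batch, n_mels, T).
enc_input = torch.cat((mel, emb_tiled), dim=1)
print(enc_input.shape)           # torch.Size([2, 336, 128])
```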
@auspicious3000 It's still not intuitive for me. Maybe I should read some related papers behind this idea. Do you have any recommendations?
There must be papers describing this technique, but I don't know any offhand. It should be very simple and intuitive. For example, suppose you need to solve A and B; if I give you the answer to B, you only need to solve A. That's it.
@auspicious3000 Yes, viewed from the whole network, you are right. Suppose B is emb_trg and A is the original content; that is intuitive. My question is why emb_org still needs to be concatenated with the original spectrogram during training, given that emb_org and emb_trg are identical at training time. My thought is that the content encoder could still learn to output only the content information even without emb_org concatenated to the original spectrogram.
My explanation is about the encoder, and I understood your question perfectly. Without feeding in emb_org, the encoder can still learn to disentangle content and identity, but it is easier if the identity is already given.
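To make the training/conversion distinction concrete, here is a toy sketch of the wiring (ToyAutoVC and its layers are made-up stand-ins; the real model has a downsampled bottleneck, LSTMs, and a postnet): during training both embedding inputs are the same speaker's, and only at conversion time does emb_trg differ.

```python
import torch
import torch.nn as nn

class ToyAutoVC(nn.Module):
    """Toy stand-in for an AutoVC-style encoder/decoder pair (illustration only)."""
    def __init__(self, n_mels=80, dim_emb=16, dim_code=8):
        super().__init__()
        # Content encoder sees mel + SOURCE identity (the point of this thread).
        self.enc = nn.Conv1d(n_mels + dim_emb, dim_code, kernel_size=1)
        # Decoder sees content code + TARGET identity.
        self.dec = nn.Conv1d(dim_code + dim_emb, n_mels, kernel_size=1)

    def forward(self, mel, emb_src, emb_tgt):
        T = mel.size(-1)
        tile = lambda e: e.unsqueeze(-1).expand(-1, -1, T)
        code = self.enc(torch.cat((mel, tile(emb_src)), dim=1))
        return self.dec(torch.cat((code, tile(emb_tgt)), dim=1))

model = ToyAutoVC()
mel_A = torch.randn(2, 80, 64)
emb_A = torch.randn(2, 16)   # speaker A's embedding
emb_B = torch.randn(2, 16)   # speaker B's embedding

# Training: self-reconstruction, so emb_org and emb_trg are identical.
recon = model(mel_A, emb_A, emb_A)
loss = torch.nn.functional.l1_loss(recon, mel_A)

# Conversion: same content input, but a different target identity.
converted = model(mel_A, emb_A, emb_B)
```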
@auspicious3000 Here is where I got confused. The content encoder is just an encoder, just a network. A network can generate any possible output unless it gets some guidance, and in my thinking that "guidance" should come from the backpropagation algorithm, not from the input. Now I'm starting to see why deep learning is sometimes jokingly called "alchemy" in China: the emb_org here is very similar to a material called "YaoYin" (a guiding ingredient) in ancient Chinese alchemy (don't take it seriously, just a joke). Maybe the issue is my way of thinking; I'm used to reasoning about everything in a linear way, while a deep network is not a linear system.
Thank you, and thanks to the author. I think this idea of concatenating the speaker embedding to the mels is an outstanding innovation.
- When data is limited, or we do not train to convergence, this idea is useful.
- Although the speaker embedding could in principle be recovered from the mels by the NN encoder, this idea is like a "structured-output NN", or like the guided attention used with Tacotron.
You are both right~ And this is a really good idea.
I suddenly thought of an example: multi-speaker ASR.
- If we just use mels or MFCCs as input to the NN, that works when the dataset covers many people.
- If we use mels or MFCCs as input and train a separate ASR model for each person, that also works.
- But the best way is to use the mels and the speaker id together as input, i.e. to condition the model on the speaker id, P(output | mel, speaker id); each speaker's mapping becomes more focused while the weights are still shared across speakers~ (see the sketch below)
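A toy sketch of that third option (SpeakerConditionedASR, the per-frame phone classifier setup, and all sizes are assumptions for illustration): a learned speaker-id embedding is concatenated with the acoustic features at every frame, so one shared network serves every speaker while still being conditioned on identity.

```python
import torch
import torch.nn as nn

class SpeakerConditionedASR(nn.Module):
    """Toy acoustic model shared across speakers, conditioned on speaker id."""
    def __init__(self, n_mels=40, n_speakers=10, dim_spk=8, n_phones=50):
        super().__init__()
        self.spk_table = nn.Embedding(n_speakers, dim_spk)  # learned per-speaker vector
        self.rnn = nn.GRU(n_mels + dim_spk, 64, batch_first=True)
        self.out = nn.Linear(64, n_phones)                  # per-frame phone logits

    def forward(self, feats, spk_id):
        # feats: (batch, T, n_mels); spk_id: (batch,)
        T = feats.size(1)
        # Tile the speaker vector over time and concatenate with the features,
        # mirroring the emb_org-plus-spectrogram trick discussed above.
        spk = self.spk_table(spk_id).unsqueeze(1).expand(-1, T, -1)
        h, _ = self.rnn(torch.cat((feats, spk), dim=-1))
        return self.out(h)

model = SpeakerConditionedASR()
logits = model(torch.randn(2, 100, 40), torch.tensor([3, 7]))
print(logits.shape)  # torch.Size([2, 100, 50])
```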