autovc
Why does the original speaker embedding need to be concatenated with the original speaker's spectrogram?
Theoretically, the original speaker embedding information is already contained in the spectrogram, and the network should automatically squeeze that information out of the spectrogram after convergence. So why is the original speaker embedding still needed as a separate input?
So that the encoder does not need to learn that information from the spectrogram.
@auspicious3000 I don't quite understand what you mean. Is the "encoder" you mentioned the content encoder or the speaker encoder? Could you please explain in more detail? I can't find the answer in the paper. Actually, I'm wondering what would happen if the original speaker embedding were discarded from the content encoder's input...
The content encoder. Without the speaker embedding, it is harder for the encoder to extract that information from the spectrogram. Since you already have that information, just give it to the encoder so that it does not need to learn it from the spectrogram.
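Here is a minimal sketch of the concatenation step being discussed (all shapes and variable names are illustrative, not the repo's exact code): the time-invariant speaker embedding is tiled along the time axis and stacked with the mel frames, so the content encoder receives the identity directly instead of having to infer it.

```python
import torch

batch, n_mels, T = 2, 80, 128    # mel spectrogram: (batch, n_mels, frames) -- illustrative sizes
dim_emb = 256                    # speaker embedding size -- illustrative

mel = torch.randn(batch, n_mels, T)
emb_org = torch.randn(batch, dim_emb)

# Tile the time-invariant speaker embedding along the time axis,
# so every frame is paired with the same identity vector.
emb_tiled = emb_org.unsqueeze(-1).expand(-1, -1, T)   # (batch, dim_emb, T)

# Concatenate along the channel axis: the content encoder's input is
# now (batch, n_mels + dim_emb, T) instead of (batch, n_mels, T).
enc_input = torch.cat((mel, emb_tiled), dim=1)
print(enc_input.shape)           # torch.Size([2, 336, 128])
```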
@auspicious3000 It's still not intuitive for me. Maybe I should read some related papers behind this idea. Do you have any recommendations?
There must be papers describing this technique, but I don't know any offhand. It should be very simple and intuitive. For example, suppose you need to solve A and B; if I give you the answer to B, you only need to solve A. That's it.
@auspicious3000 Yes, viewed from the whole network, you are right. Suppose B is emb_trg and A is the original content; that is intuitive. My question is why emb_org still needs to be concatenated with the original spectrogram during training, given that emb_org and emb_trg are identical at training time. My thought is that the content encoder could still learn to output only the content information even without emb_org concatenated to the original spectrogram.
My explanation is about the encoder, and I understood your question perfectly. Without feeding in emb_org, the encoder can still learn to disentangle content and identity, but it is easier if the identity is already given.
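To make the training/conversion distinction concrete, here is a toy sketch of the wiring (ToyAutoVC and its layers are made-up stand-ins; the real model has a downsampled bottleneck, LSTMs, and a postnet): during training both embedding inputs are the same speaker's, and only at conversion time does emb_trg differ.

```python
import torch
import torch.nn as nn

class ToyAutoVC(nn.Module):
    """Toy stand-in for an AutoVC-style encoder/decoder pair (illustration only)."""
    def __init__(self, n_mels=80, dim_emb=16, dim_code=8):
        super().__init__()
        # Content encoder sees mel + SOURCE identity (the point of this thread).
        self.enc = nn.Conv1d(n_mels + dim_emb, dim_code, kernel_size=1)
        # Decoder sees content code + TARGET identity.
        self.dec = nn.Conv1d(dim_code + dim_emb, n_mels, kernel_size=1)

    def forward(self, mel, emb_src, emb_tgt):
        T = mel.size(-1)
        tile = lambda e: e.unsqueeze(-1).expand(-1, -1, T)
        code = self.enc(torch.cat((mel, tile(emb_src)), dim=1))
        return self.dec(torch.cat((code, tile(emb_tgt)), dim=1))

model = ToyAutoVC()
mel_A = torch.randn(2, 80, 64)
emb_A = torch.randn(2, 16)   # speaker A's embedding
emb_B = torch.randn(2, 16)   # speaker B's embedding

# Training: self-reconstruction, so emb_org and emb_trg are identical.
recon = model(mel_A, emb_A, emb_A)
loss = torch.nn.functional.l1_loss(recon, mel_A)

# Conversion: same content input, but a different target identity.
converted = model(mel_A, emb_A, emb_B)
```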
@auspicious3000 Here is where I got confused. The content encoder is just an encoder, just a network. A network can generate any possible output unless it gets some guidance, and in my thinking that "guidance" should come from the backpropagation algorithm, not from the input. Now I'm starting to see why deep learning is sometimes jokingly called "alchemy" in China: the emb_org here is very similar to a material called "YaoYin" (a guiding ingredient) in ancient Chinese alchemy (don't take it seriously, just a joke). Maybe the issue is my way of thinking; I'm used to reasoning about everything in a linear way, while a deep network is not a linear system.
Thank you, and thanks to the author. I think this idea of concatenating the speaker embedding to the mels is an outstanding innovation.
- When data is limited, or we do not train to convergence, this idea is useful.
- Although the speaker embedding could in principle be recovered from the mels by the NN encoder, this idea is like a "structured-output NN", or like the guided attention used with Tacotron.
You are both right~ And this is a really good idea.
I suddenly thought of an example: multi-speaker ASR.
- If we just use mels or MFCCs as input to the NN, that works when the dataset covers many people.
- If we use mels or MFCCs as input and train a separate ASR model for each person, that also works.
- But the best way is to use the mels and the speaker id together as input, i.e. to condition the model on the speaker id, P(output | mel, speaker id); each speaker's mapping becomes more focused while the weights are still shared across speakers~ (see the sketch below)
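A toy sketch of that third option (SpeakerConditionedASR, the per-frame phone classifier setup, and all sizes are assumptions for illustration): a learned speaker-id embedding is concatenated with the acoustic features at every frame, so one shared network serves every speaker while still being conditioned on identity.

```python
import torch
import torch.nn as nn

class SpeakerConditionedASR(nn.Module):
    """Toy acoustic model shared across speakers, conditioned on speaker id."""
    def __init__(self, n_mels=40, n_speakers=10, dim_spk=8, n_phones=50):
        super().__init__()
        self.spk_table = nn.Embedding(n_speakers, dim_spk)  # learned per-speaker vector
        self.rnn = nn.GRU(n_mels + dim_spk, 64, batch_first=True)
        self.out = nn.Linear(64, n_phones)                  # per-frame phone logits

    def forward(self, feats, spk_id):
        # feats: (batch, T, n_mels); spk_id: (batch,)
        T = feats.size(1)
        # Tile the speaker vector over time and concatenate with the features,
        # mirroring the emb_org-plus-spectrogram trick discussed above.
        spk = self.spk_table(spk_id).unsqueeze(1).expand(-1, T, -1)
        h, _ = self.rnn(torch.cat((feats, spk), dim=-1))
        return self.out(h)

model = SpeakerConditionedASR()
logits = model(torch.randn(2, 100, 40), torch.tensor([3, 7]))
print(logits.shape)  # torch.Size([2, 100, 50])
```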