
confusion with speaker encoder and loss func

Open · andylida opened this issue 5 years ago · 10 comments

Thanks for this code. I didn't find any implementation of the speaker encoder in the demo; is it left out of this demo?

Also, in the loss function figure I can't figure out the difference between L_recon and L_recon0.

Thanks a lot for any guidance.

andylida avatar Oct 02 '19 15:10 andylida

Please refer to #24 for the speaker encoder. You don't need a speaker encoder if you don't do zero-shot conversion. "During training, reconstruction loss is applied to both the initial and final reconstruction results"
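A minimal sketch of that alternative, assuming a fixed set of training speakers; the names here are hypothetical, not the repo's:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: for conversion between speakers *seen* in training,
# a learned lookup table can replace a pretrained speaker encoder.
num_speakers, dim_emb = 40, 256
speaker_table = nn.Embedding(num_speakers, dim_emb)  # trained jointly with the model

emb_org = speaker_table(torch.tensor([3]))   # (1, 256) source speaker embedding
emb_trg = speaker_table(torch.tensor([17]))  # (1, 256) target speaker embedding

# Zero-shot conversion instead derives emb_org/emb_trg from a GE2E-style
# speaker encoder run on reference utterances of unseen speakers (see #24).
```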

auspicious3000 avatar Oct 02 '19 15:10 auspicious3000

So the initial one is the output after the 3-layer LSTM, and the final one is the one refined by the residual blocks?

Also, in the code, what are emb_org and emb_trg? I thought emb_trg comes from the speaker encoder, i.e. it is the output given emb_org as input. If so, while converting, why does the code concatenate the source speaker and target speaker embeddings to feed the content encoder? The target speaker doesn't contribute any content.

Thanks for the guidance.

andylida avatar Oct 03 '19 03:10 andylida

@andylida did you understand the difference between L_recon and L_recon0?

arunasank avatar Jun 12 '20 21:06 arunasank

@arunasank As I understand it, L_recon is computed on the mel after the postnet and L_recon0 on the mel before the postnet. Tacotron 2 uses a similar postnet structure and loss.
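In code terms, that is two MSE losses against the same target mel. A minimal sketch; the variable names are illustrative, not necessarily the repo's:

```python
import torch.nn.functional as F

# mel_pre:  decoder output before the postnet         -> L_recon0
# mel_post: mel_pre + postnet residual (final output) -> L_recon
loss_recon0 = F.mse_loss(mel_pre, mel_target)
loss_recon = F.mse_loss(mel_post, mel_target)

# Both terms are applied during training, as stated above.
loss = loss_recon + loss_recon0
```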

CODEJIN avatar Jun 14 '20 03:06 CODEJIN

Hi @CODEJIN. I have read the AutoVC and Tacotron papers, but neither seems to say much about why a postnet is used in the first place. Where can I learn more about this? I am wondering why it is necessary, because when I train my AutoVC models, the postnet is very quickly trained to output nothing but low zero-mean values, while the prenet output generates all of the mel spectrogram's visible detail.

Currently, with the AutoVC models I have trained, the postnet provides only a very faint output. In the figure below, the 1st row shows the original x_input data, the 2nd the output associated with L_recon0, the 3rd the output associated with L_recon (if the images were scaled between 0 and 1, these images would be almost totally black), and the 4th the combined prenet and postnet output, which looks identical to the 2nd row.

[figure: 500000iterations]

The postnet seems to output nothing but seemingly negligible values after 10k iterations (the figure shown is actually from 584k iterations). Does anyone have any thoughts on this? I would love to know where I can learn more about the use of postnets. Thanks for taking the time to read this far! 👍

Trebolium avatar Dec 12 '20 21:12 Trebolium

@Trebolium The Tacotron 2 paper does not specifically state the purpose of the postnet. My personal guess is that it increases the detail of the mel. The postnet is a residual structure (output = x + f(x)), so it only needs to contribute a small correction to the pre-mel. As a result, the postnet sharpens the detail of the mel, and in a trained model the post-postnet output usually shows lower loss than the prenet output.
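A minimal sketch of that residual wiring, with illustrative layer sizes in the spirit of Tacotron 2's postnet; because the output is x plus a learned correction, a near-zero residual (as @Trebolium observed) still yields a valid mel:

```python
import torch.nn as nn

class PostNet(nn.Module):
    """Residual refiner: output = x + f(x), where f is a small conv stack."""
    def __init__(self, n_mels=80, channels=512, kernel=5, layers=5):
        super().__init__()
        dims = [n_mels] + [channels] * (layers - 1) + [n_mels]
        blocks = []
        for i in range(layers):
            blocks += [nn.Conv1d(dims[i], dims[i + 1], kernel, padding=kernel // 2),
                       nn.BatchNorm1d(dims[i + 1])]
            if i < layers - 1:          # no activation on the last layer
                blocks.append(nn.Tanh())
        self.f = nn.Sequential(*blocks)

    def forward(self, mel_pre):          # mel_pre: (batch, n_mels, time)
        return mel_pre + self.f(mel_pre) # learned correction added to pre-mel
```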

CODEJIN avatar Dec 13 '20 03:12 CODEJIN

Do you know where I could learn more about postnet implementations? It's a tricky thing to just Google. Thanks for replying so quickly!

Trebolium avatar Dec 13 '20 14:12 Trebolium

@Trebolium Tacotron has a postnet because Tacotron generates mels autoregressively, so each frame is conditioned only on the frames before it. Put another way, the mels before the postnet have to do two jobs: (1) represent the mel content and (2) serve as input for the autoregressive recursion. But real mels (1) depend on both past and future context and (2) only need to model content. So a CNN is used as a postnet to refine the mels with access to both directions.

However, the author's network may not need a postnet if the LSTM is bidirectional. Also, I find that this code differs from the paper's postnet; this code has an LSTM in the postnet.
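A small sketch of the directionality point above; the names are illustrative. An autoregressive decoder sees only past frames, while a non-causal postnet convolution sees a symmetric window of past and future frames:

```python
import torch
import torch.nn as nn

# A non-causal conv with symmetric padding: frame t is refined using
# frames t-2..t+2, i.e. both past and future context, which the
# autoregressive decoder could not use when it emitted frame t.
refine = nn.Conv1d(80, 80, kernel_size=5, padding=2)

mel_pre = torch.randn(1, 80, 100)   # (batch, n_mels, time)
mel_post = mel_pre + refine(mel_pre)
```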

ruclion avatar Dec 23 '20 09:12 ruclion

@ruclion Interesting. Now I don't know the clear reason. I think it would be better to ask the author of the paper (the owner of this repository)... :) Also, please let me know where the LSTM is in this code. When I checked this repo, the postnet was here, and it contains only several convolution layers.

CODEJIN avatar Jan 02 '21 20:01 CODEJIN

@CODEJIN Yeah, you are right, the author's postnet is a CNN, no LSTM. Haha, thank you!

ruclion avatar Jan 06 '21 07:01 ruclion