AdaSpeech Conditional Layer Normalization

Hi, I have followed your work for several months and am pleasantly surprised at how quickly you track new algorithms. For AdaSpeech, have you verified that the two acoustic encoders really help when training on custom speakers? How do they compare to a speaker embedding generated by a speaker encoder of the kind used in speaker verification? And as for "Conditional Layer Normalization", you have not implemented it, right? Are the following references suitable if I implement it myself? Or can you give any suggestions on how to do this? https://github.com/exe1023/CBLN/blob/e395edc2d6d952497b411f81eae4aafb96749bc2/model/cbn.py https://github.com/CyberZHG/torch-layer-normalization/blob/master/torch_layer_normalization/layer_normalization.py

Mar 17 '21 04:03 Liujingxiu23

In my opinion, the utterance-level encoder is an alternative to an external speaker encoder model. So using an external speaker encoder model to extract the speaker embedding may work better.

Mar 17 '21 12:03 hoyden

@Liujingxiu23 https://github.com/CyberZHG/torch-layer-normalization/blob/master/torch_layer_normalization/layer_normalization.py works well. Yes, a speaker embedding generated by a speaker encoder of the kind used in speaker verification works.

May 04 '21 23:05 rishikksh20
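
For readers following the thread, here is a minimal sketch of the arithmetic behind conditional layer normalization as AdaSpeech describes it: normalize the hidden vector as in ordinary layer norm, then take the scale and bias from two small linear layers applied to the speaker embedding instead of from fixed learned vectors. It is written in plain Python so the steps are visible. The function names, shapes, and toy values are assumptions made for illustration, not code from this repository or from either linked file; in a real model this would be a PyTorch module operating on batched tensors.

```python
import math


def linear(weights, bias, vec):
    """Affine map: `weights` is a list of rows, one row per output feature."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def conditional_layer_norm(x, speaker_emb, scale_proj, bias_proj, eps=1e-5):
    """Layer-normalize one hidden vector, then apply a speaker-dependent
    scale (gamma) and bias (beta).

    scale_proj and bias_proj are hypothetical (weights, bias) pairs for two
    separate linear layers mapping the speaker embedding to per-feature
    gamma and beta.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gamma = linear(*scale_proj, speaker_emb)
    beta = linear(*bias_proj, speaker_emb)
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]


# Toy example: hidden size 4, speaker embedding size 2 (made-up values).
hidden = [1.0, 2.0, 3.0, 4.0]
speaker = [0.5, -0.5]

# Zero weights with bias 1 for the scale and bias 0 for the shift reduce
# the layer to plain layer norm, which is a common way to initialize it.
scale_proj = ([[0.0, 0.0]] * 4, [1.0] * 4)
bias_proj = ([[0.0, 0.0]] * 4, [0.0] * 4)
out = conditional_layer_norm(hidden, speaker, scale_proj, bias_proj)

# A constant shift of 3.0 moves the mean of the output to 3.0.
shifted = conditional_layer_norm(
    hidden, speaker, scale_proj, ([[0.0, 0.0]] * 4, [3.0] * 4))
```

Because only the two projections depend on the speaker, these are the small set of parameters that would be fine-tuned when adapting to a new voice, while the rest of the decoder stays shared.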

@rishikksh20 Thank you for your reply. I am trying this and other similar methods to build a personalized TTS system that uses audio recorded by users on their mobile phones. But the results are not very good: shakiness and instability are the main problems in the synthesized wavs. I am wondering whether the vocoder is the problem; I could not find a universal deep-learning-based vocoder.

May 08 '21 07:05 Liujingxiu23

My experiments showed that in a multi-speaker scenario the phoneme-level mel encoder encodes too much information. As a consequence, if the phoneme-level predictor is not capable enough, performance drops a lot.

Mar 03 '22 05:03 MMingabc
