efficient_tts
Reproducing good results (as claimed in paper)
This is somewhat related to issue #2, which was closed, but I think it's safe to say that the latest posted samples are not close to the strong results claimed by the paper's authors, and it would be good to have an issue tracking speech quality.
This is somewhat puzzling, given that the implementation seems to be on point apart from the missing sigma hyperparameter values that you mentioned. I'm running my own experiments with different hyperparameters but so far haven't achieved anything very competitive. If you have any ideas about what else could be tried, let me know.
Thank you very much for the attention, and sorry for the late reply. I contacted the authors of the paper and would like to post their reply here for your reference:
- The sigmas in Equations 14 and 17 are 0.2 and 0.1, respectively.
- The text encoder does not have two output streams, i.e., key = value.
- The hidden dimensionalities of the position predictor are 384 and 256.
- Input text sequences have <space> as leading and trailing tokens.
- The authors use a dropout rate of 0.2.
- LeakyReLU has a negative slope of 0.2.
However, the generated samples uploaded in this repo are the best ones I have obtained (the yaml config file is in the egs folder). I hope this helps us obtain better results.
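For anyone trying to reproduce this, here is a minimal sketch of how those numbers might map onto code. Only the constants (the two sigmas, the 384/256 hidden sizes, dropout 0.2, LeakyReLU slope 0.2) come from the authors' reply; the module name, layer count, kernel sizes, and dropout placement are assumptions, not the authors' actual implementation.

```python
import torch.nn as nn

# Constants reported by the authors; everything structural below is assumed.
SIGMA_EQ14 = 0.2     # sigma in Eq. 14 (enters the alignment equations, not this module)
SIGMA_EQ17 = 0.1     # sigma in Eq. 17
DROPOUT_RATE = 0.2
LRELU_SLOPE = 0.2


class AlignedPositionPredictor(nn.Module):
    """Hypothetical position predictor with hidden sizes 384 -> 256.

    The reply only gives the two hidden dimensionalities; the kernel size,
    number of layers, and where dropout is applied are guesses.
    """

    def __init__(self, in_dim=512, hidden_dims=(384, 256)):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden_dims:
            layers += [
                nn.Conv1d(prev, h, kernel_size=3, padding=1),
                nn.LeakyReLU(LRELU_SLOPE),
                nn.Dropout(DROPOUT_RATE),
            ]
            prev = h
        layers.append(nn.Conv1d(prev, 1, kernel_size=1))  # one scalar position per step
        self.net = nn.Sequential(*layers)

    def forward(self, x):              # x: (B, in_dim, T)
        return self.net(x).squeeze(1)  # (B, T)
```

Likewise, "key = value" just means the text encoder emits a single hidden sequence that is reused for both attention streams, and the <space> note means a space token is added at the start and end of the input sequence before it is fed to the model.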
I've been trying these values without much luck. Do we know where the authors used dropout? Perhaps dropout was used only in some of the layers of some of the components.
I get similar results; there may be some other tricks that are not described in the paper.
@liusongxiang Did you train on the Biaobei dataset using the same config as ./egs/lj/conf/efficient_tts_cnn_phnseq_noDropout.v1.yaml?
@Liujingxiu23 Yes, exactly.
@Liujingxiu23 Thanks for your attention. I haven't tried end-to-end training yet since I have been busy with other things. If you are interested, I think you could try combining this repo with the ParallelWaveGAN repo.
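For anyone who wants to try that, a rough sketch of chaining the predicted mels into a pretrained ParallelWaveGAN vocoder could look like the following. The acoustic model and the mel layout are placeholders; the `load_model` / `remove_weight_norm` / `inference` calls mirror the decoding utilities in the ParallelWaveGAN repo. True end-to-end training would additionally require back-propagating through the vocoder and adding its adversarial losses, which this sketch does not cover.

```python
import torch
from parallel_wavegan.utils import load_model  # helper from the ParallelWaveGAN repo


def synthesize(acoustic_model, text_ids, pwg_checkpoint="pwg_checkpoint.pkl"):
    """Chain a (hypothetical) text-to-mel model with a pretrained PWG vocoder.

    `acoustic_model` is assumed to return a mel spectrogram of shape
    (T_mel, n_mels) matching the features the vocoder was trained on.
    """
    vocoder = load_model(pwg_checkpoint)
    vocoder.remove_weight_norm()
    vocoder.eval()
    with torch.no_grad():
        mel = acoustic_model(text_ids)         # assumed shape: (T_mel, n_mels)
        wav = vocoder.inference(mel).view(-1)  # 1-D waveform tensor
    return wav
```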
@liusongxiang I see you implemented two delta_e prediction methods; which one do you use, delta_e_method_1 or the other one?
Hi all, all the parameters are as shown in efficient_tts/egs/lj/conf/efficient_tts_cnn_phnseq_noDropout.v1.yaml. I have tried other settings, but this one seems to work best for both the LJSpeech and Biaobei datasets.