
Reproducing good results (as claimed in paper)

Open ctlaltdefeat opened this issue 3 years ago • 8 comments

This is somewhat related to issue #2, which was closed. I think it's safe to say that the latest posted samples are not close to the strong results claimed by the paper's authors, so it would be good to have an issue tracking speech quality.

It's somewhat puzzling, given that the implementation seems to be on point except for the missing sigma hyperparameter values that you mentioned. I'm running my own experiments with the hyperparameters but so far haven't achieved anything competitive. If you have any ideas about what could be tried, let me know.

ctlaltdefeat, Dec 25 '20 03:12

Thank you very much for your attention, and sorry for the late reply. I contacted the authors of the paper. I'd like to post their reply here for your reference:

  • Sigmas in Equation 14 and 17 are 0.2 and 0.1, respectively
  • Text encoder does not have two output streams, i.e., key = value.
  • Hidden dimensionalities of the position predictor are 384 and 256.
  • Input text sequences have <space> as leading and trailing tokens.
  • The authors use a dropout rate of 0.2.
  • LeakyReLU has negative slope 0.2.
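
For anyone trying these, here is a minimal sketch that collects the reported values in one place. The class name, field names, and helper functions are hypothetical and do not match this repo's YAML keys or code; the sketch only records the numbers above and illustrates the <space> padding and the LeakyReLU slope. Where the dropout is applied is not specified in the authors' reply, so it is stored only as a number.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReportedHParams:
    """Values from the authors' reply (field names are hypothetical)."""
    sigma_eq14: float = 0.2               # sigma in Equation 14 of the paper
    sigma_eq17: float = 0.1               # sigma in Equation 17 of the paper
    shared_key_value: bool = True         # text encoder has one stream: key = value
    position_predictor_hidden: Tuple[int, int] = (384, 256)
    dropout_rate: float = 0.2             # placement of dropout is not specified
    leaky_relu_negative_slope: float = 0.2
    boundary_token: str = "<space>"       # leading and trailing token


def pad_with_space(tokens: List[str], hp: ReportedHParams) -> List[str]:
    """Add the boundary token at both ends of a token sequence."""
    return [hp.boundary_token, *tokens, hp.boundary_token]


def leaky_relu(x: float, negative_slope: float) -> float:
    """Scalar LeakyReLU: identity for x >= 0, scaled by the slope otherwise."""
    return x if x >= 0 else negative_slope * x


hp = ReportedHParams()
padded = pad_with_space(["HH", "AY"], hp)
```

Keeping the values in a single frozen object makes it easier to compare a run's settings against what the authors reported.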

However, the generated samples uploaded in this repo are the best ones I have obtained (the YAML config file is in the egs folder). I hope this helps us obtain better results.

liusongxiang, Jan 07 '21 10:01

I've been trying these values without much luck. Do we know where the authors used dropout? Perhaps dropout was used only in some of the layers of some of the components.

ctlaltdefeat, Jan 09 '21 19:01

> I've been trying these values without much luck. Do we know where the authors used dropout? Perhaps dropout was used only in some of the layers of some of the components.

I got similar results. There may be some other tricks that are not described in the paper.

attitudechunfeng, Jan 10 '21 05:01

@liusongxiang Did you train on the Biaobei dataset using the same config as ./egs/lj/conf/efficient_tts_cnn_phnseq_noDropout.v1.yaml?

Liujingxiu23, Feb 25 '21 09:02

@Liujingxiu23 Yes, exactly.

liusongxiang, Feb 25 '21 11:02

@Liujingxiu23 Thanks for your attention. I haven't tried end-to-end training yet, since I have been busy with other things. If you are interested, I think you could try it by combining this repo with the ParallelWaveGAN repo.

liusongxiang, Feb 26 '21 01:02

@liusongxiang I see you implemented two delta_e prediction methods. Which one do you use: delta_e_method_1 or the other one?

attitudechunfeng, Mar 02 '21 12:03

Hi all, all the parameters are as shown in efficient_tts/egs/lj/conf/efficient_tts_cnn_phnseq_noDropout.v1.yaml. I have tried other settings, but this seems to be the best one for both the LJSpeech and Biaobei datasets.

liusongxiang, Mar 02 '21 12:03