NATSpeech

How does accuracy compare with VITS?

Open lucasjinreal opened this issue 3 years ago • 3 comments

How does the accuracy compare with VITS? Is it faster and more accurate?

lucasjinreal, Feb 14 '22 05:02

We didn't compare quantitatively with VITS in our paper. Perceptually, the sound quality and prosody of PortaSpeech and DiffSpeech are similar to those of VITS. The differences between our models and VITS are:

  • Our models are acoustic models, which need an external vocoder (HiFi-GAN in our case) to build a TTS system, while VITS is an end-to-end TTS model.
  • Our models can train faster than an end-to-end TTS model because the vocoder can be kept fixed, which also makes model tuning easier.
  • End-to-end TTS models do not need to consider the quality gap between the generated and ground-truth intermediate features (mel-spectrogram), which avoids the error propagation problem.

So e2e and non-e2e TTS models suit different usage scenarios.
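
To make the two-stage structure concrete, here is a rough sketch of a non-e2e pipeline where the acoustic model and the vocoder are separate, swappable parts. Everything in it is a toy stand-in: `toy_acoustic_model`, `toy_vocoder_a`, `toy_vocoder_b`, and `synthesize` are hypothetical names for illustration only, not the NATSpeech API, and the "models" just produce dummy numbers.

```python
from typing import Callable, List

Mel = List[List[float]]  # frames x mel bins (toy representation)
Wave = List[float]       # audio samples (toy representation)


def toy_acoustic_model(text: str, n_mels: int = 4) -> Mel:
    """Hypothetical stand-in for an acoustic model: text -> mel frames."""
    # One dummy frame per character; a real model predicts durations and spectra.
    return [[(ord(ch) % 32) / 32.0] * n_mels for ch in text]


def toy_vocoder_a(mel: Mel, hop: int = 2) -> Wave:
    """Hypothetical stand-in for one vocoder: mel frames -> samples."""
    # Emit `hop` samples per frame, each the mean of the frame's bins.
    return [sum(frame) / len(frame) for frame in mel for _ in range(hop)]


def toy_vocoder_b(mel: Mel, hop: int = 2) -> Wave:
    """Hypothetical stand-in for a second, lighter vocoder."""
    # Emit `hop` samples per frame, using only the first bin.
    return [frame[0] for frame in mel for _ in range(hop)]


def synthesize(
    text: str,
    acoustic_model: Callable[[str], Mel],
    vocoder: Callable[[Mel], Wave],
) -> Wave:
    """Two-stage TTS: the mel-spectrogram is the intermediate feature."""
    mel = acoustic_model(text)  # stage 1: text -> mel
    return vocoder(mel)         # stage 2: mel -> waveform


# The acoustic model stays the same; only the vocoder is swapped.
wave_a = synthesize("hi", toy_acoustic_model, toy_vocoder_a)
wave_b = synthesize("hi", toy_acoustic_model, toy_vocoder_b)
```

An e2e model would collapse both stages into a single text-to-waveform function, so there is no intermediate mel to mismatch, but also no separate vocoder to keep fixed or replace.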

RayeRen, Feb 14 '22 06:02

I think PortaSpeech + MB MelGAN can work on CPU very well, while VITS needs a GPU to work.

MaxMax2016, Feb 14 '22 06:02

I think PortaSpeech + MB MelGAN can work on CPU very well, while VITS needs a GPU to work.

Yes, you can use other vocoders with PortaSpeech.

RayeRen, Feb 14 '22 06:02