NATSpeech accuracy compares with VITS?

Open lucasjinreal opened this issue 3 years ago • 3 comments

How does the accuracy compare with VITS? Is it faster and more accurate?

Feb 14 '22 05:02 lucasjinreal

We didn't compare with VITS quantitatively in our paper. Perceptually, the sound quality and prosody of PortaSpeech and DiffSpeech are similar to those of VITS. The differences between our models and VITS are:

Our models are acoustic models, which need an external vocoder (HiFi-GAN in our case) to build a complete TTS system, while VITS is an end-to-end TTS model.
Our models can train faster than end-to-end TTS models because the vocoder can be kept fixed, which also makes model tuning easier.
End-to-end TTS models do not need to handle the quality gap between generated and ground-truth intermediate features (mel-spectrograms), which avoids the error propagation problem.

So e2e and non-e2e TTS models suit different usage scenarios.
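To make the structural difference concrete, here is a minimal sketch of the two designs. Everything in it is a toy stand-in written for illustration: `toy_acoustic_model`, `toy_vocoder_a`, `toy_vocoder_b`, `toy_end_to_end`, `TwoStageTTS`, and the `N_MELS` and `HOP` values are hypothetical names and assumptions, not the NATSpeech or VITS APIs. It only shows that a two-stage system exposes a mel-spectrogram between the stages, so the vocoder can be swapped without touching the acoustic model.

```python
from dataclasses import dataclass
from typing import Callable, List

Mel = List[List[float]]   # frames x mel bins
Wave = List[float]        # audio samples

N_MELS = 80   # assumed number of mel bins, for illustration only
HOP = 256     # assumed samples per mel frame, for illustration only


def toy_acoustic_model(text: str) -> Mel:
    """Hypothetical stand-in for an acoustic model: text -> mel frames."""
    # One fake frame per character; a real model predicts durations and spectra.
    return [[float(ord(ch) % 7)] * N_MELS for ch in text]


def toy_vocoder_a(mel: Mel) -> Wave:
    """Hypothetical stand-in for one vocoder: mel frames -> samples."""
    return [frame[0] / 7.0 for frame in mel for _ in range(HOP)]


def toy_vocoder_b(mel: Mel) -> Wave:
    """Hypothetical stand-in for a second, lighter vocoder."""
    return [0.5 * frame[0] / 7.0 for frame in mel for _ in range(HOP)]


@dataclass
class TwoStageTTS:
    """Acoustic model plus a separately chosen vocoder."""
    acoustic_model: Callable[[str], Mel]
    vocoder: Callable[[Mel], Wave]

    def synthesize(self, text: str) -> Wave:
        mel = self.acoustic_model(text)   # stage 1: explicit mel intermediate
        return self.vocoder(mel)          # stage 2: vocoder is swappable


def toy_end_to_end(text: str) -> Wave:
    """Hypothetical stand-in for an end-to-end model: no exposed mel stage."""
    return [float(ord(ch) % 7) / 7.0 for ch in text for _ in range(HOP)]


# Same acoustic model, two different vocoders.
tts_a = TwoStageTTS(toy_acoustic_model, toy_vocoder_a)
tts_b = TwoStageTTS(toy_acoustic_model, toy_vocoder_b)
wave_a = tts_a.synthesize("hello")
wave_b = tts_b.synthesize("hello")

# The end-to-end path goes straight from text to samples.
wave_e2e = toy_end_to_end("hello")
print(len(wave_a), len(wave_b), len(wave_e2e))
```

In the two-stage case the vocoder is just a parameter, which is why it can be kept fixed during acoustic-model training or replaced at inference time. The end-to-end function has no such seam, but it also has no predicted-versus-ground-truth mel mismatch to worry about.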

Feb 14 '22 06:02 RayeRen

I think PortaSpeech + MB MelGAN can run on CPU very well, whereas VITS needs a GPU.

Feb 14 '22 06:02 MaxMax2016

> I think PortaSpeech + MB MelGAN can run on CPU very well, whereas VITS needs a GPU.

Yes, you can use other vocoders with PortaSpeech.

Feb 14 '22 06:02 RayeRen
