NATSpeech
How does accuracy compare with VITS?
How does accuracy compare with VITS? Is it faster and more accurate?
We didn't compare with VITS quantitatively in our paper. Perceptually, the sound quality and prosody of PortaSpeech and DiffSpeech are similar to those of VITS. The differences between our models and VITS are:
- Our models are acoustic models, which need an external vocoder (HiFi-GAN in our case) to build a TTS system, while VITS is an end-to-end TTS model.
- Our models can train faster than an end-to-end TTS model because the vocoder can be kept fixed, which also makes model tuning easier.
- End-to-end TTS models do not need to consider the quality gap between the generated and ground-truth intermediate features (mel-spectrogram), which avoids the error propagation problem.
So e2e and non-e2e TTS models have different usage scenarios.
I think PortaSpeech + MB-MelGAN can work very well on CPU, while VITS needs a GPU to work.
Yes, you can use other vocoders with PortaSpeech.
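As a rough illustration of why the vocoder is swappable in a non-e2e system, here is a minimal sketch of the two-stage pipeline. All class and function names (`DummyAcousticModel`, `DummyVocoder`, `synthesize`) are hypothetical placeholders and not NATSpeech's actual API; the dummy models only produce zeros of the right shape.

```python
from typing import List, Protocol

Mel = List[List[float]]  # frames x mel bins
Wave = List[float]       # raw audio samples


class Vocoder(Protocol):
    """Anything that turns a mel-spectrogram into a waveform."""

    def mel_to_wave(self, mel: Mel) -> Wave: ...


class DummyAcousticModel:
    """Hypothetical stand-in for an acoustic model such as PortaSpeech."""

    def __init__(self, n_mels: int = 80, frames_per_char: int = 2) -> None:
        self.n_mels = n_mels
        self.frames_per_char = frames_per_char

    def text_to_mel(self, text: str) -> Mel:
        # A real model would predict durations and spectral content;
        # here we only emit a zero-filled mel of a plausible shape.
        n_frames = len(text) * self.frames_per_char
        return [[0.0] * self.n_mels for _ in range(n_frames)]


class DummyVocoder:
    """Hypothetical stand-in for a vocoder such as HiFi-GAN or MB-MelGAN."""

    def __init__(self, hop_size: int) -> None:
        self.hop_size = hop_size

    def mel_to_wave(self, mel: Mel) -> Wave:
        # Each mel frame expands to hop_size audio samples.
        return [0.0] * (len(mel) * self.hop_size)


def synthesize(text: str, acoustic_model: DummyAcousticModel, vocoder: Vocoder) -> Wave:
    # Stage 1: text -> mel-spectrogram (acoustic model).
    mel = acoustic_model.text_to_mel(text)
    # Stage 2: mel-spectrogram -> waveform (any compatible vocoder).
    return vocoder.mel_to_wave(mel)


acoustic = DummyAcousticModel()
wave_a = synthesize("hi", acoustic, DummyVocoder(hop_size=256))
wave_b = synthesize("hi", acoustic, DummyVocoder(hop_size=128))
```

The acoustic model stays the same in both calls; only the second stage changes. In practice, a replacement vocoder has to be trained on mel-spectrograms with the same configuration (number of bins, hop size, sample rate) that the acoustic model produces.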