
Synthesis: VITS voices have various issues related to model training

Open rotemdan opened this issue 2 years ago • 0 comments

For example, when the default English voice (Amy / Low) is given a single-word utterance such as "two", it seems to mispronounce it as something that sounds closer to "ten". Other voices have much more serious issues. The Greek voice, for instance, may produce bizarre, nonsensical utterances when given English text, most likely because it wasn't trained on English, or on Latin characters in general, and doesn't know what to do with them.

These are issues with how the models were trained, not with the Echogarden code itself.

These models are trained as part of the Piper speech system, mostly by Michael Hansen. You can check out the Piper issue tracker to give feedback on these sorts of problems.

Echogarden doesn't actually use the Piper system, but reimplements it in JavaScript, with several enhancements that are not present in the original C++ code. Only the ONNX models are shared.

The original ONNX models are published on the piper-voices Hugging Face repository. I repackage them as tar.gz archives and upload them to the echogarden-packages Hugging Face repository, from which they (and all other packages) are downloaded when needed.
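To illustrate the "downloaded when needed" idea, here is a minimal TypeScript sketch of on-demand package loading. It is not Echogarden's actual code: `Downloader` and `createPackageLoader` are hypothetical names, the download step is injected rather than pointed at any real URL, and the cache lives only in memory, whereas a real implementation would more likely store the extracted archive on disk.

```typescript
// Hypothetical type: fetches a package archive (e.g. a tar.gz) by name.
type Downloader = (packageName: string) => Promise<Uint8Array>;

// Returns a loader that downloads each package at most once,
// and only when it is first requested.
function createPackageLoader(
  download: Downloader
): (packageName: string) => Promise<Uint8Array> {
  // Maps a package name to its in-flight or completed download.
  const archives = new Map<string, Promise<Uint8Array>>();

  return (packageName) => {
    let archive = archives.get(packageName);
    if (archive === undefined) {
      // First request for this package: start the download and remember it.
      archive = download(packageName);
      archives.set(packageName, archive);
    }
    return archive;
  };
}
```

Storing the promise rather than the resolved bytes means that two requests for the same voice made in quick succession share a single download.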

rotemdan · Jul 28 '23 18:07