Incorrect phoneme handling (Kokoro-TTS)

Open · nubloso opened this issue 8 months ago · 1 comment

I've noticed that the Kokoro-TTS demo hosted on Hugging Face Spaces handles phonemes exceptionally well, pronouncing context-dependent words (homographs) correctly in cases like:

  • "read" (past vs. present tense)
  • "a project" vs. "to project"

However, the Sherpa-ONNX version does not reach the same level of phoneme accuracy on these cases. The regular Kokoro-TTS uses Misaki G2P, but I'm unsure how phoneme generation is handled in Sherpa-ONNX or why the results differ.

For reference, I'm implementing this in Flutter and using the following model:
➡️ kokoro-multi-lang-v1_0.tar.bz2

Questions:

  1. Is there a way to enable Misaki G2P in Sherpa-ONNX?
  2. If not, what method does Sherpa-ONNX use for phoneme generation?
  3. Since correct pronunciation depends on context, how can I achieve better phoneme accuracy? The lexicon file alone doesn’t seem sufficient.
  4. Could you clarify how the gold-silver-bronze ranking system is implemented (if at all) in this model?
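
To illustrate what I mean in question 3, here is a minimal sketch of why a plain word-to-phoneme lookup cannot separate these cases. It is a toy example, not how Sherpa-ONNX or Misaki actually work: the table names, the part-of-speech tags, and the phoneme strings are all my own assumptions for illustration.

```python
# Toy illustration only. FLAT_LEXICON, TAGGED_LEXICON, the tags, and the
# phoneme strings are hypothetical; they are not taken from Sherpa-ONNX,
# Misaki, or the kokoro-multi-lang-v1_0 lexicon file.

# Context-free lexicon: exactly one pronunciation per spelling.
FLAT_LEXICON = {
    "read": "ɹiːd",
    "project": "ˈpɹɑdʒɛkt",
}

# Context-aware table: pronunciation keyed by (word, part-of-speech tag).
TAGGED_LEXICON = {
    ("read", "VB"): "ɹiːd",           # present tense
    ("read", "VBD"): "ɹɛd",           # past tense
    ("project", "NN"): "ˈpɹɑdʒɛkt",   # "a project"
    ("project", "VB"): "pɹəˈdʒɛkt",   # "to project"
}


def flat_lookup(word: str) -> str:
    """Return the single pronunciation stored for this spelling."""
    return FLAT_LEXICON[word.lower()]


def tagged_lookup(word: str, tag: str) -> str:
    """Prefer the tag-specific entry; fall back to the flat lexicon."""
    key = (word.lower(), tag)
    return TAGGED_LEXICON.get(key, FLAT_LEXICON[word.lower()])


if __name__ == "__main__":
    # The flat lookup gives the same phonemes whatever the sentence says.
    print(flat_lookup("read"), flat_lookup("project"))
    # The tagged lookup can differ, but only if something supplies the tag.
    print(tagged_lookup("read", "VBD"), tagged_lookup("project", "VB"))
```

The tagged lookup only helps if some earlier step supplies the tag for each word, which is the piece I don't know how to get with the lexicon file alone.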

Any insights would be greatly appreciated!

nubloso · Mar 13 '25 11:03