sherpa-onnx Incorrect phoneme handling (Kokoro-TTS)

Open • nubloso opened this issue 8 months ago • 1 comment

I've noticed that the Hugging Face Kokoro-TTS hosted on Spaces handles phonemes exceptionally well, distinguishing between cases like:

"read" (past vs. present tense)
"a project" vs. "to project"

However, the Sherpa-ONNX version does not seem to exhibit the same level of phoneme accuracy. The regular Kokoro-TTS uses Misaki G2P, but I’m unsure how phoneme generation is handled in Sherpa-ONNX or why the results differ.

For reference, I'm implementing this in Flutter and using the following model:
➡️ kokoro-multi-lang-v1_0.tar.bz2

Questions:

Is there a way to enable Misaki G2P in Sherpa-ONNX?
If not, what method does Sherpa-ONNX use for phoneme generation?
Since correct pronunciation depends on context, how can I achieve better phoneme accuracy? The lexicon file alone doesn’t seem sufficient.
Could you clarify how the gold-silver-bronze ranking system is implemented (if at all) in this model?
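
To make the question about context-dependent pronunciation concrete, here is a minimal Python sketch of why a flat word-to-phoneme lexicon cannot separate the cases above. It is a toy under stated assumptions: the dictionary names, the part-of-speech tags, and the tag-keyed layout are all hypothetical, and it does not reflect how Misaki or Sherpa-ONNX actually work internally.

```python
# Toy illustration only. The entries and the tag-keyed layout are hypothetical;
# this is not the format of the lexicon file shipped with the model.

# A flat lexicon maps each spelling to exactly one pronunciation,
# so "read" and "project" can each be spoken only one way.
FLAT_LEXICON = {
    "read": "ɹiːd",
    "project": "ˈpɹɑːdʒɛkt",
}

# A context-aware lookup needs extra information, here a part-of-speech tag
# that some upstream tagger (not shown) would have to provide.
TAGGED_LEXICON = {
    ("read", "VB"): "ɹiːd",            # present tense
    ("read", "VBD"): "ɹɛd",            # past tense
    ("project", "NN"): "ˈpɹɑːdʒɛkt",   # "a project"
    ("project", "VB"): "pɹəˈdʒɛkt",    # "to project"
}


def flat_g2p(word):
    """Look up a word with no context; heteronyms collapse to one reading."""
    return FLAT_LEXICON[word.lower()]


def tagged_g2p(word, tag):
    """Prefer the (word, tag) entry and fall back to the flat lexicon."""
    key = (word.lower(), tag)
    if key in TAGGED_LEXICON:
        return TAGGED_LEXICON[key]
    return FLAT_LEXICON[word.lower()]
```

With this layout, `tagged_g2p("read", "VBD")` and `tagged_g2p("read", "VB")` return different phoneme strings, while `flat_g2p("read")` always returns the same one regardless of the sentence.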

Any insights would be greatly appreciated!

Mar 13 '25 11:03 nubloso
