sherpa-onnx ZipVoice pronounces "3" as "2"

Open geniusnut opened this issue 1 month ago • 3 comments

Using the python-api-example

python3 ./python-api-examples/offline-zeroshot-tts.py \
  --zipvoice-flow-matching-model sherpa-onnx-zipvoice-distill-zh-en-emilia/fm_decoder.onnx \
  --zipvoice-text-model sherpa-onnx-zipvoice-distill-zh-en-emilia/text_encoder.onnx \
  --zipvoice-data-dir sherpa-onnx-zipvoice-distill-zh-en-emilia/espeak-ng-data \
  --zipvoice-pinyin-dict sherpa-onnx-zipvoice-distill-zh-en-emilia/pinyin.raw \
  --zipvoice-tokens sherpa-onnx-zipvoice-distill-zh-en-emilia/tokens.txt \
  --zipvoice-vocoder sherpa-onnx-zipvoice-distill-zh-en-emilia/vocos_24khz.onnx \
  --prompt-audio sherpa-onnx-zipvoice-distill-zh-en-emilia/prompt.wav \
  --zipvoice-num-steps 4 \
  --num-threads 4 \
  --prompt-text "周日被我射熄火了,所以今天是周一。" \
  "除了中文数据集和3小时的英语之外,其余数据都留在本次训练中。目标是推动模型系列的发展,并最终恢复一些被遗留的声音。"

The "3" in "3小时" ("3 hours") is pronounced as "2". The same text is pronounced correctly by the ZipVoice demo, python3 -m zipvoice.bin.infer_zipvoice_onnx .... I suspect the tokens differ between the two.

geniusnut avatar Oct 23 '25 09:10 geniusnut

Hi @geniusnut! I'm Dosu and I’m helping the sherpa-onnx team.

This kind of numeral mispronunciation is often caused by differences in tokenization or text normalization between scripts. Sherpa-onnx TTS models rely on tokens.txt and rule-based FSTs (like number.fst) for number normalization, which directly affects how numerals are pronounced. If the tokens.txt or number.fst used by the python-api-example script differs from the one used in the zipvoice demo, or if the script does not apply number normalization in the same way, you can get issues like "3" being pronounced as "2".

Check that both scripts use the exact same tokens.txt and number.fst files, and that the python-api-example script is configured to use number.fst for rule-based normalization. If the python-api-example does not have a --tts-rule-fsts or similar argument for number.fst, or if the tokens.txt files differ, aligning these should resolve the issue.

If you generated tokens.txt from different configs or model versions, that can also cause mismatches. Double-check that the tokens.txt and number.fst are identical between both scripts and that number normalization is enabled in the python-api-example.
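
One way to run that check on tokens.txt is sketched below. It is only an illustration: it assumes each line of tokens.txt is a token followed by a tab or space and an integer id, which may not match every model's file, and the helper names (file_sha256, load_tokens, diff_tokens) are made up for this example and are not part of sherpa-onnx or ZipVoice.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_tokens(path):
    """Parse a tokens.txt-style file into {token: id}.

    Assumption: each line is "<token><tab or space><integer id>".
    """
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        sep = "\t" if "\t" in line else " "
        token, _, idx = line.rpartition(sep)
        table[token] = int(idx)
    return table


def diff_tokens(path_a, path_b):
    """Return {token: (id_in_a, id_in_b)} for every token that differs.

    A token missing from one file is reported with None on that side.
    """
    a, b = load_tokens(path_a), load_tokens(path_b)
    return {
        t: (a.get(t), b.get(t))
        for t in sorted(set(a) | set(b))
        if a.get(t) != b.get(t)
    }
```

If file_sha256 gives the same digest for both files, they are byte-identical and the token table can be ruled out; otherwise diff_tokens lists the entries to look at.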


dosubot[bot] avatar Oct 23 '25 09:10 dosubot[bot]

@pkufool Please have a look

csukuangfj avatar Oct 24 '25 07:10 csukuangfj

@geniusnut The frontend may be a little different: sherpa-onnx does less text normalization than the Python script in ZipVoice. I will have a look at this case.
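
Until the frontends are aligned, a possible workaround is to spell out digits before the text reaches the TTS script. The sketch below is an assumption-laden illustration, not the normalization ZipVoice itself applies: the function names are hypothetical, it only handles plain integers from 0 to 9999, and it ignores decimals, dates, and the 二/两 distinction. Whether it avoids the mispronunciation reported here has not been verified.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # units for ones, tens, hundreds, thousands


def int_to_chinese(n):
    """Spell out an integer in the range 0..9999 in Chinese numerals."""
    if not 0 <= n < 10000:
        raise ValueError("only 0..9999 is supported in this sketch")
    if n == 0:
        return "零"
    s = str(n)
    out = []
    pending_zero = False
    for i, ch in enumerate(s):
        d = int(ch)
        pos = len(s) - 1 - i  # 0 = ones, 1 = tens, ...
        if d == 0:
            pending_zero = True  # emit one 零 only if a nonzero digit follows
            continue
        if pending_zero:
            out.append("零")
        pending_zero = False
        out.append(DIGITS[d] + UNITS[pos])
    text = "".join(out)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    if text.startswith("一十"):
        text = text[1:]
    return text


def normalize_digits(text):
    """Replace runs of 1 to 4 ASCII digits with Chinese numerals.

    Longer digit runs are left unchanged.
    """
    def repl(m):
        run = m.group()
        return int_to_chinese(int(run)) if len(run) <= 4 else run

    return re.sub(r"[0-9]+", repl, text)
```

For the sentence in this issue, normalize_digits turns "3小时" into "三小时", so the model receives Chinese characters instead of an Arabic numeral.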

pkufool avatar Oct 28 '25 14:10 pkufool