Queue lines for faster processing
The main idea is to process the next line while the current line is being spoken.
I have started using SherpaTTS as my TTS provider. It's a local AI model for TTS that sounds a lot better than RHVoice or eSpeak NG. It can take a moment to start reading, but the way that your app splits the input by lines or sentences is really helpful for getting it started quickly, since it only has to process the first sentence rather than the whole text.
However, when using an AI model like this, each sentence takes a moment to process, which leaves a few seconds of pause between sentences while the model works. It would be nice if there were a sort of "read ahead" or "line queue" option that allowed processing one or more sentences ahead of the text currently being spoken.
I imagine a feature like this would work as follows. Say you have the Queue set to 1. You send a paragraph to TTS Util, and it instructs the TTS model to begin processing the first sentence. When the model is ready, TTS Util begins playing the model's output, and at this point it sends the second sentence to the model, storing that output to be played when the first sentence finishes. Once the second sentence is being spoken, it sends the third for processing, and so on.
Raising the Queue number would increase the number of sentences it "reads ahead", while still only ever processing one sentence at a time.
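The scheme described above amounts to a bounded producer-consumer pipeline: a worker synthesises up to N sentences ahead while playback consumes them in order. A minimal sketch in Java, with synthesis and playback stubbed out; the names here are illustrative, not TTS Util code:

```java
import java.util.*;
import java.util.concurrent.*;

public class ReadAheadDemo {

    // Stand-in for the TTS engine: turns a sentence into "audio".
    static String synthesize(String sentence) {
        return "wav:" + sentence;
    }

    // Speak sentences with a bounded read-ahead buffer. readAhead (>= 1)
    // is how many synthesised-but-unplayed clips may exist at once,
    // i.e. the proposed "Queue" setting.
    static List<String> speak(List<String> sentences, int readAhead) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(readAhead);
        Thread producer = new Thread(() -> {
            for (String s : sentences) {
                try {
                    buffer.put(synthesize(s)); // blocks while the buffer is full
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        producer.start();
        List<String> played = new ArrayList<>();
        try {
            for (int i = 0; i < sentences.size(); i++) {
                // "Play" the next clip; later sentences keep synthesising
                // in the producer thread meanwhile.
                played.add(buffer.take());
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return played;
    }

    public static void main(String[] args) {
        System.out.println(speak(List.of("One.", "Two.", "Three."), 1));
    }
}
```

Because the buffer is a FIFO queue, playback order always matches sentence order regardless of how far ahead synthesis runs.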
Sorry for such a long request. I hope something like this is possible, as it would make the more natural-sounding local offline AI models more practical as a privacy-respecting replacement for Google or other TTS engines.
Hello Gymcap,
Thank you for opening this issue. I am not familiar with SherpaTTS, but your idea sounds good to me.
One obstacle I can see is a technical limitation of the Android text-to-speech API. Namely, there is no appropriate method by which to have the connected engine pre-process input text, so that it is ready in advance, without also speaking it when it reaches the head of the queue, typically right away.
The best way around this, in my opinion, is to synthesise wave files for TTS Util to play directly instead, when appropriate. So in this case one wave file per sentence, though that depends on TTS Util's "Silent Utterances" settings. I've been wanting to do this anyway, for greater user playback control, e.g. play/pause, rewind and fast-forward controls (issue #29).
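For reference, the Android API this workaround would rest on is TextToSpeech.synthesizeToFile() together with an UtteranceProgressListener. A rough, non-runnable sketch of the flow, where onSegmentReady is a hypothetical hook rather than real TTS Util code:

```java
// Sketch only: requires the Android framework classes
// android.speech.tts.TextToSpeech and UtteranceProgressListener.
File wav = new File(context.getCacheDir(), "segment-" + index + ".wav");
tts.setOnUtteranceProgressListener(new UtteranceProgressListener() {
    @Override public void onStart(String utteranceId) {}
    @Override public void onError(String utteranceId) {}
    @Override public void onDone(String utteranceId) {
        // The wave file is complete; hand it to the playback queue.
        onSegmentReady(utteranceId, wav);  // hypothetical hook
    }
});
// Renders the sentence to `wav` without speaking it aloud.
tts.synthesizeToFile(sentence, Bundle.EMPTY, wav, "segment-" + index);
```

Playing the finished files back with a regular media player is what would enable the pause, rewind and fast-forward controls mentioned above.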
So I will look into a new read-ahead setting for a future version of TTS Util. Probably version 5.0. A read-ahead setting of 0-5 text segments (lines, sentences, etc.) seems sensible to me.
On Sat, 31 May 2025 02:57:44 -0700 Gymcap @.***> wrote:
Gymcap created an issue (drmfinlay/tts-util-app#46)
https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
Can you reproduce the long pauses issue using the APK from us (available from the above link)?
Yes. I get the long pauses with the first APK (sherpa-onnx-1.12.3-arm64-v8a-en-tts-engine-kokoro-en-v0_19.apk) installed on an Android 15 device using your sample text: "How are you doing? This is a text-to-speech engine using next generation Kaldi".
And as I wrote in the other issue (#49), these pauses do not occur in wave files synthesised by your engine.
Can you try a Piper model?
Kokoro is super slow compared with Piper TTS models.
You can verify that by looking at the RTF.
Okay, I have tried that, and the pauses do not occur with the vits-piper-en_US-kusal-medium APK. I don't know what you mean by the RTF. As for the wave files, these pauses must be occurring during the file synthesis process, which does take a long time.
I'll recommend the piper models to TTS Util users in the project documentation and then close these issues. Thank you for your help, @csukuangfj.
In that case, the pauses are caused by using a large model with a slower CPU.
If it takes 1 second to synthesize a 10-second audio wave, we say its RTF (real-time factor) is 1/10 = 0.1.
The lower the RTF, the faster the model.
So if you know the RTF is 0.1 and you want to synthesize a 5-second-long audio clip, it would take 0.1 * 5 = 0.5 seconds.
If the RTF for a model is quite high, you would see the pauses.
If you use the APKs from sherpa-onnx, you should see the RTF printed on the user interface once you run it.
RTF is model-dependent and also device-dependent.
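Put as code, the relationship described above looks like the following; this is a small self-contained illustration, not part of sherpa-onnx:

```java
public class RtfExample {

    // RTF = seconds spent synthesising / seconds of audio produced.
    static double rtf(double synthesisSeconds, double audioSeconds) {
        return synthesisSeconds / audioSeconds;
    }

    // Predicted synthesis time for a clip of the given length.
    static double estimateSynthesisSeconds(double rtf, double audioSeconds) {
        return rtf * audioSeconds;
    }

    public static void main(String[] args) {
        double r = rtf(1.0, 10.0); // 1 second to make 10 seconds of audio
        System.out.println(estimateSynthesisSeconds(r, 5.0));
        // An RTF at or above 1.0 means the model cannot keep up with
        // real-time playback, so pauses between sentences are expected.
    }
}
```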
Sorry for the late reply. I did see the RTF printed in the sherpa-onnx UI. Thank you for explaining what it means.