Pause insertion request

Open riqbelcher opened this issue 11 months ago • 16 comments

Would it be possible to add support for pause markers such as [PAUSE 1.0], which in this case would insert a 1-second pause before continuing with the next section of text?

riqbelcher avatar Feb 11 '25 20:02 riqbelcher

Can take a look. It should also break for approximately that long on newline '\n'

remsky avatar Feb 12 '25 09:02 remsky

Hi Remsky,

Thank you that’s great news.

I have another idea. It’s a little complex to describe, but it’s actually straightforward.

I’d like to use your software to create audiobooks with a little more depth.

Basically, the feature would be: each character that talks will have his/her unique voice/accent.

On the software side, it would need to support ‘n’ characters, with each character having a configurable voice or mix of voices.

The user of your software would need to add markers for the ‘narrator’ and for each character, e.g. [1] .. [n].

The user would then define somewhere in the software which number maps to which voice.

I already have DeepSeek automatically marking each character with the relevant brackets, and have tested on a number of chapters from different sources.

The result would be an audio production with different voices for each of the characters…

I think this would be a unique feature, one which I have not seen done anywhere else. What do you think?

Is this the sort of thing you could imagine building into FastAPI ?

Best regards

Richard

riqbelcher avatar Feb 12 '25 21:02 riqbelcher

Conceptually, I'm all for this extension. I've been using various non-AI TTS apps to read back books. Voices range from tolerable at best to great but too spendy.

While FastKoko, etc. aren't great on expression, the voices are the best to date. Expression, though welcome, isn't mission critical. I'd be interested in a simple markup scheme as well as a way to save a markup script.

Regarding pausing on \n, that would be a disaster with some text formats I encounter. Please include a way to disable pauses on \n.

RBEmerson970 avatar Feb 12 '25 22:02 RBEmerson970

@riqbelcher I've actually been using a similar idea in my own work, and it shouldn't be too bad to implement here as a development endpoint. I've been using an XML-style tag. Most LLMs should be familiar with it and not find it too hard to adapt the tagging. Do you think something like this would work for your use case? I can try to make it more flexible.

e.g.

<voice name="af_heart">
  Once upon a time...
</voice>
<voice name="am_micheal(1)+af_nicole(2)">
  Hello there!
</voice> 
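
A minimal sketch of how a client script might read this markup, assuming exactly the tag shape shown above. The helper name `parse_voice_blocks` is hypothetical and is not part of Kokoro-FastAPI; the voice string, including any blend such as `a(1)+b(2)`, is passed through untouched.

```python
import re

# Matches <voice name="...">text</voice>; DOTALL lets the text span several lines.
VOICE_BLOCK = re.compile(r'<voice\s+name="([^"]+)">\s*(.*?)\s*</voice>', re.DOTALL)


def parse_voice_blocks(markup: str) -> list:
    """Return (voice, text) pairs in document order.

    Hypothetical helper: text outside any <voice> block is ignored.
    """
    return [(m.group(1), m.group(2)) for m in VOICE_BLOCK.finditer(markup)]


if __name__ == "__main__":
    sample = """<voice name="af_heart">
  Once upon a time...
</voice>
<voice name="am_micheal(1)+af_nicole(2)">
  Hello there!
</voice>"""
    for voice, text in parse_voice_blocks(sample):
        print(voice, "->", text)
```

Each pair could then be sent as its own speech request with the matching voice.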

@RBEmerson970 The newline split is built in pretty far down. I can take a look though. Worst case I could try to add something to force it, but it may be simpler to add functionality that just replaces \n with " " in the WebUI, or you could do so now with a script via the API.

I'm curious about what type of format would be disastrous with the pause on newline though, it could be I'm not fully understanding.
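
A minimal sketch of the script-side workaround for hard-wrapped text: collapse line breaks to single spaces before the text is sent to the API. The function name `flatten_newlines` is an assumption for illustration only.

```python
import re


def flatten_newlines(text: str) -> str:
    """Replace every run of line breaks (plus surrounding whitespace) with one space."""
    # The pattern only matches where at least one \r or \n is present,
    # so ordinary spaces inside a line are left alone.
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


if __name__ == "__main__":
    wrapped = "Hard-wrapped\ntext from\r\nan ebook.\n\n Next line."
    print(flatten_newlines(wrapped))
```

Note that this also removes paragraph breaks, so any pause that is wanted between paragraphs would be lost as well.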

remsky avatar Feb 13 '25 06:02 remsky

@remsky I like your XML approach. Inserting pauses would then be straightforward. The objective is to be able to selectively place a pause of any duration at any point. If an ebook is preprocessed so that pauses are placed before the TTS processes the file, the script could apply specific pause durations for commas and other punctuation. This XML approach would also be useful in future iterations where other tags could be implemented, like specific voices, languages, accents, and maybe even emotions.

For now, the user would create their own script to preprocess the ebook.

riqbelcher avatar Feb 19 '25 18:02 riqbelcher

I too would like to see a pause tag, one that can be user-set for however long I want the pause to be, like PAUSE_2.5 or PAUSE_1.25 or whatever. Someone has already implemented PAUSE_x tags in kokoro, so maybe have a look at their files to get an idea of how to implement it in Kokoro-FastAPI? I would just use that one, but it has issues for me and I prefer a web interface anyhow.

What I don't understand is why, since Kokoro uses espeak-ng, it cannot use SSML tags like <break time=2s>, given that espeak-ng does have SSML support. I have tried many Kokoro gits and none of them will do SSML.

vrepub avatar Mar 14 '25 06:03 vrepub

It works fine for me, with these control characters in the text: [voice=custom_mix] [pause=0.8] Yes. So simple! [voice=af_sky] [pause=1.7] But disturbing thoughts continue to arise in that. [voice=custom_mix] [pause=0.8] How do you know they are disturbing?

In addition, I use a python3 GUI that Grok3 created for me, in which I can set the pause length after a punctuation mark, the read speed, and voice blending.

Image

The GUI can also process several tasks in parallel. See Screenshot.

Patrick-Ric avatar Apr 28 '25 05:04 Patrick-Ric

No, does not work!

Listen: https://voca.ro/1caAjCpCsnxv

Hi I like you [pause=3] Yes. So simple! [pause=1.7] But disturbing thoughts continue to arise in that. [pause=2.8] How do you know they are disturbing?

vrepub avatar Apr 28 '25 06:04 vrepub

It works right away for me (I used the "Kokoro TTS GUI"). Listen: https://voca.ro/1lnLzOohGw7w For that I used a text file, and it is important to pay attention to the exact formatting: the control characters must appear alone at the beginning of a line.

Patrick-Ric avatar Apr 30 '25 10:04 Patrick-Ric

The [pause=..] syntax doesn't seem to work for me either. The pause tags get read by the model.

Code used
from pathlib import Path

from openai import OpenAI

HERE = Path(__file__).parent

TEXT = """Yes. So simple!
[pause=1.7]
But disturbing thoughts continue to arise in that.
"""


def main():
    # Point the OpenAI client at the local Kokoro-FastAPI server.
    client = OpenAI(base_url="http://kokoro.orb.local/v1", api_key="not-needed")
    output_path = HERE / "output.mp3"
    # Stream the synthesized speech straight to a file next to this script.
    with client.audio.speech.with_streaming_response.create(
        model="kokoro",
        voice="af_heart",
        input=TEXT,
    ) as response:
        response.stream_to_file(output_path)


if __name__ == "__main__":
    main()

sloria avatar May 13 '25 21:05 sloria

I think you should try it with the GUI. However, I must add that adding pauses can lead to poorer pronunciation, especially in short sentences, so it is only advisable where it is really needed. I therefore insert only a few pauses, e.g. for a new chapter and a change of speaker.

Patrick-Ric avatar May 13 '25 23:05 Patrick-Ric

It sounds like the client you're using is handling the pause tags. My use case requires programmatic access to kokoro, so the pause handling would have to happen on the backend (i.e. this repo).

sloria avatar May 14 '25 04:05 sloria

I understand. Of course, it would be great if kokoro could manage this directly.

Patrick-Ric avatar May 14 '25 14:05 Patrick-Ric

My use case requires programmatic access to kokoro, so the pause handling would have to happen on the backend (i.e. this repo).

Well. That is what you would like to have happen. You could also process the text programmatically, make a separate request to the backend for each segment of speech, and then combine the results by inserting whatever amount of silence you want in between, using whatever voice you want for each.
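
A minimal sketch of that client-side approach, under stated assumptions: pauses are written as [pause=SECONDS], the backend can return headerless 16-bit mono PCM at 24 kHz (an assumption to verify against your server), and `synthesize` is a hypothetical caller-supplied function that performs one speech request and returns the raw bytes. Plain concatenation like this only works for raw PCM, not for container formats such as MP3.

```python
import re
from typing import Callable, List, Tuple, Union

PAUSE_TAG = re.compile(r"\[pause=(\d+(?:\.\d+)?)\]")
Segment = Tuple[str, Union[str, float]]


def split_on_pauses(text: str) -> List[Segment]:
    """Split text into ("text", str) and ("pause", seconds) segments, in order."""
    segments: List[Segment] = []
    pos = 0
    for match in PAUSE_TAG.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def silence(seconds: float, sample_rate: int = 24000, sample_width: int = 2) -> bytes:
    """Zero-valued mono PCM of the given length (assumed 16-bit samples at 24 kHz)."""
    return b"\x00" * (round(seconds * sample_rate) * sample_width)


def render(text: str, synthesize: Callable[[str], bytes], sample_rate: int = 24000) -> bytes:
    """Concatenate synthesized speech and silence into one raw PCM stream.

    `synthesize` is hypothetical: it should send one text segment to the
    backend and return raw PCM bytes in the format assumed by `silence`.
    """
    out = bytearray()
    for kind, value in split_on_pauses(text):
        if kind == "text":
            out += synthesize(value)
        else:
            out += silence(value, sample_rate)
    return bytes(out)
```

The resulting bytes could be wrapped in a WAV header with the standard-library `wave` module before playback.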

I'd caution against confusing a request with necessity.

therealpygon avatar May 20 '25 15:05 therealpygon

👍 for sure. i was being imprecise in my language. built-in pause handling is certainly more of a want than a need

sloria avatar May 20 '25 16:05 sloria

hi there - if kokoro would understand ssml, that would be a gamechanger!

like this: <speak> Hello world. <break time="1000ms"/> This is a short SSML example — with a one second pause. </speak>
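
Until something like this exists in the backend, a client could translate a small SSML subset into text and pause segments itself. The following is a minimal sketch that assumes an un-namespaced <speak> root and only understands <break time="..."/> with "ms" or "s" units; `ssml_to_segments` is a hypothetical helper and says nothing about what Kokoro supports.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple, Union

Segment = Tuple[str, Union[str, float]]


def parse_break_time(value: str) -> float:
    """Convert an SSML time value such as "1000ms" or "2s" to seconds."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"unsupported break time: {value!r}")


def ssml_to_segments(ssml: str) -> List[Segment]:
    """Flatten <speak> content into ("text", str) and ("pause", seconds) segments."""
    root = ET.fromstring(ssml)
    segments: List[Segment] = []

    def add_text(text) -> None:
        # Skip empty pieces and normalise internal whitespace.
        if text and text.strip():
            segments.append(("text", " ".join(text.split())))

    add_text(root.text)
    for child in root:
        if child.tag == "break":
            segments.append(("pause", parse_break_time(child.get("time", "0s"))))
        else:
            # Other tags are not interpreted; only their text is kept.
            add_text("".join(child.itertext()))
        add_text(child.tail)
    return segments


if __name__ == "__main__":
    demo = '<speak> Hello world. <break time="1000ms"/> One second later. </speak>'
    print(ssml_to_segments(demo))
```

Each text segment could then be synthesized separately, with silence of the given length inserted between them.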

thojank avatar Oct 16 '25 11:10 thojank