
example_fast_api/async_server.py with kokoro: concurrent generations are mixed

Open · jlmeunier opened this issue 8 months ago · 1 comment

Hi,

My goal is to process multiple requests in streaming mode to reduce latency. I looked at async_server.py and have some questions.

I'm unsure about the purpose of the async_server.py example, because a comment inside says Coqui doesn't support multiple queries, but Kokoro does no better. In fact, when two browser clients send their text at almost the same time, their respective outputs are not kept separate: they are mixed, with one client receiving speech that should have gone to the other. (Note: with server.py, the second client would have received an HTTP 503.)

I saw your explanations in issue #286 on concurrency support, so I deduce that the same Kokoro engine cannot handle multiple requests. Correct?

But then, assuming I set up a pool of Kokoro engines, what's the proper way to know that an engine has finished streaming its output? Any hint? For now my best guess is to imitate what you did in server.py, i.e. at the end of the play_text_to_speech function I would mark the used engine as "idle".
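
Concretely, here's a rough sketch of what I have in mind (the EnginePool wrapper is my own invention, and I'm assuming the blocking play() call only returns once the stream has finished playing):

```python
import queue

from RealtimeTTS import TextToAudioStream, KokoroEngine

class EnginePool:
    """Hypothetical pool: an engine is checked out per request and
    put back (marked idle) once its stream has finished playing."""

    def __init__(self, size: int = 2):
        self._idle = queue.Queue()
        for _ in range(size):
            # Each engine loads its own copy of the model.
            self._idle.put(KokoroEngine())

    def speak(self, text: str):
        engine = self._idle.get()       # blocks until some engine is idle
        try:
            stream = TextToAudioStream(engine)
            stream.feed(text)
            stream.play()               # blocking; returns when playback ends
        finally:
            self._idle.put(engine)      # the "mark as idle" step
```

If blocking is a problem, maybe the on_audio_stream_stop callback of TextToAudioStream would be the right hook instead?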

Does that sound appropriate to you?

Thanks, JL

jlmeunier avatar Apr 24 '25 12:04 jlmeunier

Probably using a single engine makes the most sense; that avoids loading the model into VRAM multiple times. It's probably best then to create a worker thread that processes all incoming user synthesis requests sequentially. I'd synthesize them one by one with the play method within that worker: as soon as play returns, I'd immediately start the next synthesis.
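
Something like this rough sketch (the queue and worker are just scaffolding for illustration; I'm assuming the blocking play call returns once a request has been fully synthesized):

```python
import threading
import queue

from RealtimeTTS import TextToAudioStream, KokoroEngine

requests = queue.Queue()              # incoming synthesis jobs (plain text here)
engine = KokoroEngine()               # one engine, model loaded into VRAM once
stream = TextToAudioStream(engine)

def synthesis_worker():
    # Serializes all synthesis: one request at a time, in arrival order.
    while True:
        text = requests.get()
        stream.feed(text)
        stream.play()                 # blocks until this request is done
        requests.task_done()          # then the next request starts immediately

threading.Thread(target=synthesis_worker, daemon=True).start()

# Callers just enqueue text; the worker drains the queue in order.
requests.put("First request.")
requests.put("Second request.")
requests.join()                       # wait until both have been synthesized
```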

I don't think synthesizing in parallel makes sense; it only slows every individual synthesis down. If you still need it, try creating multiple engines, say a pool of 2, 3 or 4, and distribute the user requests across them.

KoljaB avatar Apr 24 '25 12:04 KoljaB