Any way to run multiple clients concurrently?
I want to reduce latency as much as possible and increase the maximum number of concurrent users on one RTX 5090. The model fits in VRAM almost 8 times over, but running multiple workers slows generation down so much that it isn't helpful (at least the way I'm doing it).
I've tried chunking and streaming, but the ratio of generation time to real-time playback is already close to 1:1 with one user.
An AI assistant suggested that I might be able to "export to TensorRT and use multiple execution contexts", but it couldn't confirm whether this is possible with Chatterbox, so I haven't pursued it, not knowing much about this area.
Thanks for the great model; it rivals the SOTA paid services.
For non-multilingual use, try Chatterbox-VLLM (focused on throughput) or my fork, chatterbox fast (focused on latency).
Hey, you can also load multiple instances: a 3090 can handle at least two concurrent Chatterbox instances, which is what I'm doing. I'm testing with more anyway, so we'll see. I'm also going to export to TensorRT and see if I can make that work.
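A minimal sketch of that multi-instance pattern: preload N model instances, give each one its own worker, and feed them from a shared request queue so at most N generations run on the GPU at once. `load_model` and `synthesize` below are hypothetical stand-ins for the actual Chatterbox loading/inference calls, not the real API.

```python
import queue
import threading

NUM_INSTANCES = 2  # e.g. two instances fit comfortably on a 3090

def load_model(idx):
    # placeholder: load one Chatterbox instance onto the GPU here
    return f"model-{idx}"

def synthesize(model, text):
    # placeholder: run TTS inference on this instance here
    return f"{model} synthesized: {text}"

# Each request is (text, reply_queue); workers pull from this shared queue.
requests = queue.Queue()

def worker(idx):
    model = load_model(idx)  # each worker owns one preloaded instance
    while True:
        item = requests.get()
        if item is None:  # shutdown signal
            break
        text, reply = item
        reply.put(synthesize(model, text))

threads = [threading.Thread(target=worker, args=(i,), daemon=True)
           for i in range(NUM_INSTANCES)]
for t in threads:
    t.start()

# Client side: submit a request and block until the result comes back.
reply = queue.Queue()
requests.put(("hello world", reply))
result = reply.get()
print(result)

# Clean shutdown: one sentinel per worker.
for _ in threads:
    requests.put(None)
for t in threads:
    t.join()
```

Because Python threads release the GIL during GPU inference, this keeps both instances busy without the worker-count oversubscription that slows everything down; the queue caps in-flight generations at the number of loaded instances.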