
Provide a means to unload the model from GPU memory like `OLLAMA_KEEP_ALIVE`

Open · johnnyfleet opened this issue 10 months ago · 5 comments

I have the container running on a home server with an NVIDIA GPU. Most of the time it is idle, and while idle it still consumes a decent amount of GPU memory.

Can we introduce logic to unload the model from GPU memory after a set timeout period?

Similar to how Ollama unloads its models, which can be configured through the OLLAMA_KEEP_ALIVE variable.

See https://github.com/ollama/ollama/blob/main/docs/faq.md for more details.
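
A minimal sketch of what such a keep-alive could look like, assuming the server holds a single PyTorch model handle and runs an asyncio event loop. Everything here is hypothetical rather than Kokoro-FastAPI's actual API: the KOKORO_KEEP_ALIVE variable, load_model(), and the poll interval are illustrations only.

```python
import asyncio
import os
import time

import torch

# Hypothetical knob, modeled on OLLAMA_KEEP_ALIVE; not a real setting.
KEEP_ALIVE_SECONDS = float(os.getenv("KOKORO_KEEP_ALIVE", "300"))

_model = None
_last_used = time.monotonic()
_lock = asyncio.Lock()


async def get_model():
    """Load the model on first use and stamp the access time."""
    global _model, _last_used
    async with _lock:
        if _model is None:
            _model = load_model()  # hypothetical loader for the Kokoro weights
        _last_used = time.monotonic()
        return _model


async def idle_unloader():
    """Background task: drop the model once it has sat idle past the window."""
    global _model
    while True:
        await asyncio.sleep(30)  # arbitrary poll interval
        async with _lock:
            if _model is not None and time.monotonic() - _last_used > KEEP_ALIVE_SECONDS:
                _model = None
                torch.cuda.empty_cache()  # hand cached blocks back to the driver
```

One caveat: dropping the Python references and calling torch.cuda.empty_cache() frees the model weights, but the CUDA context itself (typically a few hundred MB) stays resident until the process exits, which is what motivates the child-process approach discussed further down. The unloader would be started from a startup hook, e.g. asyncio.create_task(idle_unloader()).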

Describe alternatives you've considered: I can stop the container, but I'd prefer to have it always available, with a slight warm-up time if it has been idle for a while.

Additional context: This seems to be a well-known approach; Ollama and Open WebUI do the same to be good citizens and free up memory when idle.

johnnyfleet avatar Feb 25 '25 23:02 johnnyfleet

This feature is crucial for always-running services like this one that hold GPU memory while idle.

I usually do this by creating a separate child process for each model run, so the GPU memory is released once the child process finishes.
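
A minimal sketch of that pattern, assuming PyTorch-style CUDA usage; kokoro_tts.synthesize is a hypothetical stand-in for whatever actually loads and runs the model:

```python
import multiprocessing as mp


def _worker(text: str, queue) -> None:
    # The child loads the model itself; when it exits, its entire CUDA
    # context, and with it all GPU memory, is released by the driver.
    from kokoro_tts import synthesize  # hypothetical stand-in for the real model code
    queue.put(synthesize(text))


def run_isolated(text: str):
    """Run one TTS job in a child process so GPU memory is freed on exit."""
    ctx = mp.get_context("spawn")  # spawn: don't inherit the parent's CUDA state
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(text, queue))
    proc.start()
    result = queue.get()  # read before join(), or large payloads can deadlock the pipe
    proc.join()
    return result
```

The cost is paying the model load time on every call, which is the warm-up trade-off mentioned above.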

Does anyone have an idea how to implement this in this project?

Ali-Flt avatar Apr 02 '25 00:04 Ali-Flt

Working with 16 GB of VRAM on a 4090, I've never had a problem keeping Fast Koko awake while using Open WebUI running Llama3.3:70b under Docker on Windows 11. If this "improvement" doesn't break anything, so be it, but "if it ain't broke, don't fix it".

RBEmerson970 avatar Apr 02 '25 01:04 RBEmerson970

@RBEmerson970 some people, myself included, run these services on machines that serve multiple purposes and multiple people. The same machine might be used for other GPU-intensive tasks that need as much memory as is available (fine-tuning, etc.). In such cases, idle services like this one matter.

It's definitely a crucial missing feature imo, and why would it break anything anyway?

Ali-Flt avatar Apr 02 '25 06:04 Ali-Flt

I decided to write my own Kokoro TTS server (far simpler than this project, obviously) that runs each TTS request in a separate process, so the model is unloaded immediately.

In case it helps anyone, here is the code.

The Dockerfile and docker-compose file can also be found in the same repository.
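
For reference, the same per-request-process idea can be sketched with only the standard library (Python 3.11+, which added max_tasks_per_child); my_tts.synthesize is a hypothetical stand-in for the real model code:

```python
import asyncio
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

from fastapi import FastAPI, Response

app = FastAPI()

# max_tasks_per_child=1 retires each worker after a single task, so the
# model and its CUDA context are torn down after every request.
executor = ProcessPoolExecutor(
    max_workers=1,
    max_tasks_per_child=1,
    mp_context=mp.get_context("spawn"),
)


def synthesize_in_child(text: str) -> bytes:
    from my_tts import synthesize  # hypothetical stand-in, imported inside the child
    return synthesize(text)


@app.post("/tts")
async def tts(text: str):
    loop = asyncio.get_running_loop()
    audio = await loop.run_in_executor(executor, synthesize_in_child, text)
    return Response(content=audio, media_type="audio/wav")
```

Every request pays a cold start here; a keep-alive timeout like the one sketched earlier trades that latency for holding memory during the idle window.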

Ali-Flt avatar Apr 02 '25 19:04 Ali-Flt

It would be nice to have this. I want it not only to free up memory but also to reduce power usage. A P4 that idles at 6-7 watts goes to 24 watts just from having the model in memory; a P40 that idles at 9 watts goes to 50 watts with this model loaded. So it would be great if it could unload after a set time. I would rather it not unload immediately, but after 5 minutes would be great.

I have 5 GPUs in my server but about a dozen applications that use them. Even Immich now unloads the models it uses for face recognition and object detection in your photos.

Mikec78660 avatar Jun 25 '25 13:06 Mikec78660