LocalAI fails to unload model to make room for new model

Open Expro opened this issue 3 weeks ago • 4 comments

LocalAI version: v3.7.0, hipblas image

Describe the bug If an application requests a new model while another model is already loaded and occupying the necessary VRAM, LocalAI fails to stop the previous model to release enough VRAM for the new one. The user needs to manually log into LocalAI and click Stop to release the VRAM.

To Reproduce

  1. Load a model that takes up most of the VRAM
  2. Try to load another model that also needs most of the VRAM (more than is currently available); see the example requests after this list
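
For illustration, a minimal way to trigger this through LocalAI's OpenAI-compatible API; the port and model names below are assumptions (default port 8080, two placeholder models that each nearly fill the GPU):

    # First request loads model-a, which occupies most of the VRAM
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "model-a", "messages": [{"role": "user", "content": "hello"}]}'

    # Second request asks for model-b; instead of unloading model-a first,
    # the backend fails with an out-of-memory error
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "model-b", "messages": [{"role": "user", "content": "hello"}]}'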

Expected behavior LocalAI stops / unloads the old model to make room for the new one.

Expro commented Nov 14 '25

I am experiencing the same issue. Same version and image, AMD. I have to restart the container in order for it to clear the used model and VRAM.

rpc error: code = Unknown desc = Exception calling application: HIP out of memory. Tried to allocate 62.57 GiB. GPU 0 has a total capacity of 7.98 GiB of which 4.70 GiB is free. Of the allocated memory 2.86 GiB is allocated by PyTorch, and 76.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

GreymatterPrison commented Nov 15 '25

Currently there are two ways to make LocalAI unload models and free memory during runtime:

  • By setting LOCALAI_SINGLE_ACTIVE_BACKEND, which makes sure only one model is loaded at a time
  • By setting LOCALAI_WATCHDOG_IDLE=true to automatically unload inactive models after a specific amount of time (which you can set with LOCALAI_WATCHDOG_IDLE_TIMEOUT). You can also enable LOCALAI_WATCHDOG_BUSY to make LocalAI terminate backends that keep the GPU busy for a long time (with a timeout set via LOCALAI_WATCHDOG_BUSY_TIMEOUT); both options are sketched below

Note, there are no other mechanisms because we can't reliably estimate VRAM usage across different backends (yet?). Some other considerations were made in https://github.com/mudler/LocalAI/issues/6068 and https://github.com/mudler/LocalAI/issues/5352
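
For illustration, a minimal sketch of these settings as container environment variables (e.g. in an env file passed to the container); the timeout values are example values to adjust to your workload:

    # Option 1: keep only a single model loaded at any time
    LOCALAI_SINGLE_ACTIVE_BACKEND=true

    # Option 2: let the watchdog reclaim memory automatically,
    # unloading idle models and terminating long-busy backends
    LOCALAI_WATCHDOG_IDLE=true
    LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
    LOCALAI_WATCHDOG_BUSY=true
    LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m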

mudler commented Nov 17 '25

Same issue

sanjaysinghmp09 commented Nov 20 '25

I am following the docs, which state there is a web UI for Runtime Settings under the management interface, yet when I try to access the management interface I get a 404 error. https://localai.io/features/runtime-settings/#watchdog-settings

I'm running docker compose; is there something I am missing?
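
In the meantime, a minimal sketch of passing the watchdog settings through docker compose instead of the web UI; the service name, image tag, port mapping, and timeout values are assumptions to adapt to the actual setup:

    services:
      localai:
        image: localai/localai:latest-gpu-hipblas  # assumed tag; use your existing image
        environment:
          - LOCALAI_WATCHDOG_IDLE=true
          - LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
          - LOCALAI_WATCHDOG_BUSY=true
          - LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m
        ports:
          - "8080:8080"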

GreymatterPrison commented Nov 22 '25