LocalAI
LocalAI fails to unload model to make room for new model
LocalAI version: v3.7.0, hipblas image
Describe the bug If an application requests a new model while another model that occupies the necessary VRAM is already loaded, LocalAI fails to stop the previous model to release enough VRAM for the new one. The user has to manually log into LocalAI and click Stop to release the VRAM.
To Reproduce
- Load a model that takes up most of the VRAM
- Try to load another model that also needs most of the VRAM (more than is currently available); see the example requests below
Expected behavior LocalAI stops / unloads the old model to make room for the new one.
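For illustration, this is roughly how the situation can be triggered through the OpenAI-compatible API (a sketch only: the model names, prompt, and port 8080 are assumptions, substitute the models you actually have installed):

```bash
# First request: loads a model that fills most of the available VRAM
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "large-model-a", "messages": [{"role": "user", "content": "hello"}]}'

# Second request: names a different large model while the first one is still
# resident. Instead of unloading "large-model-a" first, the backend fails with
# an out-of-memory error like the HIP one quoted below.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "large-model-b", "messages": [{"role": "user", "content": "hello"}]}'
```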
I am experiencing the same issue. Same version and image, AMD. I have to restart the container in order for it to clear the used model and VRAM.
rpc error: code = Unknown desc = Exception calling application: HIP out of memory. Tried to allocate 62.57 GiB. GPU 0 has a total capacity of 7.98 GiB of which 4.70 GiB is free. Of the allocated memory 2.86 GiB is allocated by PyTorch, and 76.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Currently there are two ways to make LocalAI unload models and free memory during runtime:
- By setting `LOCALAI_SINGLE_ACTIVE_BACKEND`, which will make sure there is only one model loaded at a time
- By setting `LOCALAI_WATCHDOG_IDLE=true` to automatically unload inactive models after a specific amount of time (which you can specify with `LOCALAI_WATCHDOG_IDLE_TIMEOUT`). You can also enable `LOCALAI_WATCHDOG_BUSY` to make sure LocalAI terminates backends that keep the GPU busy for a long time (you can set a timeout in `LOCALAI_WATCHDOG_BUSY_TIMEOUT`). See the sketch after this note for how these can be set.

Note, there is no other mechanism because we can't reliably guess the VRAM usage across different backends (yet?). Some other considerations were made in https://github.com/mudler/LocalAI/issues/6068 and https://github.com/mudler/LocalAI/issues/5352
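A minimal sketch of the corresponding environment variables for a container deployment (the `true` values and the `15m`/`5m` timeouts are illustrative assumptions, not documented defaults; pick the option that fits your setup rather than enabling everything):

```bash
# Option 1: keep only one model/backend loaded at a time
LOCALAI_SINGLE_ACTIVE_BACKEND=true

# Option 2: watchdog-based unloading.
# Unload models that have been idle for longer than the idle timeout.
LOCALAI_WATCHDOG_IDLE=true
LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
# Optionally also terminate backends that keep the GPU busy past the busy timeout.
LOCALAI_WATCHDOG_BUSY=true
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m
```

With docker compose, these go under the `environment:` section of the LocalAI service (or in a file referenced via `env_file:`); a bare `.env` file next to the compose file is only used for variable substitution and is not passed into the container by itself.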
Same issue
I am following the docs, which state there is a web UI for Runtime Settings under the management interface, yet when I try to access the management interface I get a 404 error. https://localai.io/features/runtime-settings/#watchdog-settings
I'm running docker compose; is there something I'm missing?