text-generation-webui
CLI flag to offload weights to system RAM when not in use
Description
This feature request is to add a CLI flag that offloads/caches model weights to system RAM whenever the software is in an "idle" state. In that idle state, VRAM should theoretically be almost entirely free. This would of course add latency when inference is eventually requested, but one example use case it enables is running an LLM and an LDM (e.g. Stable Diffusion) sequentially on the same GPU without OOM errors.
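For clarity, here is a minimal sketch of the idea in PyTorch (the helper names are hypothetical, not existing code in this repo):

```python
import torch

def offload_to_ram(model):
    """Move the model's weights to system RAM and release cached VRAM."""
    model.to("cpu")
    # Return cached allocator blocks to the driver. Note that the CUDA
    # context itself still occupies a few hundred MB of VRAM.
    torch.cuda.empty_cache()

def restore_to_gpu(model, device="cuda:0"):
    """Bring the weights back onto the GPU before the next inference call."""
    model.to(device)
```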
Additional Context
This functionality differs from `--auto-devices` because it would offload the weights entirely to system RAM while idle. It would be similar to what the Stable Diffusion web UI offers via its `--medvram` flag (https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Optimizations):

> only one [component] is in VRAM at all times, sending others to CPU RAM
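Roughly, that optimization keeps every component in CPU RAM and swaps each one onto the GPU only for its own forward pass. A simplified sketch of that pattern in PyTorch (this hook-based approach is an assumption about the technique, not the SD web UI's actual code):

```python
import torch

def enable_sequential_offload(components, device="cuda:0"):
    """Keep all components in CPU RAM; move each onto the GPU only for
    the duration of its own forward pass, evicting the others first."""
    for module in components:
        module.to("cpu")

        def pre_hook(mod, args):
            for other in components:
                if other is not mod:
                    other.to("cpu")  # evict everything else to system RAM
            mod.to(device)           # then load this component into VRAM
            # Inputs must already be on `device` for the forward to succeed.

        module.register_forward_pre_hook(pre_hook)
```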
I did open a discussion for this, but the lack of activity there has led me to open an issue so this can be tracked.
In the sd_api_pictures extension you can run both the SD web UI and Ooba at once. There's a checkbox to manage VRAM. It takes seconds to switch models stored on an NVMe drive.
https://github.com/oobabooga/text-generation-webui/blob/main/docs/Extensions.md
Interesting, thanks for sharing that. It's not 100% what I'm looking for, since I'd rather unload and reload the text-gen web UI's weights from the SD web UI, not the other way around as that extension does. But I can just add a couple of endpoints that call `unload_model` and `reload_model`, and that'll do it. I'll leave this issue open for a bit longer, but will likely close it if no further discussion comes up.
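For anyone wanting the same setup, a rough sketch of what such endpoints could look like (FastAPI is chosen here for brevity; the routes, and the assumption that `unload_model`/`reload_model` are importable from `modules.models`, are illustrative):

```python
from fastapi import FastAPI

# Assumes these functions exist in this repo as referenced above.
from modules.models import reload_model, unload_model

app = FastAPI()

@app.post("/api/v1/model/unload")
def api_unload():
    unload_model()  # drop the weights so VRAM is free for e.g. SD
    return {"status": "unloaded"}

@app.post("/api/v1/model/reload")
def api_reload():
    reload_model()  # load the previously selected model back into VRAM
    return {"status": "reloaded"}
```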
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.