text-generation-webui
Unload and reload models on request
An important step towards running different neural networks in parallel on the same GPU more efficiently.
The core idea and use case are simple: when oobabooga is used alongside other memory hogs like Stable Diffusion (the sd-api-pictures extension) or Tortoise-TTS (not yet implemented), this simple unload function leaves a lot more video memory for those other neural networks to work with. Once they finish their jobs, the LLM can be loaded back into VRAM.
This is the first of the possible memory-handling improvements discussed in #309.
Tested on my machine: unloading Pyg-2.7B-8bit is almost instant, and loading it back (from the RAM cache) takes ~7 seconds, which I consider an acceptable delay compared to the image generation itself.
Pyg-6B-8bit is a bit slower but still tolerable.
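
For reference, a minimal sketch of what the unload/reload pair in server.py could look like; the `modules.shared` globals and the `load_model()` helper are assumptions based on the repo layout rather than the exact PR code:

```python
import gc

import torch

from modules import shared              # assumed shared-state module holding model/tokenizer/model_name
from modules.models import load_model   # assumed existing loader


def clear_torch_cache():
    # Collect dropped Python references first, then ask CUDA to release the freed blocks.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def unload_model():
    # Drop the checkpoint and tokenizer and give the VRAM back to other processes.
    shared.model = shared.tokenizer = None
    clear_torch_cache()


def reload_model():
    # Load the last selected model again; fast when the weights are still in the OS file cache.
    unload_model()
    shared.model, shared.tokenizer = load_model(shared.model_name)
```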


I like the reloading idea, as I've been switching to another model and then returning to the updated model.
I'm not sure what the app state would be in an 'unloaded' condition; perhaps we just need the reload implementation?
Well, the state after unloading the checkpoint is undetermined. One won't be able to generate a response, but the resulting error is not fatal, and the chat/text generation can be resumed once the model is loaded back in; that much I have tested.
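
A small guard along those lines is all the generation callback would need to check to turn that error into a readable message; this is purely illustrative, reusing the assumed `shared` module from the sketch above:

```python
from modules import shared  # assumed shared-state module, as in the sketch above


def can_generate() -> bool:
    # Checked before generating so that an unloaded checkpoint yields a
    # "no model loaded" message instead of an unhandled exception mid-chat.
    return shared.model is not None and shared.tokenizer is not None
```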
In the latest gradio version, there is now a circle icon in dropdown menus that unselects the currently selected option. I have modified the PR to use this button to unload the model from memory.
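
Roughly, the wiring could look like the sketch below; the callback, the placeholder model names, and the `server` imports are assumptions carried over from the sketch above, not the exact code in the PR:

```python
import gradio as gr

from modules import shared
from server import load_model, unload_model  # assumes this PR's functions live in server.py


def on_model_change(model_name):
    # Clearing the dropdown with the circle icon sends an empty value, which is
    # treated as "unload"; selecting any name (re)loads that model.
    if not model_name:
        unload_model()
        return "Model unloaded."
    shared.model_name = model_name
    shared.model, shared.tokenizer = load_model(model_name)
    return f"Loaded {model_name}."


with gr.Blocks() as demo:
    dropdown = gr.Dropdown(choices=["pygmalion-2.7b", "pygmalion-6b"], label="Model")  # placeholder names
    status = gr.Markdown()
    dropdown.change(on_model_change, inputs=dropdown, outputs=status)
```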

Your buttons were more functional because they allowed the very same model to be reloaded without having to locate it in the dropdown list, but I found that they occupied a lot of space while being a very niche feature. It should still be possible to create unload/reload buttons inside an extension.
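
For example, an extension could add the buttons back with something like this; the extension name and the `server` import are assumptions, not actual code from the repo:

```python
# extensions/model_control/script.py -- hypothetical extension
import gradio as gr

from server import reload_model, unload_model  # assumes this PR's functions live in server.py


def ui():
    # text-generation-webui calls each extension's ui() while building the interface,
    # so the buttons live in the extension's own block instead of the main model tab.
    unload_btn = gr.Button("Unload model")
    reload_btn = gr.Button("Reload model")

    unload_btn.click(unload_model, inputs=None, outputs=None)
    reload_btn.click(reload_model, inputs=None, outputs=None)
```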
That's a nice way to save space!
Though I'll still need the reload_model() function in server.py, as it would be called by an extension that is trying to manage VRAM. I'll just introduce it as part of the sd-api-pictures update; it will make more sense in context.
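
As a rough illustration of how the sd-api-pictures side could use it: the endpoint and payload follow the public Stable Diffusion web UI txt2img API, while the `server` import is the same assumption as above.

```python
import base64

import requests

from server import reload_model, unload_model  # same assumption as above

SD_URL = "http://127.0.0.1:7860"  # address of the Stable Diffusion web UI


def generate_picture(prompt):
    unload_model()  # hand the VRAM over to Stable Diffusion
    try:
        response = requests.post(
            f"{SD_URL}/sdapi/v1/txt2img",
            json={"prompt": prompt, "steps": 20},
            timeout=300,
        )
        response.raise_for_status()
        return base64.b64decode(response.json()["images"][0])  # PNG bytes
    finally:
        reload_model()  # bring the LLM back once SD is done
```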
The unload_model and newly added reload_model functions should also be exposed as endpoints in the API extension; I don't think their scope should be limited to extensions developed within this UI. The SD web UI exposing its own API endpoints is the only reason the SD API extension is possible in the first place.
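
A minimal sketch of what such endpoints could look like; the paths and the standalone `http.server` setup are assumptions and don't mirror the actual api extension's layout:

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from threading import Thread

from server import reload_model, unload_model  # same assumption as above


class ModelControlHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Two hypothetical endpoints: POST /api/v1/unload-model and POST /api/v1/reload-model.
        if self.path == "/api/v1/unload-model":
            unload_model()
        elif self.path == "/api/v1/reload-model":
            reload_model()
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")


def start_api(port=5001):
    # Serve in a daemon thread so the web UI keeps running alongside it.
    httpd = ThreadingHTTPServer(("0.0.0.0", port), ModelControlHandler)
    Thread(target=httpd.serve_forever, daemon=True).start()
```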