
Add option to keep model in VRAM instead of unloading it after each generation

Open Dampfinchen opened this issue 1 year ago • 13 comments

I am running the FP16 version of Flux and the fp16 T5 text encoder on my RTX 2060 laptop with 32 GB RAM. I was surprised to see WebUI Forge being dramatically faster than Comfy (2 minutes vs. 11 minutes), so great job on the optimization here, @lllyasviel!

However, running it in FP16 is really tight on my RAM as well, so loading parts of the model into VRAM takes quite a bit of time. When I press Generate, moving models adds around 1 minute to the generation time.

So it would be really cool to have an option to turn this behavior off. Once loaded, the model should stay in VRAM until I close the program. This way there'd be no model-moving step between generations and it would speed up the experience a lot. Please consider it.

Dampfinchen avatar Aug 17 '24 19:08 Dampfinchen

Is it due to the T5-xxl? I was looking for an option to keep the model in VRAM in the settings when I came across this message: [screenshot]

I wish there was an option to choose which models to keep in VRAM and which to offload to RAM. I have 24GB of VRAM, and with GGUF-Q8 my VRAM usage is never full, yet each time I change the prompt I see the VRAM unloading and then loading. Sometimes it gives me an OOM error, and I have to hit the Generate button multiple times before it starts working again.

I noticed that the VRAM usage sometimes varies randomly: sometimes it's around 14GB and sometimes it's 22GB. Once it shot past the VRAM into shared GPU memory. This led me to think that maybe models are loaded into VRAM in a random order, and depending on whether the UNet is loaded first or last, OOM can happen.

Iory1998 avatar Aug 17 '24 23:08 Iory1998

I just used ComfyUI, and it seems that models are now kept in VRAM: generation with the same prompt takes 30s for me, while changing the prompt takes 47.47s (AR: 832x1216, Steps: 20, commit: bb222ce). I am using the fp16 model and I have an RTX 3090. It's likely only a matter of time before @lllyasviel implements it. [screenshot]

Iory1998 avatar Aug 18 '24 00:08 Iory1998

> I just used ComfyUI, and it seems that models are now kept in VRAM: generation with the same prompt takes 30s for me, while changing the prompt takes 47.47s (AR: 832x1216, Steps: 20, commit: bb222ce). I am using the fp16 model and I have an RTX 3090. It's likely only a matter of time before @lllyasviel implements it.

Indeed this is very frustrating. I only have 8GB, but even when I do have enough it seems to load and unload a lot.

andy8992 avatar Aug 19 '24 17:08 andy8992

> I just used ComfyUI, and it seems that models are now kept in VRAM: generation with the same prompt takes 30s for me, while changing the prompt takes 47.47s (AR: 832x1216, Steps: 20, commit: bb222ce). I am using the fp16 model and I have an RTX 3090. It's likely only a matter of time before @lllyasviel implements it.

> Indeed this is very frustrating. I only have 8GB, but even when I do have enough it seems to load and unload a lot.

Use the GGUF Q8 if you can; it's 99% identical to the FP16. If you can't, try the Q6 version and the T5xxl fp8, which takes half the VRAM. Remember, you need enough VRAM for the UNet + CLIP + T5xxl + LoRA + ControlNet models combined.
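As a rough illustration of that budgeting advice, here is a back-of-the-envelope tally (the file sizes are approximate assumptions, and real VRAM use also includes activations and framework overhead):

    # Back-of-the-envelope VRAM budget -- sizes are approximate assumptions.
    components_gb = {
        "flux_unet_gguf_q8": 12.7,  # Flux.1-dev Q8_0 is roughly 12-13 GB
        "t5xxl_fp8": 4.9,           # about half of the ~9.8 GB fp16 encoder
        "clip_l": 0.25,
        "lora": 0.3,                # varies per LoRA
    }
    total_gb = sum(components_gb.values())
    print(f"~{total_gb:.1f} GB of weights before activations and overhead")
    # Comfortably fits a 24 GB card; on 8-12 GB cards some offloading is unavoidable.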

Iory1998 avatar Aug 19 '24 17:08 Iory1998

Ah, I just meant I was getting constant model movement even in XL.

andy8992 avatar Aug 19 '24 18:08 andy8992

Also found this unpredictable loading/unloading of models on my 3090. Maybe a checkbox to "keep model in VRAM" would help?

tazztone avatar Aug 19 '24 20:08 tazztone

Yes, that model offloading is pretty annoying and it's not stable on an RTX 3060 12GB with 32GB RAM. I'm searching for a way to turn off offloading.

mase-sk avatar Aug 22 '24 12:08 mase-sk

The issue is in memory_management, line 621:

if loaded_model in current_loaded_models:

The loaded_model is always different from what is in current_loaded_models: current_loaded_models -> <backend.memory_management.LoadedModel object at 0x000001C72885BC10>, to load = <backend.memory_management.LoadedModel object at 0x000001C72C08ADD0>. Maybe some hash usage could solve that. I've spent some time on that, but I don't know the classes well enough; I was trying to check, like in sd_models, with if model_data.forge_hash == current_hash:, but I never found an equivalent. Hope @lllyasviel could fix it; it should be quite easy with this information?
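To illustrate the hash/equality idea (a hypothetical sketch, not Forge's actual code; the class layout and names are assumed), giving the wrapper value-based __eq__ and __hash__ would let the `in` check match a freshly created wrapper around an already-loaded model:

    # Hypothetical sketch (not Forge's actual code): compare LoadedModel wrappers
    # by the model they wrap, not by wrapper identity, so the membership test
    # `loaded_model in current_loaded_models` can match a re-created wrapper.
    class LoadedModel:
        def __init__(self, model):
            self.model = model  # the underlying model object

        def __eq__(self, other):
            return isinstance(other, LoadedModel) and self.model is other.model

        def __hash__(self):
            return id(self.model)

    unet = object()                      # stands in for the real model
    current_loaded_models = [LoadedModel(unet)]
    candidate = LoadedModel(unet)        # a fresh wrapper around the same model
    print(candidate in current_loaded_models)  # True with value-based equality

With Python's default identity-based equality, that final check prints False even though the same underlying model is already in the list, which matches the behavior described above.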

Arnaud3013 avatar Sep 08 '24 02:09 Arnaud3013

> The issue is in memory_management, line 621:
>
> if loaded_model in current_loaded_models:
>
> The loaded_model is always different from what is in current_loaded_models: current_loaded_models -> <backend.memory_management.LoadedModel object at 0x000001C72885BC10>, to load = <backend.memory_management.LoadedModel object at 0x000001C72C08ADD0>. Maybe some hash usage could solve that. I've spent some time on that, but I don't know the classes well enough; I was trying to check, like in sd_models, with if model_data.forge_hash == current_hash:, but I never found an equivalent. Hope @lllyasviel could fix it; it should be quite easy with this information?

I agree. I noticed weeks ago that my VRAM usage would vary drastically (from 12GB to 24+ GB and into shared memory) when generating the exact same prompt. I could see that memory management tries to load and unload models until they somehow fit. I deduced from that that the models get loaded into VRAM in a random order. I think that's why each time I change a LoRA, my VRAM gets unloaded entirely and loaded again. Why would you unload the Flux.1 model and the text encoders too?

Iory1998 avatar Sep 09 '24 00:09 Iory1998

Since the last couple of changes it seems to have gotten better. I can now load Flux Q8 and T5 Q8 and keep them in VRAM between generations. [screenshot]

tazztone avatar Sep 09 '24 06:09 tazztone

Still having the issue on my side. I've done a git pull and a force reset to HEAD, no change. Always 0 models kept loaded. [screenshot]

Which version are you using? I did more tests: with SDXL there's no issue, it's just with Flux (NF4 or Q4).

Arnaud3013 avatar Sep 09 '24 08:09 Arnaud3013

> Yes, that model offloading is pretty annoying and it's not stable on an RTX 3060 12GB with 32GB RAM. I'm searching for a way to turn off offloading.

Same here with an RTX 3060 12GB but with 16GB of RAM, using a Flux Dev NF4 model I converted. It's not like I plan to use the machine between generations; it would be really nice to keep the model in VRAM and/or RAM to keep things smooth.

sgtlighttree avatar Feb 17 '25 13:02 sgtlighttree

This seems to be the only place with this information, and I've tried for like 6 hours to get this "keep model loaded" behavior and it will not work.

I have "Keep models in VRAM" enabled and it always does a 8 second generation, then swap to another model before the next generation when doing batches

How are people able to get it to say "1 models kept loaded"? Mine always says 0, and I've done tons of searching trying to find the solution. I tried the "Enable T5 in memory" and the "keep one on device" options too.

Please help; the guy up above seems to have the solution, but where was this setting made or changed?

I'm on a commit from February 2025, way after this post, so why is it not working?

Commit hash: 4a30c157691a33c2cc6e8d4fe861907429428f1e

Edit: Today I lowered GPU Weights and it is finally working. It seems to be related to that: if I set it to 6500 it works, but over 7000 it says "keep 0 loaded", so it's a problem with my VRAM, not the software. Glad it works!

####################################################################################| 4/4 [00:12<00:00, 3.23s/it]
[Unload] Trying to free 4287.94 MB for cuda:0 with 1 models keep loaded ... Current free memory is 5652.91 MB ... Done.
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
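For context, a simplified reading of that log line (an interpretation, not Forge's actual implementation): a model is only unloaded when the memory that must be freed exceeds what is already free, so with GPU Weights low enough the check passes and the model stays resident:

    # Simplified interpretation of the "[Unload] Trying to free ..." log line above.
    # Not Forge's actual implementation; the numbers are copied from that log.
    def free_memory(needed_mb, free_mb, loaded_models):
        """Unload models from the end of the list until enough memory is free."""
        kept = list(loaded_models)
        while free_mb < needed_mb and kept:
            evicted = kept.pop()          # drop a model to reclaim its VRAM
            free_mb += evicted["size_mb"]
        return kept

    loaded = [{"name": "flux_unet", "size_mb": 6500.0}]
    # 5652.91 MB is already free and only 4287.94 MB is needed,
    # so nothing is unloaded and "1 models keep loaded" is reported.
    print(free_memory(4287.94, 5652.91, loaded))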

mirandaandnicole avatar Oct 04 '25 00:10 mirandaandnicole