[Bug]: Inconsistent stability when changing from V1.5 model to Inpainting V1.5 model.
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
About 60% of the time when I try to change from V1.5 to Inpainting, my GPU throws a CUDA error about not enough allocated space, or my system completely freezes and my GPU gets hotter and hotter until I force the system off.
From my limited understanding it seems like the program is trying to load the V1.5 inpainting model without unloading the currently selected model from VRAM.
Steps to reproduce the problem
Load the WebUI and try to switch to the inpainting model. I usually run the regular V1.5 model a few times before switching to inpainting; I'm not sure if that is required to reproduce the problem, but it's what I would be doing.
What should have happened?
It should have given me 101 million dollars but instead it crashed my GPU, lol
Commit where the problem happens
606519813dd998140a741096f9029c732ee52d2a
What platforms do you use to access UI ?
Windows
What browsers do you use to access the UI ?
Mozilla Firefox
Command Line Arguments
--xformers --medvram
Additional information, context and logs
I don't have fancy specs, but I was able to switch between other models consistently without issues before. The problem has only appeared with the V1.5 models.
Also want to note: sometimes, when switching to the inpainting model does work, it will then throw CUDA allocation errors when trying to generate anything. I believe this is because two models are loaded onto the GPU, since if I load just the inpainting model (as the first model loaded after a fresh restart) it works fine.
Processor: AMD Ryzen 7 2700 Eight-Core Processor 3.20 GHz; Installed RAM: 16.0 GB; GPU: RTX 2060 6 GB
TY for everything, I wish I was smart enough to help more.
After turning off xformers the problem isn't happening; I will report back if it recurs.
getting the same issue without xformers
I'm also seeing extremely high RAM usage when switching models on --use-cpu=all. Maybe a duplicate of #2180.
Changing to an inpainting model calls load_model() and creates a new model, but the previous model is not removed from memory; even calling gc.collect() does not remove the old model.
So if you keep switching between inpainting and non-inpainting models, or vice versa, the leak keeps growing.
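Roughly the pattern, as a stand-alone sketch (hypothetical names, not the actual webui code): as long as some other module still holds a reference to the old model object, reassigning the variable and calling gc.collect() cannot free it.

```python
import gc

class FakeModel:
    """Stands in for a loaded checkpoint; holds a large buffer."""
    def __init__(self, name):
        self.name = name
        self.weights = bytearray(100 * 1024 * 1024)  # ~100 MB so the leak is visible

_lingering = []  # hypothetical module that quietly keeps a reference

def load_model(name):
    model = FakeModel(name)
    _lingering.append(model)  # the hidden extra reference
    return model

sd_model = load_model("v1-5")
sd_model = load_model("v1-5-inpainting")  # old model is reassigned...
gc.collect()
# ...but the "v1-5" FakeModel is still alive because _lingering references it,
# so memory now holds both models.
```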
I also think this inpainting hijack logic should be improved; it hijacks even when the model isn't an inpainting model, since do_inpainting_hijack() is always called and that method doesn't check anything.
do_inpainting_hijack just applies the code changes from the Runway repo: https://github.com/runwayml/stable-diffusion. It doesn't change any model behavior, and at some point the current stable-diffusion requirement should be updated to point to that repo if the CompVis one doesn't get updates.
This is a more fundamental problem of how to deal with loading a model with a different config, not just the inpainting model. Any model with the same config can just update the weights in place, and memory usage is preserved. The fact that gc.collect() doesn't clear the old model is interesting, however. It means that something is keeping a pointer to the old model alive and preventing it from being cleaned up.
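For reference, a sketch of why the same-config path preserves memory: nn.Module.load_state_dict() copies the new values into the parameters that already exist, so no second copy of the model is allocated (generic PyTorch example, not webui code).

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)            # stands in for the currently loaded model
old_ptr = model.weight.data_ptr()  # address of the existing weight storage

# Pretend these are the weights of another checkpoint with the same config
new_state = {k: torch.randn_like(v) for k, v in model.state_dict().items()}
model.load_state_dict(new_state)   # copies values into the existing tensors

assert model.weight.data_ptr() == old_ptr  # same storage: updated in place
```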
I guess the following could be a stop-gap. It first moves the current model's weights to the CPU to free up GPU memory.
```diff
--- a/modules/sd_models.py
+++ b/modules/sd_models.py
@@ -255,6 +255,8 @@ def reload_model_weights(sd_model, info=None):
     if sd_model.sd_checkpoint_info.config != checkpoint_info.config or should_hijack_inpainting(checkpoint_info) != should_hijack_inpainting(sd_model.sd_checkpoint_info):
         checkpoints_loaded.clear()
+        shared.sd_model.cpu()
+        del shared.sd_model
         load_model(checkpoint_info)
         return shared.sd_model
```
@random-thoughtss I tried those lines to delete the sd_model, but it didn't work; it must be something else.
About the inpainting logic: the LatentInpaintDiffusion here just defines the properties masked_image_key and concat_keys, which aren't used anywhere. In runwayml/stable-diffusion, LatentInpaintDiffusion also overrides the get_input method and uses those properties there.
I may be wrong, but it looks like the LatentInpaintDiffusion in this repository is doing nothing.
Also, if do_inpainting_hijack() is a global hijack and not specific to inpainting, it should live inside sd_hijack and be named according to what it actually does.
All the code added to enable the runwayml inpainting looks like a workaround for a more generic problem: using models that differ from the default SD models and need a custom config. A mechanism for that arguably already exists: adding a yaml file next to the ckpt file makes it be used as the config file. If that works with the current code, then LatentInpaintDiffusion is useless and the inpainting model just needs a yaml file with the correct parameters.
So, for me, a better solution would be: if the global code changes are an improvement, add them to sd_hijack, and ship a yaml config file for the runwayml inpainting ckpt alongside it, as is done for other ckpt files that require custom config files.
As for the loading leak this issue is about, it needs more research to find the bug.
Edit:
Just confirmed: the runwayml inpainting model works after removing the inpainting hijack and adding a yaml file with the correct conditioning_key and in_channels.
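For anyone trying the same thing, a minimal sketch of checking those two fields in a yaml placed next to the checkpoint (the path and the field layout follow the CompVis/runwayml config structure and are assumptions here, not webui code):

```python
# Sketch: verify the two fields that let the runwayml inpainting model load
# without the hijack. Path and field layout are assumptions.
from omegaconf import OmegaConf

config = OmegaConf.load("models/Stable-diffusion/sd-v1-5-inpainting.yaml")

print(config.model.params.conditioning_key)                # expected: hybrid
print(config.model.params.unet_config.params.in_channels)  # expected: 9
```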
@jn-jairo The changes were made to make it compatible with the official in-painting model config: https://github.com/runwayml/stable-diffusion/blob/main/configs/stable-diffusion/v1-inpainting-inference.yaml. The only problem is that this config lives in a separate repo from the checkpoint weights, in a pretty obscure spot. Most people using the model (including you, it seems) have no idea this config exists, so we made the decision to hardcode the changes for now to ease installation. Perhaps we should ask Runway if they could include the config along with the model.
I do agree that the name should probably be updated to do_runway_hijack at this point, but it's probably better to just switch to the Runway repo at some point instead of keeping the differences locally. The only risk is that the CompVis and Runway codebases could diverge if CompVis decides to continue development.
I tried those lines to delete the sd_model, but it didn't work
What exactly didn't work? Are you still taking up GPU memory, or are the models taking more CPU memory? The .cpu() call moves the model weights to the CPU in place, so they should not take up any more VRAM. If that does not fix it, then something else is leaking into VRAM, not the model weights. One thing to note if you are tracking memory usage with nvidia-smi: torch does not immediately return GPU memory when it is freed; it reuses blocks it has already allocated but freed to create new tensors.
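To tell the two apart, torch's own counters are more reliable than nvidia-smi, since torch caches freed blocks. Something like this (generic PyTorch calls, meant to be run inside the webui process where shared.sd_model exists; not existing webui code) shows allocated vs. cached memory around the stop-gap above:

```python
import gc
import torch
from modules import shared  # webui module holding the loaded model

def report(tag):
    # memory held by live tensors vs. memory torch keeps cached for reuse
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

report("before .cpu()")
shared.sd_model.cpu()     # move weights off the GPU in place
gc.collect()
torch.cuda.empty_cache()  # return cached blocks so nvidia-smi drops too
report("after .cpu()")
```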
@random-thoughtss it is the RAM, not the VRAM; the RAM increases every time load_model() is called, and because the leak is the same size as the model I assumed it was the model.
I know it is hard to find the yaml file, but a better option would be to include a list of yaml files in this repo and have a setting to enable/disable matching the config by file name, like:
configs/sd-v1-5-inpainting.yaml
models/Stable-diffusion/sd-v1-5-inpainting.ckpt
That way we can add the yaml files for the popular models to help users who want them, without hard-coding the changes; a rough sketch of the lookup follows.
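Something along these lines (find_config_for is a hypothetical helper, not existing webui code; the default config path is a placeholder): look for a yaml in configs/ whose basename matches the checkpoint, and fall back to the default config otherwise.

```python
import os

def find_config_for(ckpt_path, configs_dir="configs",
                    default_config="configs/v1-inference.yaml"):
    """Hypothetical helper: pick a config by matching the checkpoint's filename."""
    name = os.path.splitext(os.path.basename(ckpt_path))[0]

    # 1. a yaml right next to the ckpt still wins (current behaviour)
    local = os.path.splitext(ckpt_path)[0] + ".yaml"
    if os.path.exists(local):
        return local

    # 2. otherwise look for a bundled config with the same basename
    bundled = os.path.join(configs_dir, name + ".yaml")
    if os.path.exists(bundled):
        return bundled

    # 3. fall back to the default SD config
    return default_config

print(find_config_for("models/Stable-diffusion/sd-v1-5-inpainting.ckpt"))
```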
I've been experiencing this on model swap. My whole system becomes unresponsive as SD takes up the system RAM. I believe Python is failing to trigger GC soon enough and the system slows down or locks up. I have a workaround that seems to be working: using prlimit to set the process's max memory usage to about 80% of my available RAM (16 GB * 0.8) and running with the nice command.
```
$ prlimit --as=1669578752
$ nice bash webui.sh
```
Watching the system with top, I can see Python climbing up to the limit, then going back down without tanking the system.
This has helped, but isn't 100%. Loading v1-5 pruned is still big and likely to tank.
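The same address-space cap can also be set from inside Python before the heavy imports, if you'd rather not wrap the launcher. A sketch using the standard resource module (Linux only; the limit value here is just an example matching the 80%-of-16-GiB figure above):

```python
import resource

# Cap the process address space at ~12.8 GiB (80% of 16 GiB). Allocations
# beyond this raise MemoryError instead of dragging the whole system into swap.
limit_bytes = int(16 * 2**30 * 0.8)
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```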
More testing: I increased my swap file from 2 GiB to 8 GiB. That allowed me to switch models three times; the fourth switch locked the system. I believe the models are not being released from memory on model swap.
Just to report on the progress I made: it is indeed a reference problem. Some places are keeping a reference to the model, which prevents the garbage collector from freeing the memory.
I am checking it with ctypes.c_long.from_address(id(shared.sd_model)).value and there are multiple references.
I am eliminating the references, but there are still some left to find; it will take a while to track down everything.
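For anyone who wants to follow along, this is the kind of check involved (generic Python run inside the webui process, where shared.sd_model is the loaded model):

```python
import ctypes
import gc
import sys
from modules import shared

obj = shared.sd_model  # note: this local name itself adds one reference

# Raw refcount read straight from the CPython object header
print(ctypes.c_long.from_address(id(obj)).value)

# Same information through the stdlib (adds a temporary reference of its own)
print(sys.getrefcount(obj))

# List the objects keeping it alive, to track down who holds the leak
for referrer in gc.get_referrers(obj):
    print(type(referrer), repr(referrer)[:120])
```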
@jn-jairo With your PRs getting merged, is this issue resolved now? Or are there still more issues here?
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4098
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4142
The same thing happens, even with a freshly installed Automatic1111.
I think I've figured it out. The issue only happens if you run a batch of two images or more before switching checkpoints. When you run a batch with two or more images, TorchHijack takes a reference to the current sampler and doesn't let go when the batch is done. Unfortunately, the sampler being referenced also has a reference to the current checkpoint, so none of them can be garbage collected until TorchHijack lets go of the sampler.
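A stripped-down illustration of that chain (hypothetical classes, not the actual webui code): the hijack keeps the sampler, the sampler keeps the model, so the whole chain stays alive after the batch finishes until the hijack drops its reference.

```python
import gc

class Model:
    pass

class Sampler:
    def __init__(self, model):
        self.model = model      # sampler -> model

class Hijack:
    def __init__(self):
        self.sampler = None
    def grab(self, sampler):
        self.sampler = sampler  # hijack -> sampler, kept after the batch ends

hijack = Hijack()

model = Model()
sampler = Sampler(model)
hijack.grab(sampler)            # what happens on a batch of two or more images

del model, sampler              # checkpoint switch drops the obvious references
gc.collect()

# The model is still reachable through hijack.sampler.model, so it cannot be freed.
print(hijack.sampler.model is not None)  # True until the sampler is released

hijack.sampler = None           # the fix: let go of the sampler after the batch
gc.collect()                    # now the old checkpoint can actually be collected
```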
#5065 should resolve this.