[Bug]: linux memory leak when switching models
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
I have set up a server for my team to use.
Config is as below:

```json
{
  "sd_checkpoint_cache": 0,
  "sd_vae_checkpoint_cache": 0
}
```

However, every time I switch a model, RAM increases. It never goes down unless the webui is killed/restarted.
This is observed on linux only, not on my windows installation. In the end, I have to kill the Linux server every night.
I have upgraded to Torch 2.0.0, but the same thing was observed after upgrading. I also split my GPU server into 4 webui instances; the same thing is observed on my T4 (single instance) as well.
We start with this, preloaded with safetensors model A. Please look at the second line of the memory readout; this number is going to change.
In the UI, we switch the model to model B.
Now we switch it back to model A.
Now we switch to model B again.
This keeps happening even with the same models, so there is no need to prove it further with other/more models. The issue is that RAM usage keeps climbing until OOM, which then freezes the entire server. We already have 200 GB of RAM plus 100 GB of swap, but sigh. It would be great if this could be solved.
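For anyone trying to reproduce this, a minimal sketch of how to watch that number from a shell while switching models in the UI (it assumes the webui runs as a single `launch.py` process; the `pgrep` pattern is just an example, adjust it to your setup):

```bash
# Log the webui's resident memory (VmRSS) every 10 seconds.
# The pgrep pattern is an assumption - point it at your actual process.
PID=$(pgrep -f "launch.py" | head -n1)
while kill -0 "$PID" 2>/dev/null; do
    printf '%s  RSS: %s kB\n' "$(date +%T)" "$(awk '/VmRSS/ {print $2}' /proc/"$PID"/status)"
    sleep 10
done
```

If the leak is present, the logged RSS jumps by roughly a model's size on every switch and never comes back down.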
This issue has been mentioned in #2180 and by someone in #7451, where it remained unanswered. In #6532 it seems it was fixed, but it really hasn't been.
Steps to reproduce the problem
- install in linux
- install all required components
- add in two or more models
- switch between models
- observe RAM fly high
What should have happened?
With model caching disabled, switching models should not increase RAM.
Commit where the problem happens
python: 3.8.10 • torch: 2.0.0+cu118 • xformers: N/A • gradio: 3.22.0 • commit: faeef6fc • checkpoint: 4a408d2491
What platforms do you use to access the UI ?
Linux
What browsers do you use to access the UI ?
Google Chrome
Command Line Arguments
--api --listen
List of extensions
controlnet imagebrowser systeminfo
Console logs
No errors in the log until OOM.
Additional information
No response
Me too. RAM always increases every time the model is switched, even if it is the model used before.
I added some parameters (--xformers --opt-split-attention --no-half-vae --medvram) and found them to be of little use.
I get this too. `webui.sh` gets killed when switching models every so often.
I get this too. For me it seems that roughly the full size of the model leaks into CPU RAM every time I switch models. Need to restart the python server frequently when switching models to prevent this. Reproduction is very consistent. Just switch models, generate one image, and switch models again.
Eventually, OOM causes system instability, followed by webui.sh being killed.
I am experiencing the exact same issue. Sometimes webui.sh gets killed after consuming all memory, sometimes my X session freezes and I have to reboot the entire thing.
Can confirm I get the same issue running Docker Desktop + WSL2: assign 14 GB of RAM, switch models a few times, and observe RAM go up until the container stops responding/crashes.
possible reasons?
- WSL2 memory leak with pytorch stuff
- torch isn't unloading models properly on linux
- xformers is modifying the model or something to optimize it as it loads and this new model reference is never released? (only the old unoptimized one could be getting released?)
- something up with new torch 2.0?
If it helps, I'm running on Debian, not in a Docker container. Using torch 2.0.0 as well.
python: 3.10.6 • torch: 2.0.0+cu118 • xformers: 0.0.18 • gradio: 3.23.0 • commit: [22bcc7be](https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/22bcc7be428c94e9408f589966c2040187245d81)
Dunno if it might help, but on Colab I'm using....

```bash
wget -qq --show-progress http://launchpadlibrarian.net/367274644/libgoogle-perftools-dev_2.5-2.2ubuntu3_amd64.deb
wget -qq --show-progress https://launchpad.net/ubuntu/+source/google-perftools/2.5-2.2ubuntu3/+build/14795286/+files/google-perftools_2.5-2.2ubuntu3_all.deb
wget -qq --show-progress https://launchpad.net/ubuntu/+source/google-perftools/2.5-2.2ubuntu3/+build/14795286/+files/libtcmalloc-minimal4_2.5-2.2ubuntu3_amd64.deb
wget -qq --show-progress https://launchpad.net/ubuntu/+source/google-perftools/2.5-2.2ubuntu3/+build/14795286/+files/libgoogle-perftools4_2.5-2.2ubuntu3_amd64.deb
apt install -qq libunwind8-dev
dpkg -i *.deb
rm *.deb
```

and in the notebook (Python), before launching the webui:

```python
os.environ["LD_PRELOAD"] = "libtcmalloc.so"
```

This fixed mem leak issues on Colab. Maybe this can be used as a reference.
I think this might be the source of my recent memory leak problems. Killing the webui doesn't free all the consumed RAM either. It started after upgrading forward into the gradio update; was previously on https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/a9fed7c364061ae6efb37f797b6b522cb3cf7aa2
me too, ubuntu 22.04
@falsonerd Thanks for sharing your solution, it worked perfectly for me in the webui! I added `export LD_PRELOAD=/usr/lib/libtcmalloc.so` to the bash script I use to run launch.py and now memory doesn't increase when I switch checkpoints. They also load a LOT faster.

Here's the full script I run to launch the webui from a virtual env:

```bash
#!/usr/bin/env bash
export LD_PRELOAD=/usr/lib/libtcmalloc.so
env VIRTUAL_ENV=/var/lib/sdwebui/stable-diffusion-webui/venv /var/lib/sdwebui/stable-diffusion-webui/venv/bin/python launch.py
```
For any Arch Linux users looking to apply this fix, `/usr/lib/libtcmalloc.so` is part of the `gperftools` package.
Seems `libtcmalloc` does help.

Ubuntu 20.04:

- Install `libtcmalloc-minimal4` via apt
- Add `export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4` to webui-user.sh

Seeing about half as much bloating after swapping through a dozen checkpoints. Loading seems to be about the same (when switching to a checkpoint that had been recently loaded, even in a previous instance, it'll load faster due to OS disk caching).

I'll need to reboot to see if this resolves the permanent mem leak I'm seeing within the first couple hours of booting and running the webui.
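Side note: to confirm the preload actually took effect, you can check the running process maps (again assuming a single `launch.py` process; adjust the pattern and library name to your install):

```bash
# Verify that tcmalloc is actually mapped into the running webui process.
PID=$(pgrep -f "launch.py" | head -n1)
if grep -m1 -q tcmalloc /proc/"$PID"/maps; then
    echo "tcmalloc is loaded"
else
    echo "tcmalloc is NOT loaded - check the LD_PRELOAD path"
fi
```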
> Dunno if it might help, but on Colab I'm using.... This fixed mem leak issues on Colab. Maybe this can be used as a reference.
this works, thanks!
I switch back and forth between the ControlNet models, and the memory continues to rise until it explodes. I have already used `libtcmalloc.so`.
Having the same issue on Mint 21, 32 GB RAM with an 8 GB swap file. The system used to grind to a halt when it ate all my RAM and swap.
I increased swap to 16 GB thinking I hadn't set enough; it ate that too and caused a lockup, and I had to ctrl+alt+backspace to kill my session.
Just installed the libtcmalloc fix that Kadah mentioned earlier. Seems to be OK at the moment; will report back in an hour or two if my system locks up.
The device is more stable now. I had one system freeze after switching models a lot. RAM usage stays lower than before.
If you want to use this on Fedora 38 you have to:
- sudo dnf install gperftools-2.9.1-5.fc38.x86_64 (or whatever version is actually available)
- create a custom launch script like Custom.sh in your stable-diffusion folder and make it executable (right-click on the file)
- open the file with an editor and type

```bash
#!/usr/bin/env bash
python3.10 -m venv env
source env/bin/activate
export LD_PRELOAD=/usr/lib64/libtcmalloc.so
python launch.py --xformers --autolaunch --theme dark
```

Notes:
- `python3.10 -m venv env` = needed for the correct python version
- `export LD_PRELOAD=/usr/lib64/libtcmalloc.so` = for loading the RAM fix
- `python launch.py --xformers --autolaunch --theme dark` = my setup with xformers
using libtcmalloc, I found the RAM usage goes down on its own over time, though the container can still build up and crash if you switch models quickly.
> Seems libtcmalloc does help. Ubuntu 20.04: Install libtcmalloc-minimal4 via apt, add export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 to webui-user.sh. [...] I'll need to reboot to see if this resolves the permanent mem leak I'm seeing within the first couple hours of booting and running the webui.
Update: The leak from swapping models appears to be mostly fixed by using libtcmalloc, but I still have no clue on the cause of the mystery leak over time from just having it run idle. That one is worse as just restarting the webui does not free the mem, only rebooting will.
Can confirm facing the same issue. Switching to `libtcmalloc` was a fix on Ubuntu 23.04.
I would like to profile which part of the server is causing this problem, anyone with a hint?
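Not verified against the webui specifically, but the standard gperftools route would be a place to start: the full `libtcmalloc.so` (not the `_minimal` variant) ships a heap profiler enabled through the `HEAPPROFILE` environment variable, which would at least show whether the growth comes from native allocations. The paths below are examples for Debian/Ubuntu and are assumptions, not taken from this thread:

```bash
# Sketch: run the webui under gperftools' heap profiler.
# Requires the full libtcmalloc (package libgoogle-perftools4 / gperftools).
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
export HEAPPROFILE=/tmp/webui_heap     # periodic dumps: /tmp/webui_heap.0001.heap, ...
./webui.sh --api --listen

# Afterwards, inspect a dump (google-pprof comes with google-perftools):
# google-pprof --text "$(which python3)" /tmp/webui_heap.0002.heap
```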
> The device is more stable now. [...] If you want to use this on Fedora 38 you have to: [...]
For anyone using Debian 11:

`sudo apt install google-perftools` and/or `sudo apt install libtcmalloc-minimal4`

File locations are different, so this worked for me:

```bash
#!/usr/bin/env bash
python3 -m venv env
source env/bin/activate
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
python3 launch.py --listen --no-half --medvram --upcast-sampling
```

And that should do the trick.
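Since the library path keeps differing between distros, a quick way to find where yours ended up (assuming it is installed at all):

```bash
# Locate the tcmalloc shared library; the path varies per distro/package.
ldconfig -p | grep tcmalloc
# Fallback if ldconfig shows nothing:
find /usr/lib* -name 'libtcmalloc*' 2>/dev/null
```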
> using libtcmalloc, I found the RAM usage goes down on its own over time, though the container can still build up and crash if you switch models quickly.
Seconding this. tcmalloc does not fix the issue completely. If you switch models frequent enough, it crashes with an oom.
There are enough models out there that switching between models on both Windows and Linux can cause memory leaks.
Well, this is not about Linux specifically; the same thing happens on Windows. It's painful, actually, having to restart the webui every couple of minutes. Any fix? This problem was first discovered several updates ago and here we still are.
We can use `su` to switch to root, then run `echo 3 > /proc/sys/vm/drop_caches` to clear the cache.
Same problem here. Some (but not all) issues that seem to be about the same problem:
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/8377
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/7451
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/8394
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5691
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5550
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2858
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5250
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2180
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/234
No solution in sight for this problem?
it's even worse with sdxl, I have to restart the webui every two gens to not freeze the entire system...
Just to echo that, with Stable Diffusion XL it's now common to switch between checkpoints. Once for base and once for refine. Doing so a few times, or switching to another checkpoint causes memory to shoot up and frequently get killed for out-of-memory.
My system - Ubuntu 22.04, 32GB RAM.
Launched automatic1111 with ./webui.sh --medvram
dmesg:

```
[19253.424833] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-a6c45a68-500f-4286-9184-934514323b61.scope,task=python3,pid=159469,uid=1000
[19253.424897] Out of memory: Killed process 159469 (python3) total-vm:50746396kB, anon-rss:25113304kB, file-rss:71448kB, shmem-rss:16520kB, UID:1000 pgtables:64628kB oom_score_adj:0
```
I also encountered the same problem. Is there any way to solve it?
Same on Windows. I have a swap of 128 GB, which helps a bit, but still, it's shooting up to 80-90 GB easily.
Same happens to me, using libtcmalloc_minimal.so.4 on Linux. It's not a major problem as long as I don't switch models; I haven't tried waiting a long time before switching. As of right now it pretty much makes the workflow with the latest Stable Diffusion XL model (base + refiner) really hard to use.