
[Bug]: Linux memory leak when switching models

Open jacquesfeng123 opened this issue 1 year ago • 13 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

I have set up a server for my team to use.

Config is as below:

{ "sd_checkpoint_cache": 0, "sd_vae_checkpoint_cache": 0 }

However, every time I switch a model, RAM increases. It never goes down unless the webui is killed/restarted.

This is observed on Linux only, not on my Windows installation. In the end, I have to kill the Linux server every night.

I have upgraded to Torch 2.0.0, but the same thing was observed after upgrading. I have also split my GPU server into 4 webui instances; the same thing is observed on my T4 (single instance) as well.

We start with this, preloaded with safetensors model A. Please look at the second line; this number is going to change. (screenshot)

In the UI, we switch the model to model B. (screenshot)

Now we switch it back to model A. (screenshot)

Now we switch to model B again. (screenshot)

This keeps happening even with the same models, so there is no need to prove it further with other/more models. The issue is that this continues until OOM, which then freezes the entire server. We already have 200 GB of RAM, including 100 GB of swap, but sigh. It would be great if this could be solved.
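
For anyone who wants to reproduce these numbers on their own box, a minimal sketch (assuming standard procps tools; <PID> stands for whatever process id pgrep reports for the webui) that watches the process's resident memory while you switch models in the UI:

# find the webui's python process, then watch its resident set size refresh every 5 seconds
pgrep -af launch.py
watch -n 5 "grep VmRSS /proc/<PID>/status"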

This issue has been mentioned in #2180 and by someone in #7451, but it remained unanswered. It also came up in #6532, where it seems to have been fixed, but it really hasn't been.

Steps to reproduce the problem

  1. Install on Linux
  2. Install all required components
  3. Add two or more models
  4. Switch between models
  5. Observe RAM fly high

What should have happened?

With model caching disabled, switching models should not increase RAM.

Commit where the problem happens

python: 3.8.10  •  torch: 2.0.0+cu118  •  xformers: N/A  •  gradio: 3.22.0  •  commit: faeef6fc  •  checkpoint: 4a408d2491

What platforms do you use to access the UI ?

Linux

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

--api --listen

List of extensions

controlnet imagebrowser systeminfo

Console logs

No errors in the log until OOM

Additional information

No response

jacquesfeng123 avatar Apr 03 '23 11:04 jacquesfeng123

Me too, RAM always increases every time the model is switched, even if it is the model used before. I added some parameters (--xformers --opt-split-attention --no-half-vae --medvram) and found them to be of little use.

Qhao6 avatar Apr 04 '23 01:04 Qhao6

I get this too.

"webui.sh" killed

when switching models every so often

dejl avatar Apr 05 '23 03:04 dejl

I get this too. For me it seems that roughly the full size of the model leaks into CPU RAM every time I switch models. Need to restart the python server frequently when switching models to prevent this. Reproduction is very consistent. Just switch models, generate one image, and switch models again.

Eventually, OOM causes system instability, followed by webui.sh being killed.

AstralCupid avatar Apr 05 '23 07:04 AstralCupid

I am experiencing the exact same issue. Sometimes webui.sh gets killed after consuming all memory, sometimes my X session freezes and I have to reboot the entire thing.

manulsoftware avatar Apr 05 '23 08:04 manulsoftware

Can confirm I get the same issue running Docker Desktop + WSL2: assign 14 GB of RAM, switch models a few times, and observe RAM go up until the container stops responding/crashes.

possible reasons?

  • WSL2 memory leak with pytorch stuff
  • torch isn't unloading models properly on linux
  • xformers is modifying the model or something to optimize it as it loads and this new model reference is never released? (only the old unoptimized one could be getting released?)
  • something up with new torch 2.0?
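
One way to narrow these down (just a sketch, assuming standard procps tools are available inside the WSL2 distro/container) is to compare the webui process's own resident memory with the system-wide picture; if the process RSS grows by roughly a model's size on every switch, the leak is in the process (torch/webui) rather than in WSL2's file cache:

# does the webui's own RSS grow on each switch?
ps -o pid,rss,etime,cmd -p "$(pgrep -d, -f launch.py)"
# or is the growth in cache/buffers instead?
free -h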

Nyxeka avatar Apr 05 '23 18:04 Nyxeka

If it helps, I'm running on Debian, not in a docker container. Using torch 2.0.0 as well.

python: 3.10.6  •  torch: 2.0.0+cu118  •  xformers: 0.0.18  •  gradio: 3.23.0  •  commit: [22bcc7be](https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/22bcc7be428c94e9408f589966c2040187245d81)

dejl avatar Apr 05 '23 21:04 dejl

Dunno if it might help, but on Colab I'm using....

wget -qq --show-progress http://launchpadlibrarian.net/367274644/libgoogle-perftools-dev_2.5-2.2ubuntu3_amd64.deb
wget -qq --show-progress https://launchpad.net/ubuntu/+source/google-perftools/2.5-2.2ubuntu3/+build/14795286/+files/google-perftools_2.5-2.2ubuntu3_all.deb
wget -qq --show-progress https://launchpad.net/ubuntu/+source/google-perftools/2.5-2.2ubuntu3/+build/14795286/+files/libtcmalloc-minimal4_2.5-2.2ubuntu3_amd64.deb
wget -qq --show-progress https://launchpad.net/ubuntu/+source/google-perftools/2.5-2.2ubuntu3/+build/14795286/+files/libgoogle-perftools4_2.5-2.2ubuntu3_amd64.deb
apt install -qq libunwind8-dev
dpkg -i *.deb
rm *.deb
os.environ["LD_PRELOAD"] = "libtcmalloc.so"

This fixed the memory leak issues on Colab for me. Maybe this can be used as a reference.
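
If anyone tries this and is unsure whether the preload actually took effect, a quick check (a sketch; <PID> is the webui's python process id) is to inspect the running process:

# tcmalloc should show up among the mapped libraries if LD_PRELOAD worked
grep tcmalloc /proc/<PID>/maps
# the environment the process was started with can be inspected the same way
tr '\0' '\n' < /proc/<PID>/environ | grep LD_PRELOAD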

NamelessButler avatar Apr 05 '23 23:04 NamelessButler

I think this might be the source of my recent memory leak problems. Killing the webui doesn't free all the consumed RAM either. It started after upgrading forward to the gradio update; I was previously on https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/a9fed7c364061ae6efb37f797b6b522cb3cf7aa2

Kadah avatar Apr 06 '23 07:04 Kadah

Me too, Ubuntu 22.04

nntaoli avatar Apr 06 '23 13:04 nntaoli

@falsonerd Thanks for sharing your solution, it worked perfectly for me in the webui! I added export LD_PRELOAD=/usr/lib/libtcmalloc.so to the bash script I use to run launch.py and now memory doesn't increase when I switch checkpoints. They also load a LOT faster.

Here's the full script I run to launch the webui from a virtual env:

#!/usr/bin/env bash

export LD_PRELOAD=/usr/lib/libtcmalloc.so

env VIRTUAL_ENV=/var/lib/sdwebui/stable-diffusion-webui/venv /var/lib/sdwebui/stable-diffusion-webui/venv/bin/python launch.py

For any Arch Linux users looking to apply this fix, /usr/lib/libtcmalloc.so is part of the gperftools package.
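
For other distros, the library name and path vary (libtcmalloc.so vs libtcmalloc_minimal.so.4, /usr/lib vs /usr/lib64 vs the multiarch directories), so it may be worth locating it before setting LD_PRELOAD; a quick sketch:

# list whatever tcmalloc variants the dynamic linker already knows about
ldconfig -p | grep -i tcmalloc
# or search the usual library directories directly
find /usr/lib* -name 'libtcmalloc*' 2>/dev/null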

prurigro avatar Apr 07 '23 01:04 prurigro

Seems libtcmalloc does help.

Ubuntu 20.04: install libtcmalloc-minimal4 via apt and add export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 to webui-user.sh.
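
A minimal sketch of those two steps (assuming the stock multiarch path on Ubuntu 20.04; check it with the ldconfig command above if in doubt):

sudo apt install libtcmalloc-minimal4
echo 'export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4' >> webui-user.sh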

Seeing about half as much bloating after swapping through a dozen checkpoints. Loading seems to be about the same (when switching to a checkpoint that had been recently loaded, even in a previous instance, it loads faster due to OS disk caching).

I'll need to reboot to see if this resolves the permanent mem leak I'm seeing within the first couple hours of booting and running the webui.

(screenshot)

Kadah avatar Apr 07 '23 07:04 Kadah

(quoting NamelessButler's tcmalloc/LD_PRELOAD instructions from above)

this works, thanks!

jacquesfeng123 avatar Apr 11 '23 06:04 jacquesfeng123

(screenshot) I switch back and forth between the ControlNet model, and the memory continues to rise until it explodes. I have already used libtcmalloc.so.

gunjianpanxdd avatar Apr 14 '23 15:04 gunjianpanxdd

Having the same issue on Mint 21, 32 GB RAM with an 8 GB swap file. The system used to grind to a halt when it ate all my RAM and swap.

I increased swap to 16 GB thinking I hadn't set enough; it ate that too and caused a lock-up. I had to Ctrl+Alt+Backspace to kill my session.

Just installed the libtcmalloc fix that Kadah mentioned earlier. Seems to be OK at the moment. Will report back in an hour or two if my system locks up.

salmon85 avatar Apr 20 '23 20:04 salmon85

The machine is more stable now. I had one system freeze after switching models a lot. RAM usage stays lower than before.

If you want to use this on Fedora 38 you have to:

  1. sudo dnf install gperftools-2.9.1-5.fc38.x86_64 (or whichever version is available)
  2. create a custom launch script like Custom.sh in your stable-diffusion folder and make it executable (right click on the file)
  3. open the file with an editor and type:
#!/usr/bin/env bash
python3.10 -m venv env
source env/bin/activate
export LD_PRELOAD=/usr/lib64/libtcmalloc.so
python launch.py --xformers --autolaunch --theme dark

Notes: python3.10 -m venv env is needed for the correct Python version; export LD_PRELOAD=/usr/lib64/libtcmalloc.so loads the RAM fix; python launch.py --xformers --autolaunch --theme dark is my setup with xformers.

pikatchu2k3 avatar Apr 21 '23 07:04 pikatchu2k3

Using libtcmalloc, I found that RAM usage goes down on its own over time, though the container can still build up and crash if you switch models quickly.

Nyxeka avatar Apr 21 '23 13:04 Nyxeka

(quoting my earlier Ubuntu 20.04 tcmalloc notes and screenshot from above)

Update: The leak from swapping models appears to be mostly fixed by using libtcmalloc, but I still have no clue about the cause of the mystery leak over time from just having it run idle. That one is worse, as just restarting the webui does not free the memory; only rebooting will.

Kadah avatar Apr 24 '23 03:04 Kadah

Can confirm I'm facing the same issue. Switching to libtcmalloc was a fix on Ubuntu 23.04. I would like to profile which part of the server is causing the problem; anyone have a hint?
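
One possible starting point (a sketch I haven't verified against this codebase): since tcmalloc is already being preloaded, the full libtcmalloc.so build from gperftools includes a heap profiler that can be switched on with an environment variable and read back with pprof (google-pprof on Debian/Ubuntu). It only sees native (C/C++) allocation stacks, not Python lines, but that is where the leaked model weights would live. Library path and filenames below are the Ubuntu packaging; adjust for your distro:

# use the full libtcmalloc (not the -minimal variant)
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
export HEAPPROFILE=/tmp/sdwebui.hprof   # writes /tmp/sdwebui.hprof.0001.heap etc. as allocations grow
python launch.py --api --listen

# afterwards, compare dumps to see which call stacks keep the memory alive
google-pprof --text "$(readlink -f venv/bin/python)" /tmp/sdwebui.hprof.0002.heap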

mack-w avatar Apr 28 '23 09:04 mack-w

(quoting pikatchu2k3's Fedora 38 instructions from above)

For anyone using Debian 11:

sudo apt install google-perftools and/or sudo apt install libtcmalloc-minimal4. File locations are different on Debian, so this worked for me:

#!/usr/bin/env bash
python3 -m venv env
source env/bin/activate
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
python3 launch.py --listen --no-half --medvram --upcast-sampling

And that should do the trick.

DutchComputerKid avatar Apr 29 '23 20:04 DutchComputerKid

using libtcmalloc, I found the RAM usage goes down on its own over time, though the container can still build up and crash if you switch models quickly.

Seconding this. tcmalloc does not fix the issue completely. If you switch models frequently enough, it still crashes with an OOM.

bigahega avatar May 02 '23 14:05 bigahega

With enough models, switching between them can cause memory leaks on both Windows and Linux.

wangwenqiao666 avatar May 18 '23 07:05 wangwenqiao666

Well, this is not about Linux specifically; the same thing happens on Windows. It's painful, actually, to restart the webui every couple of minutes. Any fix? This problem was first discovered several updates ago and still here we are.

mockinbirdy avatar May 30 '23 18:05 mockinbirdy

We can use the su command to switch to root, then run echo 3 > /proc/sys/vm/drop_caches to clear the cache.
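
For reference, the same thing can be done without an interactive root shell; note that this only drops the kernel's page/dentry/inode caches, so it frees cache memory but will not reclaim anything the webui process itself is still holding:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'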

wangwenqiao666 avatar Jun 01 '23 09:06 wangwenqiao666

Same problem here. Some (but not all) issues that seem to be about the same problem:

https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/8377
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/7451
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/8394
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5691
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5550
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2858
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5250
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2180
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/234

MakingMadness avatar Jun 11 '23 23:06 MakingMadness

No solution in sight for this problem?

dungeon1103 avatar Jul 15 '23 18:07 dungeon1103

It's even worse with SDXL; I have to restart the webui every two generations to avoid freezing the entire system...

hrkrx avatar Jul 28 '23 09:07 hrkrx

Just to echo that, with Stable Diffusion XL it's now common to switch between checkpoints: once for the base and once for the refiner. Doing so a few times, or switching to another checkpoint, causes memory to shoot up, and the process frequently gets killed for out-of-memory.

My system - Ubuntu 22.04, 32GB RAM.
Launched automatic1111 with ./webui.sh --medvram

dmesg:

[19253.424833] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-a6c45a68-500f-4286-9184-934514323b61.scope,task=python3,pid=159469,uid=1000
[19253.424897] Out of memory: Killed process 159469 (python3) total-vm:50746396kB, anon-rss:25113304kB, file-rss:71448kB, shmem-rss:16520kB, UID:1000 pgtables:64628kB oom_score_adj:0
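
If it helps anyone correlate the crashes with the leak, a small sketch (assuming a systemd-based distro where journalctl is available) that follows the kernel log and flags OOM-killer activity as it happens:

# follow kernel messages and print OOM events live
journalctl -k -f | grep -i --line-buffered 'out of memory'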

(screenshot)

mendhak avatar Jul 28 '23 22:07 mendhak

I also encountered the same problem. Is there any way to solve it?

lhw11 avatar Jul 30 '23 03:07 lhw11

Same on Windows. I have 128 GB of swap, which helps a bit, but it still easily shoots up to 80-90 GB.

towardmastered avatar Jul 31 '23 08:07 towardmastered

Same happens to me using libtcmalloc_minimal.so.4 on Linux. It's not a major problem as long as I don't switch models; I haven't tried waiting a long time before switching models. As of right now, it pretty much makes the workflow with the latest Stable Diffusion XL model (base + refiner) really hard to use.

Nan-Do avatar Aug 03 '23 13:08 Nan-Do