[Bug]: Linux gets unresponsive after several generations (RAM)

Open tzwel opened this issue 2 years ago • 45 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

After several generations, RAM usage skyrockets, making the system unresponsive; a restart is then needed. OS: Manjaro Linux, GPU: RX 6600 XT.

Steps to reproduce the problem

  1. Launch webui
  2. Press Generate several times
  3. Watch what happens with the memory usage, and restart your pc
  4. Repeat

What should have happened?

The system shouldn't crash.

Commit where the problem happens

e0e80050091ea7f58ae17c69f31d1b5de5e0ae20

What platforms do you use to access UI ?

Linux

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

--precision full --no-half --medvram

Additional information, context and logs

I wonder if it can even be reproduced. Is this a memory leak? The RAM usage is normal until some random point at which it decides to crash the system. I observed it going up very slowly with each generation, but the increase was negligible, around 0.5-1%. I would upload a screenshot, but usage is basically at 99% when it happens.

Update: the terminal sometimes just closes instead of the whole system crashing.

tzwel avatar Jan 17 '23 14:01 tzwel

Are you switching models at all?

westmancurtis avatar Jan 17 '23 16:01 westmancurtis

Are you switching models at all?

not at all

tzwel avatar Jan 17 '23 19:01 tzwel

Yeah, I've been having this issue for several weeks now, ever since the new gradio update.

Right now I'm running --share --xformers --opt-channelslast --allow-code --enable-insecure-extension-access --gradio-debug.

I've tried adding and removing --precision full --no-half --no-half-vae --lowram --medvram --opt-split-attention, --allow-code.

It happens on Chrome as well; I'm running it on Google Colab free.

It seems to be fine when running normal-res inference, but after I'm done doing an upres, or after many batch counts, it fails to respond.

Update: before, I had only tried opening gradio in Chrome, but Google Colab was still in Firefox. I've tried running Colab in Microsoft Edge and opening gradio there as well, and the problem seems to have been fixed.

ArcticBeat05 avatar Jan 18 '23 00:01 ArcticBeat05

I have trouble understanding you, please specify: are you on Linux? Are you using the webui locally on your GPU?

I'm about to try switching browsers, but this seems very odd, because Edge runs on Chromium, just like Chrome.

tzwel avatar Jan 18 '23 09:01 tzwel

I'm also experiencing system hangs. I haven't been watching system resources closely, but I will and I'll update this post. Ubuntu 22.04, fresh install, using anything-v4 on a 5700 XT. This is my first day experimenting with the A1111 webui so I can't tell how long this has been happening, but I can tell you I've hung about 4 times in the last 2 hours.

Edit: after playing around all day, I note my system RAM always goes up after each image generation, culminating in a crash once it's all eaten up. I set up a swapfile and that seemed to calm the crashing down at the cost of some hiccups, but I could still crash it if I ate up the swap too. The memory never goes back down, however, until I close the terminal/the program. Toggling a bunch of settings all day did nothing to change that. Then I removed the --medvram argument from startup and noticed my system RAM never went up after consecutive generations. The only other change I made at the same time was stopping images from being automatically saved, so hopefully it's not that and I'm wrong. I assume the swapping the program does to keep VRAM usage low is somehow the culprit? Idk, I'm not a nerd.

As far as I can tell, removing the --medvram argument stops the memory leak. But if that doesn't work, try disabling automatic image saving.

If you need any more info, just respond to this thread or however you call someone's attention on this website.

Dnak-jb avatar Jan 18 '23 16:01 Dnak-jb

are you on Linux? Are you using the webui locally on your GPU?

I'm on Windows 10, but I use Firefox. I used Google Colab's GPU/CPU. Once I opened Google Colab in Chromium (Edge) instead of Firefox, it fixed the problem of the unresponsive "generate" button. I have no idea why that fixed it; I believe it's Firefox-related. This has been broken for me for several weeks, but I didn't think to say anything.

Now my model won't even load in Firefox. I recommend starting to use Chrome/Edge for now.

ArcticBeat05 avatar Jan 18 '23 17:01 ArcticBeat05

This is very weird. I'm observing more RAM usage over time even when I'm not generating anything.

tzwel avatar Jan 18 '23 21:01 tzwel

I found issue #2858, which seems to reference the same problem as mine. I tried downgrading gradio and it seemed to help a little bit, but the problem returned. Now I've removed the --precision full --no-half parameters from webui-user.sh and it seems to work: the RAM is getting clogged up very slowly (if at all), and it doesn't skyrocket after generations.

I won't close the issue until I confirm this fixes the problem for someone else.

tzwel avatar Jan 23 '23 15:01 tzwel

Adding a swapfile seems to make the issue less annoying:

sudo swapoff -a
sudo dd if=/dev/zero of=/swapfile bs=1M count=8192
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Then add /swapfile none swap sw 0 0 to /etc/fstab.

To check that it worked:

grep SwapTotal /proc/meminfo
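
One way to do the fstab step from the shell (just a sketch; double-check the path before appending to /etc/fstab):

# append the swap entry so it survives reboots, then confirm the swap is active
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
swapon --show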

tzwel avatar Feb 02 '23 05:02 tzwel

I'm also experiencing this

Done without changing the model or adding any embeds/hypernets/loras to the prompt. Total before the first generation: 17279676K

Generations:
1: 21838172K
2: 23507288K
3: 23553092K
4: 23647104K
5: 23653284K
6: 24031884K
7: 26616812K
8: 29185140K
9: 26011716K
10: 26501696K
11: 29935684K (this one had a NaN VAE exception before it finished)
12: 26752832K
13: 28358212K
14: 28398896K
15: KILLED (OOM)

It seems to only do this once you click Generate. If I do a batch of 15 images it only goes up slightly compared to a single image.

Edit:

Specs:
OS: Artix Linux, kernel 6.1.7-zen1-1-zen
RAM: 16GB (plus a 4GB swap)
GPU: RX 6400 XT 4GB
Python: 3.10.8
torch: 1.13.1+rocm5.2
Launch opts: --medvram
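
(For reference, one way to watch whether the webui's resident memory climbs between generations — a rough sketch, assuming the process was started via launch.py; adjust the pgrep pattern to your setup:)

# log the webui process's resident memory (KiB) every 10 seconds
PID=$(pgrep -f launch.py | head -n 1)
while kill -0 "$PID" 2>/dev/null; do
    printf '%s %s KiB\n' "$(date +%T)" "$(ps -o rss= -p "$PID")"
    sleep 10
done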

lsaa avatar Feb 03 '23 17:02 lsaa

Is it possible that this memory leak is coming from the hashing code? I haven't looked into it deeply, but I'm on gradio 3.16.2, and after adding --no-hashing (which was added recently) I'm having fewer problems, I think.
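
(If you want to try this, the flag goes into COMMANDLINE_ARGS in webui-user.sh — a minimal sketch, assuming no other arguments are needed:)

# webui-user.sh: skip checkpoint hashing when loading models
export COMMANDLINE_ARGS="--no-hashing"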

claychinasky avatar Feb 05 '23 00:02 claychinasky

Updated and tried --no-hashing; it seems better but still runs out of memory. The behavior is slightly different this time: it stays at a constant amount of memory between generations, but after a while it starts going up like it used to.

lsaa avatar Feb 05 '23 03:02 lsaa

I am having the same issue. Running without --medvram, I am not noticing an increase in used RAM on my system, so it could be that the way the system transfers data back and forth between system RAM and VRAM is failing to clear out the RAM as it goes. I am also on Linux and have not tested on Windows.

killacan avatar Feb 05 '23 18:02 killacan

I am having the same issue. Running without --medvram, I am not noticing an increase in used RAM on my system, so it could be that the way the system transfers data back and forth between system RAM and VRAM is failing to clear out the RAM as it goes. I am also on Linux and have not tested on Windows.

I'm running without --medvram and never used that argument, but I'm still having a leak issue. It may be a separate issue as well. I'm on Ubuntu 22.04, gradio 3.4.1 and --no-hashing; these are the settings that at least give me minimal leaking.

claychinasky avatar Feb 05 '23 19:02 claychinasky

OK, so I've been trying to pinpoint what causes the memory to jump, and I found a few things out. I'm on ea9bd9f.

without-no-hashing.txt with-no-hashing.txt
Without --no-hashing it's better than it used to be but still increases over time. However, after the first OOM kill I decided to try something out: on each generation I swap out a TI embed. It seems like that fills up memory a lot faster. Honestly, it could be unrelated since I've had it leak without using TI embeds, but having it go up 1.5GB after loading an embed might be a bug.

lsaa avatar Feb 05 '23 20:02 lsaa

I can confirm that, in my case, the problem lies in --medvram.

I'm running without --medvram and never used that argument, but I'm still having a leak issue. It may be a separate issue as well.

might be

tzwel avatar Feb 09 '23 08:02 tzwel

I'm having this problem as well. I use --lowvram and I can generate up to 3-4 images before my desktop crashes. In my case I can't run AUTOMATIC1111 without the --lowvram argument, so I can't test whether that's the problem.

notdelicate avatar Feb 09 '23 12:02 notdelicate

Running without --medvram fixed it for me as well.

Edit: tested it a bit more. It seems to be very stable; however, I can only generate smaller pics due to not having --medvram.

lsaa avatar Feb 09 '23 12:02 lsaa

How do you use your VAEs? I might be onto something: I put them in the VAE directory and I'm not noticing sudden spikes anymore. This could be a coincidence, I'll test it more.

tzwel avatar Feb 12 '23 23:02 tzwel

The RAM usage goes up VERY SLOWLY now. I think I am close to finding the cause, but I will need to verify it.

tzwel avatar Feb 13 '23 00:02 tzwel

How do you use your VAEs? I might be onto something: I put them in the VAE directory and I'm not noticing sudden spikes anymore. This could be a coincidence, I'll test it more.

I tested it, and changing to the VAE folder might have done something, but it still crashed after 78 generations. Without loading any embeds or loras I used to get around 55 generations, so either this is an outlier or it actually made a difference.

lsaa avatar Feb 13 '23 02:02 lsaa

I'm starting to think this issue might be related to the graphics driver / pytorch / xformers / kernel; this is Linux and Nvidia, after all. (I'm on 525, the latest.) This might well be a separate issue too, because I have used some other repos that use pytorch and xformers to generate, and after a while my swap is filled; I've had to reset the swap a few times.

claychinasky avatar Feb 13 '23 02:02 claychinasky

I'm starting to think this issue might be related to the graphics driver / pytorch / xformers / kernel; this is Linux and Nvidia, after all. (I'm on 525, the latest.) This might well be a separate issue too, because I have used some other repos that use pytorch and xformers to generate, and after a while my swap is filled; I've had to reset the swap a few times.

I'm on AMD, but you bring up a good point. I'll try to use something else like ComfyUI to see if it also causes RAM buildup.

lsaa avatar Feb 13 '23 02:02 lsaa

It crashed and logged me out now; I don't know why it does that, but after logging back in the issue seems to be less annoying.

tzwel avatar Feb 13 '23 12:02 tzwel

Details about my own testing of the memory leak on my system

  • Env: Linux (Ubuntu), miniconda, Python 3.10.8, Torch 1.13.1+cu117, currently on commit 3715ece0; Intel CPU, RTX 3090
  • Has been happening for at least the past couple months, if not since always.
  • RAM leak, not VRAM. Monitored via Linux top
  • Happens both with and without xformers
  • Happens without any extensions
  • Happens when launching with a basically empty commandline_args
  • Memory goes up if/when I: change model, then generate image.
    • It does not go up if I change models without generating images
    • It does not go up if I generate multiple images with one model
  • Goes up at a rate roughly equivalent to the size of the model (i.e. a few gigs per model switch; some models raise it higher than others)
  • Happens with both ckpt and safetensors models
  • Happens even with SAFETENSORS_FAST_GPU=1
  • Happens without any embeddings loaded
  • Happens both with a manual VAE and with VAE set to Automatic (I don't have .vae.pt files on models so this is equivalent to None)
  • The relevant memory usage is not visible to tracemalloc, implying it's in the lower level Torch libs(?)
  • Most intriguingly, does not happen when using the XYZ Plot script's Checkpoint name option to go through a list of models.

I suspect the root of the issue is models loaded into torch are remaining loaded in system RAM after switching away from them.

I suspect the secret to locating the source of the bug lies in investigating what that XYZ plot script does differently from normal generations. Perhaps it bypasses some stage of processing somewhere?
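
(To sanity-check the model-switch observation above, one quick way to snapshot resident memory before and after a switch — a sketch, again assuming the webui was started via launch.py:)

# run once, switch models and generate, then run again and compare VmRSS
PID=$(pgrep -f launch.py | head -n 1)
grep -E 'VmRSS|VmSwap' "/proc/$PID/status"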

mcmonkey4eva avatar Feb 13 '23 13:02 mcmonkey4eva

I suspect the secret to locating the source of the bug lies in investigating what that XYZ plot script does differently from normal generations. Perhaps it bypasses some stage of processing somewhere?

Yesterday I was testing large batch-count generations. I ran out of memory at the very end of a 100 batch count, 2 batch size gen (no XYZ plot), exactly when it should have been generating the txt2img grid; I can confirm that all 200 pics generated successfully and the grid image is not there. With the XYZ plot, the grid is not generated in the usual way and the only image added to the result gallery is the plot itself.

I'm currently trying to see if I can avoid the leak on my machine by using the XYZ plot; I will probably edit later.

Edit: I still have a leak while using only the XYZ plot script to generate images. No embeds/loras/hypernets or any extensions that aren't built in. The only arg is --medvram.

lsaa avatar Feb 13 '23 17:02 lsaa

It's quite likely there are multiple different memory leaks going on, or multiple variants of one root leak.

I note that the leak I narrowed down myself relates to model loading, and the --medvram and --lowvram arguments cause model data to be loaded and unloaded repeatedly while running, which could well be the same root cause with different symptoms.

If the root leak is in the Torch internal code that transfers data between CPU and GPU, that would perfectly explain why medvram/lowvram seem to make it worse, and also why --precision full makes it worse too.

mcmonkey4eva avatar Feb 14 '23 16:02 mcmonkey4eva

OK, so I'm also pretty confident it's an issue in the torch backend. I tried out the fix from https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/6722 and it's running perfectly. Note for Arch users: gperftools is built against GLIBCXX 3.4.30 and the Arch repos are behind, so get an older version from the archives. gperftools 2.9 works for me.

lsaa avatar Feb 15 '23 21:02 lsaa

That thread fixed it!

sudo apt install libgoogle-perftools-dev, then add export LD_PRELOAD=libtcmalloc.so in webui-user.sh.

I'm now able to repeat my earlier test, and memory grows to 33% of available RAM and then stops growing.
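
For reference, the relevant lines (a sketch; the exact library filename can vary by distro, so check what the linker can find first):

# confirm the tcmalloc library name known to the dynamic linker
ldconfig -p | grep tcmalloc
# then in webui-user.sh: preload tcmalloc in place of glibc's allocator
export LD_PRELOAD=libtcmalloc.so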

mcmonkey4eva avatar Feb 16 '23 03:02 mcmonkey4eva