
[Bug]: CUDA fragmentation issues.

Open DarkAlchy opened this issue 2 years ago • 3 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

CUDA runs out of memory.

Steps to reproduce the problem

  1. Go to ....
  2. Press ....
  3. ...

What should have happened?

It should clean itself up.

Commit where the problem happens

Today

What platforms do you use to access UI ?

W10

What browsers do you use to access the UI ?

Vivaldi

Command Line Arguments

--medvram --port 9000 --force-enable-xformers --vae-path "models\VAE\vae-ft-mse-840000-ema-pruned.pt"

Additional information, context and logs

I keep getting this and have to shut auto1111 down and then restart it, after which it works. The browser web page is left as is so I can rerun from where I left off.

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 5.28 GiB already allocated; 0 bytes free; 5.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Lately this fragments my memory, but if I close it down (Ctrl-C) and restart it, I am good for a bit.
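The error text above points at max_split_size_mb and PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of what acting on that hint could look like, assuming the variable is set before the process touches CUDA (the 128 MiB value is only an illustration, not a recommendation from this thread):

    # Sketch only: apply the allocator option the error message refers to.
    # PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is first
    # used, so it has to be set before the first CUDA allocation in the process.
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

    import torch
    _ = torch.zeros(1, device="cuda")     # allocator initializes here with the option applied
    print(torch.cuda.memory_reserved(0))  # the reserved-vs-allocated gap is where fragmentation lives

For the webui itself this would more typically be set as an environment variable before launch (e.g. in webui-user.bat), which has the same effect; whether it actually helps depends on the workload.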

DarkAlchy avatar Nov 07 '22 20:11 DarkAlchy

Wanted to add it seems to happen when I do img2img.

I dug a bit more and see it is due to gradients being used when it hits me (img2img or txt2img).

File "D:\stable-diffusion-webui\modules\ui.py", line 185, in f res = list(func(*args, **kwargs)) File "D:\stable-diffusion-webui\webui.py", line 54, in f res = func(*args, **kwargs) File "D:\stable-diffusion-webui\modules\img2img.py", line 139, in img2img processed = process_images(p) File "D:\stable-diffusion-webui\modules\processing.py", line 423, in process_images res = process_images_inner(p) File "D:\stable-diffusion-webui\modules\processing.py", line 508, in process_images_inner uc = prompt_parser.get_learned_conditioning(shared.sd_model, len(prompts) * [p.negative_prompt], p.steps) File "D:\stable-diffusion-webui\modules\prompt_parser.py", line 138, in get_learned_conditioning conds = model.get_learned_conditioning(texts) File "D:\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 558, in get_learned_conditioning c = self.cond_stage_model(c) File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "D:\stable-diffusion-webui\modules\sd_hijack.py", line 338, in forward z1 = self.process_tokens(tokens, multipliers) File "D:\stable-diffusion-webui\extensions\aesthetic-gradients\aesthetic_clip.py", line 237, in call optimizer.step() File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\optim\optimizer.py", line 113, in wrapper return func(*args, **kwargs) File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\optim\adam.py", line 144, in step state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format) RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 5.28 GiB already allocated; 0 bytes free; 5.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

DarkAlchy avatar Nov 07 '22 20:11 DarkAlchy

Does it, by any chance, happen consistently on a 1024 x 512 txt2img, whereas a 512 x 512 tests okay?

jet3004 avatar Nov 08 '22 04:11 jet3004

Does it, by any chance, happen consistently on a 1024 x 512 txt2img, whereas a 512 x 512 tests okay?

No, this happens regardless of the resolution I pick, and it only started recently.

DarkAlchy avatar Nov 08 '22 14:11 DarkAlchy

Status is still flagged as "bug report", not yet "bug". CUDA OOM remains an ongoing challenge due to memory leaks in the underlying torch stack. Additionally, the version is long outdated. I propose closing this issue and filing a new report if the OOM problem persists in the current version.

TheOnlyHolyMoly avatar Jun 27 '23 10:06 TheOnlyHolyMoly

Why file anything, since it is all just a cluster eff? The issue remains, to the point that there is no enjoyment for me, at least, having to constantly restart to clear out the lost memory. Someone needs to figure out what is going on, OR does this speak to the issue of open source and the old "too many cooks spoil the broth" syndrome? If it does, then I can see OSS being a huge negative.

DarkAlchy avatar Jun 27 '23 11:06 DarkAlchy

I agree to the point that OOM is annoying; I am suffering from it myself. People whom I consider very knowledgeable in that area have explained to me that some of the big issues come from underlying components (beyond this repo's reach) that contribute to the OOM experience. I also agree that it might be beneficial if the repo offered some general guidance on which OOM incidents can be investigated further here and which should be directed upstream (e.g. to the PyTorch team). However, I do not expect a major change in the OOM situation any time soon, perhaps only with different backends or backend updates coming up. That would give quicker clarity on whether users can expect a solution for their OOM issue or just have to live with it until the backend changes. Would you agree?

I guess I was just saying that it is of limited use to have an issue open since November, stuck in process without any follow-up. If the issue is pressing and important, it would be better to refile it against a recent version.

TheOnlyHolyMoly avatar Jun 27 '23 12:06 TheOnlyHolyMoly

That is why I closed this: it was ancient, and I now know it is a KNOWN PyTorch issue. I was hoping PyTorch 2 would solve it, but it only became worse for me. Since I have 6 GB, it happens to me much faster than to someone with more VRAM, though it will eventually happen to them as well IF they use it enough.

DarkAlchy avatar Jun 27 '23 13:06 DarkAlchy

Here is something I hope you'll love and appreciate: https://github.com/vladmandic/automatic/discussions/1357 It'll give you very solid and detailed background info.

TheOnlyHolyMoly avatar Jun 27 '23 17:06 TheOnlyHolyMoly

The thing is, this is a GPU memory leak, as I have never run into a CPU memory leak, even when I had 16 GB of RAM (I have 48 now).
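A small, hypothetical diagnostic for telling an actual VRAM leak apart from allocator fragmentation on one's own setup (not part of the webui; the function name is made up):

    import torch

    def vram_snapshot(step):
        # Call between generations. Steadily growing "allocated" points at a leak
        # (tensors being kept alive); "reserved" sitting far above "allocated"
        # points at fragmentation inside PyTorch's caching allocator.
        alloc = torch.cuda.memory_allocated() / 2**20
        reserved = torch.cuda.memory_reserved() / 2**20
        print(f"step {step}: {alloc:.0f} MiB allocated / {reserved:.0f} MiB reserved")
        # torch.cuda.memory_summary() prints a full per-block breakdown if more detail is needed.

If "allocated" keeps climbing across identical generations, something upstream is holding references; if only "reserved" balloons, the max_split_size_mb hint from the original error is the more relevant knob.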

DarkAlchy avatar Jun 27 '23 18:06 DarkAlchy

Vlad's response addressed memory leaks of different kinds across the different components. You'll see that he added a section on VRAM at the end for the sake of completeness, as the scope of my question was too narrow. You are always welcome to switch to SD.next and file an OOM-related issue there, or just ask in advance in the discussions section whether anyone has a reliable configuration for 6 GB of VRAM. The community is pretty active and supportive; I am sure you'll get some responses.

TheOnlyHolyMoly avatar Jun 27 '23 19:06 TheOnlyHolyMoly

SD.next? I have never heard of that one.

DarkAlchy avatar Jun 27 '23 19:06 DarkAlchy

https://github.com/vladmandic/automatic

TheOnlyHolyMoly avatar Jun 27 '23 21:06 TheOnlyHolyMoly

Oh, is that what he is calling it now? Thanks.

DarkAlchy avatar Jun 27 '23 21:06 DarkAlchy