[Bug]: Unable to use Lora with unsplit prompts (<76 tokens), tries to allocate 20-30GB of VRAM
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
When using an unsplit prompt (<76 tokens) together with a Lora, a huge amount of VRAM (20-30 GB) is allocated, causing an out-of-memory error on my 12 GB GPU.
a photo of a person <lora:asoulEileen_v10:0.9>
OutOfMemoryError: CUDA out of memory. Tried to allocate 28.97 GiB
Forcing the prompt to split allows generation to work without issue (two or more BREAKs are required).
a photo of a person BREAK BREAK <lora:asoulEileen_v10:0.9>
Time taken: 4m 40.96s Torch active/reserved: 9861/10760 MiB, Sys VRAM: 11927/12028 MiB (99.16%)
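For context on the size of the failed allocation: the einsum in the traceback below (`s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k)`) materializes an attention-score tensor whose memory grows with batch size and with the square of the sequence length. Here is a rough back-of-the-envelope sketch; the batch/head/resolution numbers are illustrative assumptions based on my settings, not measured values:

```python
def attn_scores_gib(batch_heads: int, q_len: int, k_len: int, dtype_bytes: int = 2) -> float:
    # Size in GiB of the (b, i, j) tensor produced by
    # einsum('b i d, b j d -> b i j', q, k).
    return batch_heads * q_len * k_len * dtype_bytes / 1024**3

# Illustrative assumptions: batch 12, cond+uncond doubled, 8 attention heads,
# a hires-fix latent of roughly 128x80 = 10240 positions, fp16 (2 bytes).
print(f"{attn_scores_gib(12 * 2 * 8, 128 * 80, 128 * 80):.1f} GiB")  # 37.5 GiB
```

That is in the same ballpark as the 28.97 GiB the error reports, so it looks like the attention slicing is failing to cap the slice size in this case.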
Steps to reproduce the problem
- Use a prompt that is <76 tokens
- Set parameters to use close to VRAM capacity (e.g. resolution, batch size)
- Add a Lora to the prompt
- Generation fails with an out-of-memory error, over-allocating by a huge margin
What should have happened?
- Use a prompt that is <76 tokens
- Set parameters to use close to VRAM capacity (e.g. resolution, batch size)
- Add a Lora to the prompt
- Images are generated without running out of memory
Commit where the problem happens
515bd85a015d2269d9e3c45ce88a0f4f7e965807
What platforms do you use to access the UI?
Linux
What browsers do you use to access the UI?
Google Chrome
Command Line Arguments
--listen --enable-console-prompts
List of extensions
a1111-sd-webui-tagcomplete clip-interrogator-ext custom-diffusion-webui depthmap2mask embedding-inspector model-keyword multi-subject-render openpose-editor sd-dynamic-prompts sd-infinity-grid-generator-script SD-latent-mirroring sd_smartprocess sdweb-merge-board sd-webui-ar sd-webui-controlnet seed_travel shift-attention stable-diffusion-webui stable-diffusion-webui-dataset-tag-editor stable-diffusion-webui-depthmap-script stable-diffusion-webui-images-browser stable-diffusion-webui-inspiration stable-diffusion-webui-prompt-travel stable-diffusion-webui-sonar stable-diffusion-webui-two-shot stable-diffusion-webui-wd14-tagger test_my_prompt training-picker ultimate-upscale-for-automatic1111 video_loopback_for_webui
Console logs
$ ./webui.sh
################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################
################################################################
Running on xyem user
################################################################
################################################################
Repo already cloned, using it as install directory
################################################################
################################################################
Create and activate python venv
################################################################
################################################################
Launching launch.py...
################################################################
Python 3.10.7 (main, Sep 6 2022, 21:22:27) [GCC 12.2.0]
Commit hash: 515bd85a015d2269d9e3c45ce88a0f4f7e965807
Installing requirements for Web UI
Installing imageio-ffmpeg requirement for depthmap script
loading Smart Crop reqs from /home/xyem/sd/extensions/sd_smartprocess/requirements.txt
Checking Smart Crop requirements.
Installing sd-dynamic-prompts requirements.txt
Launching Web UI with arguments: --listen --enable-console-prompts
2023-03-09 00:19:19.735662: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-09 00:19:20.497577: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/xyem/sd/venv/lib/python3.10/site-packages/cv2/../../lib64:
2023-03-09 00:19:20.497668: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/xyem/sd/venv/lib/python3.10/site-packages/cv2/../../lib64:
2023-03-09 00:19:20.497683: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
No module 'xformers'. Proceeding without it.
Error loading script: training_picker.py
Traceback (most recent call last):
File "/home/xyem/sd/modules/scripts.py", line 229, in load_scripts
script_module = script_loading.load_module(scriptfile.path)
File "/home/xyem/sd/modules/script_loading.py", line 11, in load_module
module_spec.loader.exec_module(module)
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/home/xyem/sd/extensions/training-picker/scripts/training_picker.py", line 16, in <module>
from modules.ui import create_refresh_button, folder_symbol
ImportError: cannot import name 'folder_symbol' from 'modules.ui' (/home/xyem/sd/modules/ui.py)
Loading weights [abbb28cb5e] from /home/xyem/sd/models/Stable-diffusion/elysium/Elysium_V1.ckpt
Creating model from config: /home/xyem/sd/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(2175): [snipped for brevity]
Model loaded in 34.0s (load weights from disk: 18.1s, create model: 0.6s, apply weights to model: 1.2s, apply half(): 1.1s, load VAE: 5.1s, move model to device: 0.6s, load textual inversion embeddings: 7.3s).
patched in extra network ui page: deltas
patched in extra network: deltas
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
txt2img: photo of a person BREAK BREAK <lora:asoulEileen_v10:0.9>
100%|██████████| 25/25 [00:46<00:00, 1.85s/it]
100%|██████████| 25/25 [02:40<00:00, 6.41s/it]
Total progress: 100%|██████████| 50/50 [04:31<00:00, 5.44s/it]
txt2img: photo of a person <lora:asoulEileen_v10:0.9>
100%|██████████| 25/25 [00:35<00:00, 1.42s/it]
0%|          | 0/25 [00:01<?, ?it/s]
Error completing request
Arguments: ('task(s9zw7vx3stn9go1)', 'photo of a person <lora:asoulEileen_v10:0.9> ', 'EasyNegative', [], 25, 0, True, False, 1, 12, 7, -1.0, -1.0, 0, 0, 0, False, 512, 320, True, 0.7, 2, 'Latent', 0, 600, 960, [], 0, 0, 0, 0, 0, 0.25, False, 'keyword prompt', 'keyword1, keyword2', 'None', 'textual inversion first', 'None', '0.7', 'None', False, False, 1, False, False, False, 1.1, 1.5, 100, 0.7, False, False, True, False, False, 0, 'Gustavosta/MagicPrompt-Stable-Diffusion', '', False, 'none', 'None', 1, None, False, 'Scale to Fit (Inner Fit)', False, False, 64, 64, 64, 1, False, False, '1:1,1:2,1:2', '0:0,0:0,0:1', '0.2,0.8,0.8', 20, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0, '', 5, 24, 12.5, 1000, 'DDIM', 0, 64, 64, '', 64, 7.5, 0.42, 'DDIM', 64, 64, 1, 0, 92, True, True, True, False, False, False, 'midas_v21_small', False, True, False, True, True, 'Create in UI', False, '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', False, 4.0, '', 10.0, 'Linear', 3, False, True, 30.0, True, False, False, 0, 0.0, 'Lanczos', 1, True, 10.0, True, 30.0, True, 0.0, 'Lanczos', 1, 0, 0, 512, 512, False, False, True, True, True, False, False, 1, False, False, 2.5, 4, 0, False, 0, 1, False, False, 'u2net', False, False, False, False, '{inspiration}', None, 'linear', 'lerp', 'token', 'random', '30', 'fixed', 1, '8', None, 'Lanczos', 2, 0, 0, 'mp4', 10.0, 0, '', True, False, False, 'Euler a', 0.95, 0.75, 'zero', 'pos', 'linear', 0.01, 0.0, 0.75, None, 'Lanczos', 1, 0, 0, 'Positive', 0, ', ', 'Generate and always save', 32) {}
Traceback (most recent call last):
File "/home/xyem/sd/modules/call_queue.py", line 56, in f
res = list(func(*args, **kwargs))
File "/home/xyem/sd/modules/call_queue.py", line 37, in f
res = func(*args, **kwargs)
File "/home/xyem/sd/modules/txt2img.py", line 56, in txt2img
processed = process_images(p)
File "/home/xyem/sd/modules/processing.py", line 486, in process_images
res = process_images_inner(p)
File "/home/xyem/sd/modules/processing.py", line 632, in process_images_inner
samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
File "/home/xyem/sd/modules/processing.py", line 902, in sample
samples = self.sampler.sample_img2img(self, samples, noise, conditioning, unconditional_conditioning, steps=self.hr_second_pass_steps or self.steps, image_conditioning=image_conditioning)
File "/home/xyem/sd/modules/sd_samplers_kdiffusion.py", line 322, in sample_img2img
samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
File "/home/xyem/sd/modules/sd_samplers_kdiffusion.py", line 225, in launch_sampling
return func()
File "/home/xyem/sd/modules/sd_samplers_kdiffusion.py", line 322, in <lambda>
samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/xyem/sd/repositories/k-diffusion/k_diffusion/sampling.py", line 145, in sample_euler_ancestral
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/modules/sd_samplers_kdiffusion.py", line 117, in forward
x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/k-diffusion/k_diffusion/external.py", line 112, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "/home/xyem/sd/repositories/k-diffusion/k_diffusion/external.py", line 138, in get_eps
return self.inner_model.apply_model(*args, **kwargs)
File "/home/xyem/sd/modules/sd_hijack_utils.py", line 17, in <lambda>
setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
File "/home/xyem/sd/modules/sd_hijack_utils.py", line 28, in __call__
return self.__orig_func(*args, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 858, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 1329, in forward
out = self.diffusion_model(x, t, context=cc)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/extensions/sd-webui-controlnet/scripts/hook.py", line 190, in forward2
return forward(*args, **kwargs)
File "/home/xyem/sd/extensions/sd-webui-controlnet/scripts/hook.py", line 160, in forward
h = module(h, emb, context)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/openaimodel.py", line 84, in forward
x = layer(x, context)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/attention.py", line 324, in forward
x = block(x, context=context[i])
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/attention.py", line 259, in forward
return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/util.py", line 114, in checkpoint
return CheckpointFunction.apply(func, len(inputs), *args)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/util.py", line 129, in forward
output_tensors = ctx.run_function(*ctx.input_tensors)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/attention.py", line 262, in _forward
x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/modules/sd_hijack_optimizations.py", line 127, in split_cross_attention_forward
s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/functional.py", line 378, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.97 GiB (GPU 0; 11.75 GiB total capacity; 3.87 GiB already allocated; 3.29 GiB free; 7.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Additional information
This happens with other Loras, not just the one given in the example.
Just as an additional note, I can generate images without a split prompt provided I reduce the batch size significantly (e.g. to 4).
Just tested adding --opt-split-attention-v1 as suggested in #8409 and it does indeed "fix the issue", allowing for the expected batch size (12) with an unsplit prompt. Hope this helps isolate the cause.
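For anyone else hitting this, here is a minimal, self-contained sketch of the general idea behind a v1-style optimization: compute attention in slices over the query dimension so that only a bounded slice of the score matrix is live at once. This is an illustration of the technique, not the actual code in modules/sd_hijack_optimizations.py, and the chunk size is an arbitrary assumption:

```python
import torch

def chunked_attention(q, k, v, chunk=1024):
    """Attention computed in query-dimension slices: only a
    (batch, chunk, k_len) score tensor exists at any time,
    instead of the full (batch, q_len, k_len) matrix.
    Assumes q, k, v share the same head dimension."""
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[1], chunk):
        s = torch.einsum('b i d, b j d -> b i j', q[:, i:i + chunk], k) * scale
        out[:, i:i + chunk] = torch.einsum('b i j, b j d -> b i d', s.softmax(dim=-1), v)
    return out
```

Peak score memory drops from batch*q_len*k_len elements to batch*chunk*k_len, which would explain why the v1 path fits in 12 GB here where the default path over-allocates.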
Yeah, use v1; it should be the default, but sadly it isn't.
Closing as stale.