[Bug]: Unable to use Lora with unsplit prompts (<76 tokens), tries to allocate 20-30GB of VRAM
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
When using an unsplit prompt (<76 tokens) together with a Lora, a huge amount of VRAM (20-30 GB) is allocated, causing an out-of-memory error on my 12 GB GPU.
a photo of a person <lora:asoulEileen_v10:0.9>
OutOfMemoryError: CUDA out of memory. Tried to allocate 28.97 GiB
Forcing the prompt to split allows generation to work without issue (two or more BREAKs are required).
a photo of a person BREAK BREAK <lora:asoulEileen_v10:0.9>
Time taken: 4m 40.96s Torch active/reserved: 9861/10760 MiB, Sys VRAM: 11927/12028 MiB (99.16%)
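For context on the size of the failed allocation: the einsum in the traceback below (`s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k)`) materializes an attention-score tensor whose memory grows with batch size and with the square of the sequence length. Here is a rough back-of-the-envelope sketch; the batch/head/resolution numbers are illustrative assumptions based on my settings, not measured values:

```python
def attn_scores_gib(batch_heads: int, q_len: int, k_len: int, dtype_bytes: int = 2) -> float:
    # Size in GiB of the (b, i, j) tensor produced by
    # einsum('b i d, b j d -> b i j', q, k).
    return batch_heads * q_len * k_len * dtype_bytes / 1024**3

# Illustrative assumptions: batch 12, cond+uncond doubled, 8 attention heads,
# a hires-fix latent of roughly 128x80 = 10240 positions, fp16 (2 bytes).
print(f"{attn_scores_gib(12 * 2 * 8, 128 * 80, 128 * 80):.1f} GiB")  # 37.5 GiB
```

That is in the same ballpark as the 28.97 GiB the error reports, so it looks like the attention slicing is failing to cap the slice size in this case.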
Steps to reproduce the problem
- Use a prompt that is <76 tokens
- Set parameters to use close to VRAM capacity (e.g. resolution, batch size)
- Add a Lora to the prompt
- Generation fails with an out-of-memory error, over-allocating by a huge margin
What should have happened?
- Use a prompt that is <76 tokens
- Set parameters to use close to VRAM capacity (e.g. resolution, batch size)
- Add a Lora to the prompt
- Images are generated without running out of memory
Commit where the problem happens
515bd85a015d2269d9e3c45ce88a0f4f7e965807
What platforms do you use to access the UI?
Linux
What browsers do you use to access the UI?
Google Chrome
Command Line Arguments
--listen --enable-console-prompts
List of extensions
a1111-sd-webui-tagcomplete clip-interrogator-ext custom-diffusion-webui depthmap2mask embedding-inspector model-keyword multi-subject-render openpose-editor sd-dynamic-prompts sd-infinity-grid-generator-script SD-latent-mirroring sd_smartprocess sdweb-merge-board sd-webui-ar sd-webui-controlnet seed_travel shift-attention stable-diffusion-webui stable-diffusion-webui-dataset-tag-editor stable-diffusion-webui-depthmap-script stable-diffusion-webui-images-browser stable-diffusion-webui-inspiration stable-diffusion-webui-prompt-travel stable-diffusion-webui-sonar stable-diffusion-webui-two-shot stable-diffusion-webui-wd14-tagger test_my_prompt training-picker ultimate-upscale-for-automatic1111 video_loopback_for_webui
Console logs
$ ./webui.sh
################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################
################################################################
Running on xyem user
################################################################
################################################################
Repo already cloned, using it as install directory
################################################################
################################################################
Create and activate python venv
################################################################
################################################################
Launching launch.py...
################################################################
Python 3.10.7 (main, Sep 6 2022, 21:22:27) [GCC 12.2.0]
Commit hash: 515bd85a015d2269d9e3c45ce88a0f4f7e965807
Installing requirements for Web UI
Installing imageio-ffmpeg requirement for depthmap script
loading Smart Crop reqs from /home/xyem/sd/extensions/sd_smartprocess/requirements.txt
Checking Smart Crop requirements.
Installing sd-dynamic-prompts requirements.txt
Launching Web UI with arguments: --listen --enable-console-prompts
2023-03-09 00:19:19.735662: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-09 00:19:20.497577: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/xyem/sd/venv/lib/python3.10/site-packages/cv2/../../lib64:
2023-03-09 00:19:20.497668: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/xyem/sd/venv/lib/python3.10/site-packages/cv2/../../lib64:
2023-03-09 00:19:20.497683: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
No module 'xformers'. Proceeding without it.
Error loading script: training_picker.py
Traceback (most recent call last):
File "/home/xyem/sd/modules/scripts.py", line 229, in load_scripts
script_module = script_loading.load_module(scriptfile.path)
File "/home/xyem/sd/modules/script_loading.py", line 11, in load_module
module_spec.loader.exec_module(module)
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/home/xyem/sd/extensions/training-picker/scripts/training_picker.py", line 16, in <module>
from modules.ui import create_refresh_button, folder_symbol
ImportError: cannot import name 'folder_symbol' from 'modules.ui' (/home/xyem/sd/modules/ui.py)
Loading weights [abbb28cb5e] from /home/xyem/sd/models/Stable-diffusion/elysium/Elysium_V1.ckpt
Creating model from config: /home/xyem/sd/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(2175): [snipped for brevity]
Model loaded in 34.0s (load weights from disk: 18.1s, create model: 0.6s, apply weights to model: 1.2s, apply half(): 1.1s, load VAE: 5.1s, move model to device: 0.6s, load textual inversion embeddings: 7.3s).
patched in extra network ui page: deltas
patched in extra network: deltas
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
txt2img: photo of a person BREAK BREAK <lora:asoulEileen_v10:0.9>
100%|██████████| 25/25 [00:46<00:00, 1.85s/it]
100%|██████████| 25/25 [02:40<00:00, 6.41s/it]
Total progress: 100%|██████████| 50/50 [04:31<00:00, 5.44s/it]
txt2img: photo of a person <lora:asoulEileen_v10:0.9>
100%|██████████| 25/25 [00:35<00:00, 1.42s/it]
0%|          | 0/25 [00:01<?, ?it/s]
Error completing request
Arguments: ('task(s9zw7vx3stn9go1)', 'photo of a person <lora:asoulEileen_v10:0.9> ', 'EasyNegative', [], 25, 0, True, False, 1, 12, 7, -1.0, -1.0, 0, 0, 0, False, 512, 320, True, 0.7, 2, 'Latent', 0, 600, 960, [], 0, 0, 0, 0, 0, 0.25, False, 'keyword prompt', 'keyword1, keyword2', 'None', 'textual inversion first', 'None', '0.7', 'None', False, False, 1, False, False, False, 1.1, 1.5, 100, 0.7, False, False, True, False, False, 0, 'Gustavosta/MagicPrompt-Stable-Diffusion', '', False, 'none', 'None', 1, None, False, 'Scale to Fit (Inner Fit)', False, False, 64, 64, 64, 1, False, False, '1:1,1:2,1:2', '0:0,0:0,0:1', '0.2,0.8,0.8', 20, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0, '', 5, 24, 12.5, 1000, 'DDIM', 0, 64, 64, '', 64, 7.5, 0.42, 'DDIM', 64, 64, 1, 0, 92, True, True, True, False, False, False, 'midas_v21_small', False, True, False, True, True, 'Create in UI', False, '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', False, 4.0, '', 10.0, 'Linear', 3, False, True, 30.0, True, False, False, 0, 0.0, 'Lanczos', 1, True, 10.0, True, 30.0, True, 0.0, 'Lanczos', 1, 0, 0, 512, 512, False, False, True, True, True, False, False, 1, False, False, 2.5, 4, 0, False, 0, 1, False, False, 'u2net', False, False, False, False, '{inspiration}', None, 'linear', 'lerp', 'token', 'random', '30', 'fixed', 1, '8', None, 'Lanczos', 2, 0, 0, 'mp4', 10.0, 0, '', True, False, False, 'Euler a', 0.95, 0.75, 'zero', 'pos', 'linear', 0.01, 0.0, 0.75, None, 'Lanczos', 1, 0, 0, 'Positive', 0, ', ', 'Generate and always save', 32) {}
Traceback (most recent call last):
File "/home/xyem/sd/modules/call_queue.py", line 56, in f
res = list(func(*args, **kwargs))
File "/home/xyem/sd/modules/call_queue.py", line 37, in f
res = func(*args, **kwargs)
File "/home/xyem/sd/modules/txt2img.py", line 56, in txt2img
processed = process_images(p)
File "/home/xyem/sd/modules/processing.py", line 486, in process_images
res = process_images_inner(p)
File "/home/xyem/sd/modules/processing.py", line 632, in process_images_inner
samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
File "/home/xyem/sd/modules/processing.py", line 902, in sample
samples = self.sampler.sample_img2img(self, samples, noise, conditioning, unconditional_conditioning, steps=self.hr_second_pass_steps or self.steps, image_conditioning=image_conditioning)
File "/home/xyem/sd/modules/sd_samplers_kdiffusion.py", line 322, in sample_img2img
samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
File "/home/xyem/sd/modules/sd_samplers_kdiffusion.py", line 225, in launch_sampling
return func()
File "/home/xyem/sd/modules/sd_samplers_kdiffusion.py", line 322, in <lambda>
samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/xyem/sd/repositories/k-diffusion/k_diffusion/sampling.py", line 145, in sample_euler_ancestral
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/modules/sd_samplers_kdiffusion.py", line 117, in forward
x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/k-diffusion/k_diffusion/external.py", line 112, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "/home/xyem/sd/repositories/k-diffusion/k_diffusion/external.py", line 138, in get_eps
return self.inner_model.apply_model(*args, **kwargs)
File "/home/xyem/sd/modules/sd_hijack_utils.py", line 17, in <lambda>
setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
File "/home/xyem/sd/modules/sd_hijack_utils.py", line 28, in __call__
return self.__orig_func(*args, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 858, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 1329, in forward
out = self.diffusion_model(x, t, context=cc)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/extensions/sd-webui-controlnet/scripts/hook.py", line 190, in forward2
return forward(*args, **kwargs)
File "/home/xyem/sd/extensions/sd-webui-controlnet/scripts/hook.py", line 160, in forward
h = module(h, emb, context)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/openaimodel.py", line 84, in forward
x = layer(x, context)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/attention.py", line 324, in forward
x = block(x, context=context[i])
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/attention.py", line 259, in forward
return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/util.py", line 114, in checkpoint
return CheckpointFunction.apply(func, len(inputs), *args)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/util.py", line 129, in forward
output_tensors = ctx.run_function(*ctx.input_tensors)
File "/home/xyem/sd/repositories/stable-diffusion-stability-ai/ldm/modules/attention.py", line 262, in _forward
x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyem/sd/modules/sd_hijack_optimizations.py", line 127, in split_cross_attention_forward
s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k)
File "/home/xyem/sd/venv/lib/python3.10/site-packages/torch/functional.py", line 378, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.97 GiB (GPU 0; 11.75 GiB total capacity; 3.87 GiB already allocated; 3.29 GiB free; 7.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Additional information
This happens with other Loras, not just the one given in the example.
Just as an additional note, I can generate images without a split prompt provided I reduce the batch size significantly (e.g. to 4).
Just tested adding --opt-split-attention-v1 as suggested in #8409 and it does indeed "fix the issue", allowing for the expected batch size (12) with an unsplit prompt. Hope this helps isolate the cause.
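For anyone else hitting this, here is a minimal, self-contained sketch of the general idea behind a v1-style optimization: compute attention in slices over the query dimension so that only a bounded slice of the score matrix is live at once. This is an illustration of the technique, not the actual code in modules/sd_hijack_optimizations.py, and the chunk size is an arbitrary assumption:

```python
import torch

def chunked_attention(q, k, v, chunk=1024):
    """Attention computed in query-dimension slices: only a
    (batch, chunk, k_len) score tensor exists at any time,
    instead of the full (batch, q_len, k_len) matrix.
    Assumes q, k, v share the same head dimension."""
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[1], chunk):
        s = torch.einsum('b i d, b j d -> b i j', q[:, i:i + chunk], k) * scale
        out[:, i:i + chunk] = torch.einsum('b i j, b j d -> b i d', s.softmax(dim=-1), v)
    return out
```

Peak score memory drops from batch*q_len*k_len elements to batch*chunk*k_len, which would explain why the v1 path fits in 12 GB here where the default path over-allocates.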
Yeah, use v1; it should be the default, but sadly it isn't.
Closing as stale.