stable-diffusion-webui
[DO NOT MERGE] All perf improvements bundle
DO NOT MERGE THIS PR, merge individual PRs instead. This PR is for users to try out all performance improvements together.
Description
This is a bundle PR of all performance improvement PRs:
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15803
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15804
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15805
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15806
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15816
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15820
How to use
- Checkout this PR
  - In your A1111 repo directory, open a terminal
  - Use the GitHub CLI to check out this PR: `gh pr checkout 15821`
- Add `--precision half` to your command line args if your GPU supports fp16 calculation.

Unpatch the PR
- In your A1111 repo directory, open a terminal and run `git checkout master`
Expected performance improvement
For SDXL, this PR brings performance from 580ms/it down to 280ms/it on my machine. However, this covers only the UNet's denoising steps, not other factors such as VAE encode/decode and saving the image; overall you should expect at least a 20% performance boost.
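As a rough illustration of why a roughly 2x UNet speedup shows up as a smaller end-to-end gain, here is a back-of-envelope sketch; the step count and fixed overhead below are made-up assumptions, not measurements from this PR:

```python
# Back-of-envelope estimate: how a 580 -> 280 ms/it denoising speedup translates
# into an end-to-end gain when VAE decode, image saving, etc. stay unchanged.
steps = 20                      # assumed sampling steps
unet_before = 0.580 * steps     # seconds spent in the UNet before the PR
unet_after = 0.280 * steps      # seconds spent in the UNet with the PR
overhead = 3.0                  # assumed fixed cost (VAE, saving, ...) in seconds

speedup = (unet_before + overhead) / (unet_after + overhead)
print(f"end-to-end speedup: {speedup:.2f}x")  # ~1.7x here; it shrinks as overhead grows
```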
Report issues
Please report any bugs related to this batch of performance improvements to https://github.com/huchenlei/stable-diffusion-webui/issues
My tests on these PRs have limited coverage, so some features might be broken; I would like to get those fixed before merging.
Checklist:
- [ ] I have read contributing wiki page
- [ ] I have performed a self-review of my own code
- [ ] My code follows the style guidelines
- [ ] My code passes tests
You can mark the PR as a draft to prevent an accidental merge
Clicking Generate gives:
Traceback (most recent call last):
File "D:\AI\stable-diffusion-webui\modules\call_queue.py", line 57, in f
res = list(func(*args, **kwargs))
File "D:\AI\stable-diffusion-webui\modules\call_queue.py", line 36, in f
res = func(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
processed = processing.process_images(p)
File "D:\AI\stable-diffusion-webui\modules\processing.py", line 839, in process_images
res = process_images_inner(p)
File "D:\AI\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
File "D:\AI\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
File "D:\AI\stable-diffusion-webui\modules\processing.py", line 1322, in sample
samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
File "D:\AI\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 218, in sample
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
File "D:\AI\stable-diffusion-webui\modules\sd_samplers_common.py", line 272, in launch_sampling
return func()
File "D:\AI\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 218, in <lambda>
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 594, in sample_dpmpp_2m
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\modules\sd_samplers_cfg_denoiser.py", line 237, in forward
x_out = self.inner_model(x_in, sigma_in, cond=make_condition_dict(cond_in, image_cond_in))
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
return self.inner_model.apply_model(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\modules\sd_hijack_utils.py", line 22, in <lambda>
setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
File "D:\AI\stable-diffusion-webui\modules\sd_hijack_utils.py", line 34, in __call__
return self.__sub_func(self.__orig_func, *args, **kwargs)
File "D:\AI\stable-diffusion-webui\modules\sd_hijack_unet.py", line 48, in apply_model
result = orig_func(self, x_noisy.to(devices.dtype_unet), t.to(devices.dtype_unet), cond, **kwargs)
File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1335, in forward
out = self.diffusion_model(x, t, context=cc)
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\modules\sd_unet.py", line 91, in UNetModel_forward
return original_forward(self, x, timesteps, context, *args, **kwargs)
File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 797, in forward
h = module(h, emb, context)
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 86, in forward
x = layer(x)
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\AI\stable-diffusion-webui\extensions\a1111-sd-webui-lycoris\l_networks.py", line 524, in network_Conv2d_forward
return originals.Conv2d_forward(self, input)
File "D:\AI\stable-diffusion-webui\extensions-builtin\Lora\networks.py", line 523, in network_Conv2d_forward
return originals.Conv2d_forward(self, input)
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same
Now what?
In my environment, the SDXL model failed to load because the FP8 weight option was enabled under Settings > Optimizations.
When I disabled it, the SDXL model loaded without problems.
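For reference, the RuntimeError in the traceback above is a plain dtype mismatch between an fp32 activation and fp16 conv weights. Here is a minimal PyTorch sketch of the same failure mode (illustrative only, not the webui code path):

```python
import torch

# An fp16 conv layer fed an fp32 tensor reproduces this class of error.
device = "cuda" if torch.cuda.is_available() else "cpu"
conv = torch.nn.Conv2d(3, 8, 3).to(device, torch.float16)
x = torch.randn(1, 3, 16, 16, device=device)   # fp32 input, i.e. the wrong dtype
try:
    conv(x)
except RuntimeError as e:
    print(e)                                   # dtype-mismatch error, as in the traceback
if device == "cuda":
    out = conv(x.half())                       # casting the input to fp16 resolves it
    print(out.dtype)                           # torch.float16
```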
Getting a similar error to @Gushousekai195 (no LyCORIS; it happens even with all LoRAs disabled). One of these patches is breaking SD 1.5; only SDXL works.
Edit: Narrowed it down to --precision half
SDXL, Nvidia 4090 + Intel 12700K - seeing a 22.04% increase in speed. No (noticeable?) effect on image output.
SD15 generation issue fixed.
I don't see a significant performance boost; only SDXL + 2 ControlNets gets about a 10-15% boost. By the way, I see similar numbers in Forge:
rtx 3060 + 10400f
--medvram-sdxl --xformers --disable-model-loading-ram-optimization
python: 3.11.6 • torch: 2.1.2+cu121 • xformers: 0.0.23.post1
| Test | Non-patched | Patched |
| --- | --- | --- |
| sd1, 1 image | 3.0 sec. | 3.0 sec. |
| sd1, batch_size 10 | 22.8 sec. | 23.0 sec. |
| sd1 + cn canny, depth, batch_size 10 | 38.0 sec. | 37.6 sec. |
| sdxl, 1 image | 17.9 sec. | 17.2 sec. |
| sdxl + cn canny, depth, 1 image | 39.0 sec. | 34.6 sec. |
AnimateDiff + CN Inpaint + SparseCtrl works
Maybe my CPU is a poor match for my GPU, or vice versa. But it's not worse than non-patched, and other users see a boost, so I like this work. 👍🏻 Now I will test it on a 2 GB GPU
I tried to select the best time, but it's definitely slower on the very low vram setup. Maybe it conflicts with some optimizations?
mx150 2gb aka gt 1030
--xformers --lowvram
Optimizations are in the screenshot
python: 3.10.6 • torch: 2.1.2+cu121 • xformers: 0.0.23.post1
| Test | Non-patched | Patched |
| --- | --- | --- |
| sd1 + merged lcm lora | 20.3 sec. | 22.3 sec. |
| sd1 + merged lcm lora + t2ia canny | 22.1 sec. | 22.8 sec. |
| sd1 + merged lcm lora + hiresfix 2x + tiled vae | 2 min. 22.1 sec. | 2 min. 40.7 sec. |
Can you attach traces of your experiment? I am not sure which part of the optimization is affecting low-VRAM performance. You can record a trace following the instructions in https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/716
Running 2 steps should probably be enough.
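For anyone who wants to produce a comparable trace, here is a minimal torch.profiler sketch; `run_two_steps` is a hypothetical stand-in for your actual generation call, and the linked instructions may differ in detail:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def run_two_steps():
    # Hypothetical stand-in: replace this with a real 2-step generation.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1, 4, 64, 64, device=device)
    return torch.nn.functional.avg_pool2d(x, 2)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, profile_memory=True, record_shapes=True) as prof:
    with record_function("model_inference"):   # appears as "model_inference" in the tables below
        run_two_steps()

# Summary table like the ones pasted below, plus a trace file loadable in chrome://tracing / Perfetto.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```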
Okay @huchenlei
sd1 + merged lcm lora + t2ia canny
4 steps
Non-patched:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
model_inference 10.60% 2.483s 100.00% 23.436s 23.436s 0.000us 0.00% 20.588s 20.588s 0 b -3.69 Gb 993.50 Kb -9.40 Gb 1
cudaMemcpyAsync 72.21% 16.922s 72.46% 16.982s 2.585ms 54.251ms 0.27% 54.251ms 8.259us 0 b 0 b 0 b 0 b 6569
aten::to 0.30% 70.688ms 68.23% 15.990s 1.408ms 0.000us 0.00% 6.300s 554.910us 3.66 Gb 13.43 Mb 14.53 Gb 262.38 Mb 11353
aten::_to_copy 0.48% 113.589ms 68.05% 15.949s 1.462ms 0.000us 0.00% 6.335s 580.598us 3.66 Gb 598.63 Kb 14.53 Gb 0 b 10912
aten::copy_ 0.74% 172.485ms 67.06% 15.715s 1.401ms 6.000s 29.97% 6.426s 572.681us 0 b 0 b 0 b 0 b 11221
aten::conv2d 0.05% 11.844ms 9.87% 2.314s 2.738ms 0.000us 0.00% 11.814s 13.981ms 0 b 0 b 2.90 Gb -4.70 Gb 845
aten::item 0.00% 89.000us 9.56% 2.241s 149.386ms 0.000us 0.00% 12.138ms 809.200us 0 b 0 b 0 b 0 b 15
aten::_local_scalar_dense 0.00% 241.000us 9.56% 2.241s 149.380ms 15.000us 0.00% 12.138ms 809.200us 0 b 0 b 0 b 0 b 15
aten::convolution 0.02% 3.958ms 8.77% 2.056s 4.558ms 0.000us 0.00% 6.954s 15.419ms 0 b 0 b 1.97 Gb 0 b 451
aten::_convolution 0.05% 10.801ms 8.75% 2.052s 4.549ms 0.000us 0.00% 6.954s 15.419ms 0 b 0 b 1.97 Gb -4.00 Mb 451
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 23.436s
Self CUDA time total: 20.022s
Patched:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
model_inference 19.16% 9.898s 100.00% 51.653s 51.653s 0.000us 0.00% 40.367s 40.367s 0 b -3.68 Gb 7.24 Mb -18.52 Gb 1
cudaMemcpyAsync 59.11% 30.531s 60.30% 31.144s 4.745ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 6564
aten::to 0.19% 98.962ms 58.74% 30.340s 2.701ms 0.000us 0.00% 7.325s 652.075us 3.66 Gb 522.07 Kb 14.50 Gb 237.55 Mb 11233
aten::_to_copy 0.46% 239.181ms 58.56% 30.249s 2.800ms 0.000us 0.00% 7.349s 680.168us 3.66 Gb 0 b 14.50 Gb 0 b 10804
aten::copy_ 0.71% 365.811ms 57.60% 29.754s 2.677ms 7.471s 18.51% 7.471s 672.260us 0 b 0 b 0 b 0 b 11113
aten::conv2d 0.04% 20.753ms 16.41% 8.474s 10.040ms 0.000us 0.00% 26.478s 31.372ms 0 b 0 b 2.90 Gb -4.70 Gb 844
aten::convolution 0.02% 8.169ms 15.56% 8.038s 17.824ms 0.000us 0.00% 17.953s 39.807ms 0 b 0 b 1.97 Gb 0 b 451
aten::_convolution 0.04% 23.143ms 15.55% 8.030s 17.806ms 0.000us 0.00% 17.953s 39.807ms 0 b 0 b 1.97 Gb -4.00 Mb 451
aten::cudnn_convolution 0.85% 437.998ms 15.42% 7.966s 17.662ms 15.939s 39.48% 15.939s 35.341ms 0 b 0 b 1.97 Gb 1.79 Gb 451
cudaFree 13.98% 7.223s 14.52% 7.499s 299.956ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 25
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 51.653s
Self CUDA time total: 40.368s
For some reason the time difference becomes much bigger with tracing enabled.
It is very strange. Both runs were done about 3 times, and I pasted the last one, to exclude any model loading from disk and ControlNet preprocessing.
NB: this GPU has very slow VRAM; maybe that is related.
Visually the first 2 steps are okay, but the last 2 steps are slower after the patch.
Also I'm attaching trace files trace_non_patched.json.gz trace_patched.json.gz
Use the GitHub CLI to check out this PR: `gh pr checkout 15821`
Just want to add that you can check out any GitHub PR without the GH CLI using standard Git commands:
git fetch origin pull/ID/head:NAME
git checkout NAME
In this case, for example:
git fetch origin pull/15821/head:15821 && git checkout 15821
Just to save someone the install if they don't need the GH client otherwise.
Great job! Merged the remote branch locally, and I'm now seeing faster gens on A1111 than on ComfyUI :)
I just tested it: there is a speed improvement (around 8% on an RTX 3090), but autocast fails and we get black output on SDXL.
--precision half disables autocast, so you should not use the FP8 option in settings either. Casting during inference is a big source of performance overhead.
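To illustrate the casting overhead point, here is a small sketch (assumes a CUDA GPU; not webui code): under autocast the fp32 weights are cast to fp16 on the fly inside the forward pass, while with the model itself converted to fp16 that cast disappears.

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)
    fp32_model = torch.nn.Linear(4096, 4096, device="cuda")   # fp32 weights

    with torch.autocast("cuda", dtype=torch.float16):
        y = fp32_model(x)           # weights are cast to fp16 on the fly under autocast

    fp16_model = fp32_model.half()  # cast once, ahead of time (the idea behind --precision half)
    y = fp16_model(x)               # no casting inside the call
    print(y.dtype)                  # torch.float16 in both cases
```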
this is the only option I set:
@echo off
set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers
call webui.bat
Using sdxl-vae-fp16-fix as the VAE seems to fix the black output. Tested using --precision half
Some checkpoints like AlbedoBaseXL will work as-is.
I don't have extensive numbers, but 512x512 generates noticeably faster, as fast as Forge if not faster: 2 seconds to generate an image with 40 steps using DPM++ 2M SGM Uniform with a regular 1.5 checkpoint, and 15 seconds at 1024x1024 with SDXL, on a 3080 12GB.
I feel like highres fix and img2img are the bigger bottlenecks now, but I don't know how feasible it is to optimize them even further, especially since these fixes also noticeably increased their speed. Maybe on the side of the upscalers, since some are noticeably slower than others just by their nature, but I guess it just is what it is due to hardware.
Also ran into an issue, which might be related to --precision half too:
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
Seems to happen when loading an SD 1.5 checkpoint, then loading an SDXL checkpoint and trying to generate. Loading a different SDXL checkpoint seems to fix it, but then it happens all over again when switching between SD 1.5 and SDXL. This also happens if the checkpoint the UI loads by default is SD 1.5.
Quick test using a 3060, doing a 4-batch at 896x1152 with 2 LORAs at 20 steps, DPM++ 3M Exponential
Forge: 1:20. A1111 with this PR: 0:58
A single image with this PR is around 15 sec. Very nice! I'm going to use this from now on unless some issue comes up. Great work!
--precision half does not work in my case; adding it triggers errors whose tracebacks point into several U-Net hijack extensions, even though I have disabled them.
Main error
File "I:\stable-diffusion-webui-updated\venv\lib\site-packages\torch\functional.py", line 377, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: expected scalar type Half but found Float
I'd love to go back to A1111 from Forge, but the issue where VRAM shoots up extremely on the last step of a generation (I'm assuming it's the VAE decode) makes me not want to touch it anymore. This doesn't happen on any other UI. I wish someone could figure that out. :(
--precision half does not work in my case; adding it triggers errors whose tracebacks point into several U-Net hijack extensions, even though I have disabled them.
Can you attach the full stack trace?
@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". Despite the name, it's not just for upscaling.
Forge uses an integrated version of this under the hood, if I remember right.
It would be nice to have the other memory-management tricks like moving models, etc., though, since I can pretty much use HR Fix at any size reliably in Forge.
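For context, here is a conceptual sketch of what a tiled VAE decode does; this is not the extension's actual implementation, and real code blends the overlap region rather than overwriting it. The latent is decoded tile by tile so peak VRAM stays bounded, and the decoded tiles are stitched back together.

```python
import torch

def tiled_decode(decode, latent, tile=64, overlap=8):
    """Decode `latent` in overlapping tiles to cap peak memory.
    `decode` is any function mapping a latent tile to an image tile at a fixed scale."""
    _, _, h, w = latent.shape
    step = tile - overlap
    out = None
    for y in range(0, h, step):
        for x in range(0, w, step):
            part = latent[:, :, y:y + tile, x:x + tile]
            dec = decode(part)                      # only one tile is decoded at a time
            scale = dec.shape[-1] // part.shape[-1]
            if out is None:
                out = torch.zeros(dec.shape[0], dec.shape[1], h * scale, w * scale,
                                  device=dec.device, dtype=dec.dtype)
            # Naive stitch: later tiles overwrite the overlap (real implementations blend).
            out[:, :, y * scale:y * scale + dec.shape[-2],
                      x * scale:x * scale + dec.shape[-1]] = dec
    return out

# Toy usage with a stand-in "decoder" that upsamples 8x, like an SD VAE would.
fake_decode = lambda z: torch.nn.functional.interpolate(z[:, :3], scale_factor=8.0)
image = tiled_decode(fake_decode, torch.randn(1, 4, 128, 128))
print(image.shape)   # torch.Size([1, 3, 1024, 1024])
```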
That doesn't work for this issue, sadly. It still happens.
Does it happen only during the HiRes Fix stage? That's the only place I exceed 24GB VRAM. It doesn't happen with Forge.
No, it happens every time; of course, the higher the resolution, the worse it is. On Forge, ComfyUI, or Invoke there's no increase in VRAM at all, not a single hiccup.
I saw a 10-19% speedup when using --precision half along with --opt-channelslast after merging this. Newer accelerators will benefit more from these changes but that's not to say that older ones aren't getting an uplift either. There was about a 9% speedup going from 30 to 40 series.
You can see more details here
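For reference, a sketch of what --opt-channelslast does conceptually (assumes a CUDA GPU; this is not the webui code itself): convolution weights and activations are put in channels-last (NHWC) memory format, which lets cuDNN pick faster kernels on recent NVIDIA hardware, and it composes with fp16 weights.

```python
import torch

if torch.cuda.is_available():
    conv = torch.nn.Conv2d(4, 320, 3, padding=1).cuda().half()
    conv = conv.to(memory_format=torch.channels_last)             # NHWC weight layout

    x = torch.randn(2, 4, 128, 128, device="cuda", dtype=torch.float16)
    x = x.to(memory_format=torch.channels_last)                   # NHWC activations

    y = conv(x)
    print(y.is_contiguous(memory_format=torch.channels_last))     # True
```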
Can you attach the full stack trace?
Here it is: trace.txt
Thank you so much for these improvements! A1111 speed is now on par with Forge thanks to you. :)