
[DO NOT MERGE] All perf improvements bundle

Open huchenlei opened this issue 1 year ago • 28 comments

DO NOT MERGE THIS PR, merge individual PRs instead. This PR is for users to try out all performance improvements together.

Description

This is a bundle PR of all performance improvement PRs:

  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15803
  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15804
  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15805
  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15806
  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15816
  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15820

How to use

  • Check out this PR

    • In your A1111 repo directory, open a terminal
    • Use the GitHub CLI to check out this PR: gh pr checkout 15821
  • Add --precision half to your command line args if your GPU supports fp16 computation.

Unpatch the PR

  • In your A1111 repo directory, open a terminal
  • git checkout master

Expected performance improvement

For SDXL, this PR brings performance from 580ms/it to 280ms/it on my machine. However, this covers only the UNet's denoising steps and does not include other stages such as VAE encode/decode and saving the image; overall you should still expect at least a 20% performance boost.
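As a rough back-of-the-envelope illustration of how the per-step gain translates end to end (only the 580 ms/it and 280 ms/it figures come from the measurement above; the assumed 70% share of time spent in denoising is hypothetical):

unet_speedup = 580 / 280                         # ~2.07x on the denoising loop itself
overall = 1 / (0.30 + 0.70 / unet_speedup)       # Amdahl's law with an assumed 70% denoising share
print(f"estimated end-to-end speedup: {(overall - 1) * 100:.0f}%")  # ~57%, well above the 20% floor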

Report issues

Please report any bugs related to this batch of performance improvements to https://github.com/huchenlei/stable-diffusion-webui/issues

My tests on these PRs have limited coverage, so some features might be broken; I would like to get those fixed before merging.


huchenlei avatar May 17 '24 00:05 huchenlei

You can mark the PR as a draft to prevent it from being merged accidentally

light-and-ray avatar May 17 '24 01:05 light-and-ray

Clicking Generate gives:

Traceback (most recent call last):
      File "D:\AI\stable-diffusion-webui\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
      File "D:\AI\stable-diffusion-webui\modules\call_queue.py", line 36, in f
        res = func(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
        processed = processing.process_images(p)
      File "D:\AI\stable-diffusion-webui\modules\processing.py", line 839, in process_images
        res = process_images_inner(p)
      File "D:\AI\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
        return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
      File "D:\AI\stable-diffusion-webui\modules\processing.py", line 1322, in sample
        samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
      File "D:\AI\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 218, in sample
        samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
      File "D:\AI\stable-diffusion-webui\modules\sd_samplers_common.py", line 272, in launch_sampling
        return func()
      File "D:\AI\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 218, in <lambda>
        samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 594, in sample_dpmpp_2m
        denoised = model(x, sigmas[i] * s_in, **extra_args)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\sd_samplers_cfg_denoiser.py", line 237, in forward
        x_out = self.inner_model(x_in, sigma_in, cond=make_condition_dict(cond_in, image_cond_in))
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
        eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
        return self.inner_model.apply_model(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\sd_hijack_utils.py", line 22, in <lambda>
        setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
      File "D:\AI\stable-diffusion-webui\modules\sd_hijack_utils.py", line 34, in __call__
        return self.__sub_func(self.__orig_func, *args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\sd_hijack_unet.py", line 48, in apply_model
        result = orig_func(self, x_noisy.to(devices.dtype_unet), t.to(devices.dtype_unet), cond, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
        x_recon = self.model(x_noisy, t, **cond)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1335, in forward
        out = self.diffusion_model(x, t, context=cc)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\sd_unet.py", line 91, in UNetModel_forward
        return original_forward(self, x, timesteps, context, *args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 797, in forward
        h = module(h, emb, context)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 86, in forward
        x = layer(x)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\extensions\a1111-sd-webui-lycoris\l_networks.py", line 524, in network_Conv2d_forward
        return originals.Conv2d_forward(self, input)
      File "D:\AI\stable-diffusion-webui\extensions-builtin\Lora\networks.py", line 523, in network_Conv2d_forward
        return originals.Conv2d_forward(self, input)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
        return self._conv_forward(input, self.weight, self.bias)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
        return F.conv2d(input, weight, bias, self.stride,
    RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same

Now what?

Gushousekai195 avatar May 17 '24 04:05 Gushousekai195

In my environment, the SDXL model failed to load because the FP8 weight option was enabled in the Optimizations settings (see screenshot).

When I disabled it, the SDXL model loaded without problems.

serick4126 avatar May 17 '24 05:05 serick4126

Getting a similar error to @Gushousekai195 (no LyCORIS, and it happens even with all LoRAs disabled). One of these patches is breaking SD 1.5; only SDXL works.

Edit: Narrowed it down to --precision half

feffy380 avatar May 17 '24 07:05 feffy380

SDXL, Nvidia 4090 + Intel 12700K - seeing a 22.04% increase in speed. No (noticeable?) effect on image output.

bob7l avatar May 17 '24 16:05 bob7l

SD15 generation issue fixed.

huchenlei avatar May 17 '24 17:05 huchenlei

I don't see a significant performance boost; only SDXL + 2 ControlNets gets about a 10-15% boost. Btw, I see roughly the same in Forge:

rtx 3060 + 10400f
--medvram-sdxl --xformers --disable-model-loading-ram-optimization
python: 3.11.6  •  torch: 2.1.2+cu121  •  xformers: 0.0.23.post1

First column = non-patched | second = patched

sd1 1 image:
3.0 sec. | 3.0 sec.

sd1 batch_size 10:
22.8 sec. | 23.0 sec.

sd1 + cn canny, depth, batch_size 10:
38.0 sec. | 37.6 sec.

sdxl 1 image:
17.9 sec. | 17.2 sec.

sdxl + cn canny, depth 1 image:
39.0 sec. | 34.6 sec.

AnimateDiff + CN Inpaint + SparseCtrl works

Maybe my CPU doesn't match my GPU, or vice versa. But it's not worse than non-patched, and other users see a boost, so I like this work. 👍🏻 Now I will test it on a 2GB GPU.

light-and-ray avatar May 17 '24 18:05 light-and-ray

I tried to select the best time, but it's definitely slower on a very low VRAM setup. Maybe it conflicts with some optimizations?

mx150 2gb aka gt 1030
--xformers --lowvram
Optimizations are in the screenshot
python: 3.10.6  •  torch: 2.1.2+cu121  •  xformers: 0.0.23.post1

First column = non-patched | second = patched

sd1 + merged lcm lora:
20.3 sec. | 22.3 sec.

sd1 + merged lcm lora + t2ia canny:
22.1 sec. | 22.8 sec.

sd1 + merged lcm lora + hiresfix 2x + tiled vae:
2 min. 22.1 sec. | 2 min. 40.7 sec.

Screenshot_20240517_223154

light-and-ray avatar May 17 '24 18:05 light-and-ray

I tried to select the best time, but it's definitely slower on a very low VRAM setup. [...]

Can you attach traces of your experiment? I am not sure which part of the optimization is affecting low-VRAM performance. You can record a trace according to the instructions in https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/716

Running 2 steps should probably be enough.
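For anyone who doesn't want to dig through the linked discussion, recording such a trace essentially means wrapping a short generation in torch.profiler. A minimal, self-contained sketch (the dummy workload below is only a stand-in for a real 2-step generation and assumes a CUDA-capable setup):

import torch
from torch.profiler import profile, ProfilerActivity

def workload():
    # stand-in for a short 2-step generation; replace with the real thing
    conv = torch.nn.Conv2d(4, 4, 3, padding=1).cuda().half()
    x = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
    for _ in range(2):
        x = conv(x)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
) as prof:
    workload()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")  # gzip and attach this file to the issue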

huchenlei avatar May 17 '24 20:05 huchenlei

Okay @huchenlei

sd1 + merged lcm lora + t2ia canny
4 steps

Non-patched:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        model_inference        10.60%        2.483s       100.00%       23.436s       23.436s       0.000us         0.00%       20.588s       20.588s           0 b      -3.69 Gb     993.50 Kb      -9.40 Gb             1  
                                        cudaMemcpyAsync        72.21%       16.922s        72.46%       16.982s       2.585ms      54.251ms         0.27%      54.251ms       8.259us           0 b           0 b           0 b           0 b          6569  
                                               aten::to         0.30%      70.688ms        68.23%       15.990s       1.408ms       0.000us         0.00%        6.300s     554.910us       3.66 Gb      13.43 Mb      14.53 Gb     262.38 Mb         11353  
                                         aten::_to_copy         0.48%     113.589ms        68.05%       15.949s       1.462ms       0.000us         0.00%        6.335s     580.598us       3.66 Gb     598.63 Kb      14.53 Gb           0 b         10912  
                                            aten::copy_         0.74%     172.485ms        67.06%       15.715s       1.401ms        6.000s        29.97%        6.426s     572.681us           0 b           0 b           0 b           0 b         11221  
                                           aten::conv2d         0.05%      11.844ms         9.87%        2.314s       2.738ms       0.000us         0.00%       11.814s      13.981ms           0 b           0 b       2.90 Gb      -4.70 Gb           845  
                                             aten::item         0.00%      89.000us         9.56%        2.241s     149.386ms       0.000us         0.00%      12.138ms     809.200us           0 b           0 b           0 b           0 b            15  
                              aten::_local_scalar_dense         0.00%     241.000us         9.56%        2.241s     149.380ms      15.000us         0.00%      12.138ms     809.200us           0 b           0 b           0 b           0 b            15  
                                      aten::convolution         0.02%       3.958ms         8.77%        2.056s       4.558ms       0.000us         0.00%        6.954s      15.419ms           0 b           0 b       1.97 Gb           0 b           451  
                                     aten::_convolution         0.05%      10.801ms         8.75%        2.052s       4.549ms       0.000us         0.00%        6.954s      15.419ms           0 b           0 b       1.97 Gb      -4.00 Mb           451  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 23.436s
Self CUDA time total: 20.022s


Patched:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        model_inference        19.16%        9.898s       100.00%       51.653s       51.653s       0.000us         0.00%       40.367s       40.367s           0 b      -3.68 Gb       7.24 Mb     -18.52 Gb             1  
                                        cudaMemcpyAsync        59.11%       30.531s        60.30%       31.144s       4.745ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b          6564  
                                               aten::to         0.19%      98.962ms        58.74%       30.340s       2.701ms       0.000us         0.00%        7.325s     652.075us       3.66 Gb     522.07 Kb      14.50 Gb     237.55 Mb         11233  
                                         aten::_to_copy         0.46%     239.181ms        58.56%       30.249s       2.800ms       0.000us         0.00%        7.349s     680.168us       3.66 Gb           0 b      14.50 Gb           0 b         10804  
                                            aten::copy_         0.71%     365.811ms        57.60%       29.754s       2.677ms        7.471s        18.51%        7.471s     672.260us           0 b           0 b           0 b           0 b         11113  
                                           aten::conv2d         0.04%      20.753ms        16.41%        8.474s      10.040ms       0.000us         0.00%       26.478s      31.372ms           0 b           0 b       2.90 Gb      -4.70 Gb           844  
                                      aten::convolution         0.02%       8.169ms        15.56%        8.038s      17.824ms       0.000us         0.00%       17.953s      39.807ms           0 b           0 b       1.97 Gb           0 b           451  
                                     aten::_convolution         0.04%      23.143ms        15.55%        8.030s      17.806ms       0.000us         0.00%       17.953s      39.807ms           0 b           0 b       1.97 Gb      -4.00 Mb           451  
                                aten::cudnn_convolution         0.85%     437.998ms        15.42%        7.966s      17.662ms       15.939s        39.48%       15.939s      35.341ms           0 b           0 b       1.97 Gb       1.79 Gb           451  
                                               cudaFree        13.98%        7.223s        14.52%        7.499s     299.956ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b            25  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 51.653s
Self CUDA time total: 40.368s

For some reason, the time difference becomes much bigger with tracing enabled.

light-and-ray avatar May 17 '24 21:05 light-and-ray

It is very strange. Both runs were repeated about 3 times and I pasted the last one, to exclude any model loading from disk and ControlNet preprocessing.

NB: this GPU has very slow VRAM; maybe that is related.

light-and-ray avatar May 17 '24 21:05 light-and-ray

Visually, the first 2 steps are okay, but the last 2 steps are slower after the patch.

Also, I'm attaching the trace files: trace_non_patched.json.gz and trace_patched.json.gz

light-and-ray avatar May 17 '24 21:05 light-and-ray

Use the GitHub CLI to check out this PR: gh pr checkout 15821

Just want to add that you can check out any GitHub PR without the GH CLI client using standard Git commands:

git fetch origin pull/ID/head:NAME
git checkout NAME

In this case, for example:

git fetch origin pull/15821/head:15821 && git checkout 15821

Just to save someone the install if they don't need the GH client otherwise.

strawberrymelonpanda avatar May 18 '24 06:05 strawberrymelonpanda

Great job! Merged the remote branch locally, and I'm now seeing faster gens on A1111 than on ComfyUI :)

not-ski avatar May 18 '24 07:05 not-ski

I just tested it: there is a speed improvement (around 8% for an RTX 3090), however autocast fails and we get black output on SDXL

image

FurkanGozukara avatar May 18 '24 09:05 FurkanGozukara

I just tested it: there is a speed improvement (around 8% for an RTX 3090), however autocast fails and we get black output on SDXL

image

--precision half disables autocast, so you should not use the FP8 option in settings either. Casting during inference is a big source of performance overhead.
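For context, here is a minimal, illustrative PyTorch repro of this kind of dtype mismatch (not webui code; it just shows why an fp32 tensor hitting an fp16 module fails, and why casting the input resolves it):

import torch

conv = torch.nn.Conv2d(4, 4, 3).cuda().half()   # module weights/bias in fp16, as with --precision half
x = torch.randn(1, 4, 64, 64, device="cuda")    # fp32 input, e.g. left uncast by an extension hijack

try:
    conv(x)               # raises a RuntimeError about mismatched input/bias dtypes
except RuntimeError as e:
    print(e)

out = conv(x.half())      # casting the input to the module dtype resolves it
print(out.dtype)          # torch.float16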

huchenlei avatar May 18 '24 13:05 huchenlei

--precision half disables autocast, so you should not use the FP8 option in settings either. Casting during inference is a big source of performance overhead.

this is the only option I start with:

@echo off
set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers
call webui.bat

image

FurkanGozukara avatar May 18 '24 14:05 FurkanGozukara

I just tested it: there is a speed improvement (around 8% for an RTX 3090), however autocast fails and we get black output on SDXL

Using sdxl-vae-fp16-fix as the VAE seems to fix the black output. Tested using --precision half. Some checkpoints, like AlbedoBaseXL, work as-is.

b-fission avatar May 18 '24 16:05 b-fission

I don't have extensive numbers, but 512x512 generates noticeably faster, as fast as Forge if not faster: 2 seconds to generate an image with 40 steps using DPM++ 2M SGM Uniform with a regular 1.5 checkpoint, and 15 seconds at 1024x1024 with SDXL, on a 3080 12GB.

I feel like highres fix and img2img are the bigger bottlenecks now, but I dunno how feasible it is to optimize them even further, especially since these fixes did also noticeably increase their speed. Maybe on the side of the upscalers, since some are noticeably slower than others just by their nature, but I guess it just is what it is due to hardware.

Also ran into an issue, which might be related to --precision half too: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half

Seems to happen when loading an SD 1.5 checkpoint, then loading an SDXL checkpoint and trying to generate. Loading a different SDXL checkpoint seems to fix it, but then happens all over again when loading between SD 1.5 and SDXL. This also happens if the checkpoint the UI loads by default is SD 1.5.

freecoderwaifu avatar May 18 '24 23:05 freecoderwaifu

Quick test using a 3060, doing a 4-batch at 896x1152 with 2 LORAs at 20 steps, DPM++ 3M Exponential

Forge: 1:20. A1111 with this PR: 0:58

A single image with this PR is around 15 sec. Very nice! I'm going to use this from now on unless some issue comes up. Great work!

mweldon avatar May 19 '24 02:05 mweldon

--precision half does not work in my case; adding it triggers errors traced to several U-Net hijack extensions even though I have disabled them.

Main error:

    File "I:\stable-diffusion-webui-updated\venv\lib\site-packages\torch\functional.py", line 377, in einsum
        return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
    RuntimeError: expected scalar type Half but found Float

enternalsaga avatar May 19 '24 12:05 enternalsaga

I'd love to go back to A1111 from Forge, but the issue where the VRAM shoots up extremely on the last step of a generation (I'm assuming it's the VAE decode) makes me not want to touch it anymore. This doesn't happen on any other UI. I wish someone could figure that out. :(

Zotikus1001 avatar May 19 '24 12:05 Zotikus1001

--precision half does not work in my case; adding it triggers errors traced to several U-Net hijack extensions even though I have disabled them. [...]

Can you attach the full stacktrace?

huchenlei avatar May 19 '24 13:05 huchenlei

the issue where the VRAM shoots up extremely on the last step of a generation (I'm assuming it's the VAE decode),

@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". Despite the name, it's not just for upscaling.

Forge is using an integrated version of this under the hood if I remember right.

It would be nice to have the other memory management tricks like moving models, etc though, since I can pretty much use HR Fix with any size reliably in Forge.

strawberrymelonpanda avatar May 19 '24 18:05 strawberrymelonpanda

@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". [...]

That doesn't work for this issue, sadly. It still happens.

Zotikus1001 avatar May 19 '24 19:05 Zotikus1001

That doesn't work for this issue, sadly. It still happens.

Does it happen only during the HiRes Fix stage? That's the only place I exceed 24GB VRAM. Doesn't happen with Forge

bob7l avatar May 19 '24 20:05 bob7l

Does it happen only during the HiRes Fix stage? That's the only place I exceed 24GB VRAM. Doesn't happen with Forge

No, it happens every time; of course, the higher the resolution, the worse it is. On Forge, ComfyUI, or Invoke there's no increase in VRAM at all, not a single hiccup.

Zotikus1001 avatar May 19 '24 23:05 Zotikus1001

I saw a 10-19% speedup when using --precision half along with --opt-channelslast after merging this. Newer accelerators will benefit more from these changes but that's not to say that older ones aren't getting an uplift either. There was about a 9% speedup going from 30 to 40 series.

You can see more details here
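For those unfamiliar with the flag, --opt-channelslast roughly corresponds to keeping the model and activations in PyTorch's channels_last memory format, which recent GPUs handle more efficiently for convolutions. A simplified sketch of the idea (not the exact webui code path):

import torch

model = torch.nn.Conv2d(4, 320, 3, padding=1).cuda().half()
model = model.to(memory_format=torch.channels_last)   # NHWC weight layout

x = torch.randn(1, 4, 128, 128, device="cuda", dtype=torch.float16)
x = x.to(memory_format=torch.channels_last)           # NHWC activations

with torch.inference_mode():
    y = model(x)

print(y.is_contiguous(memory_format=torch.channels_last))  # typically True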

papuSpartan avatar May 22 '24 15:05 papuSpartan

--precision half does not work in my case; adding it triggers errors traced to several U-Net hijack extensions even though I have disabled them. [...]

Can you attach the full stacktrace?

Here it is: trace.txt

enternalsaga avatar May 23 '24 14:05 enternalsaga

Thank you so much for these improvements! 1111 speed is now on par with forge thanks to you. :)

ByteSh0ck avatar May 26 '24 14:05 ByteSh0ck