stable-diffusion-webui
Implement DeepCache Optimization
Description
DeepCache: yet another optimization.
For adjacent timesteps, the output of each layer can be considered almost the same in some cases. We can simply cache it and reuse it.
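A minimal sketch of the idea (the names, block split, and interval below are illustrative assumptions, not this PR's actual code): run the full U-Net only on occasional "refresh" steps, and reuse the cached deep features for the adjacent steps in between.

```python
cache = {"deep": None}
CACHE_INTERVAL = 3  # recompute the deep blocks only every few sampling steps

def unet_forward(x, t, step_index, in_blocks, deep_blocks, out_blocks):
    skips = []
    h = x
    for block in in_blocks:               # shallow blocks: cheap, always recomputed
        h = block(h, t)
        skips.append(h)

    if cache["deep"] is None or step_index % CACHE_INTERVAL == 0:
        d = h
        for block in deep_blocks:         # expensive mid/deep blocks
            d = block(d, t)
        cache["deep"] = d                 # cache the result for the next few steps
    h = cache["deep"]                     # adjacent steps: skip the deep blocks entirely

    for block in out_blocks:              # decoder blocks consume the skip connections
        h = block(h, skips.pop(), t)
    return h
```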
Note: this is most beneficial when there are many steps, such as in a DDIM setup. It won't produce a dramatic improvement for few-step inference, especially LCM.
The implementation was adapted from a gist and patched for compatibility.
Speed benchmark with SD 1.5 models (512x704, 23 steps, DPM++ SDE Karras sampler, 2x hires with Anime6B, 5-sample inference):

| Configuration | Speed |
| --- | --- |
| Vanilla | 2.67 it/s |
| HyperTile (all) | 3.74 it/s |
| DeepCache | 3.02 it/s |
| DeepCache + HyperTile | 4.59 it/s |
Compatibility
The optimization is compatible with ControlNet, at least (2.6 it/s at 512x680 with 2x hires, vs. 2.0 it/s without). With both DeepCache and HyperTile we can reach 4.7 it/s; yes, it is faster because the whole cache is reused in the hires pass.
Should be tested
We can currently change the checkpoint with Refiner / Hires. fix. Should the cache be invalidated then, or should we just reuse it?
Screenshots/videos:
Works with HyperTile too.
Checklist:
- [x] I have read contributing wiki page
- [x] I have performed a self-review of my own code
- [x] My code follows the style guidelines
- [x] My code passes tests
To test this on SDXL, go to forward_timestep_embed_patch.py and replace "ldm" with "sgm".
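For illustration only; the exact import inside forward_timestep_embed_patch.py may differ, but the swap amounts to pointing the patch at the SGM package instead of LDM, e.g.:

```python
# SD 1.5 / 2.x U-Net classes live under the ldm package:
# from ldm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential
# SDXL uses the sgm package instead:
from sgm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential
```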
Sounding nice. HyperTile didn't bring any speedup on SDXL; how about this one?
Enormous speed boost. Around 2-3x faster when it kicks in. However, I'm currently unable to get good quality results with it; I think the forward timestep embed patch might need to be further adapted to the SGM version, I'm not sure though.
@gel-crabs I will do more tests within 18 hours, but I guess this should work (as they share the same structure). @FurkanGozukara The XL code was released 5 hours ago, but I will only have a chance to implement this within a day, not immediately. The code seems to be very large...
@gel-crabs I guess we might have to adjust the indexes of the in/out blocks. The XL U-Net is deeper, so using shallow parts earlier would lead to caching "noisy" semantic information.
Note: the current implementation is quite different from the original paper; it follows the gist snippet and is more suitable for the frequently used samplers.
I adapted it to use the SGM code and the results are the exact same, so it doesn't need to be further adapted to SGM. I'm gonna do some testing with the in-out blocks and see how it goes.
Temporary update: I think the implementation should be modified to follow the original paper again.
The original paper says we should sample the values for nearby steps, not on a duration basis.
Although we can then only optimize the final steps, for SDXL I don't think the current approach is accurate... so this should be fixed again.
Block indexes: 0, 0, 0
Alright, I think I've gotten the correct blocks for SDXL:
So pretty much just the Cache In Block Indexes changed to 8 and 7.
Still some quality loss; the contrast is noticeably higher, which I've found is caused by caching the mid block.
768x768 test:

| Settings | Cache rate | Speed |
| --- | --- | --- |
| HyperTile only | - | 7.86 it/s |
| Index 0, 0, 0 | 27.23% | 8.03 it/s |
| Index 8, 8, 8 | 27.23% | 8.61 it/s |
| Index 0, 0, 5 | 42.37% | 10.8 it/s |
| Index 0, 0, 6 | 45.4% | 11.1 it/s |
| Index 0, 0, 8 | 51.45% | 11.51 it/s |
| Index 0, 0, 8 + cache-out start timestep 600 | 46.2% | 10.42 it/s |
| Index 0, 0, 8 + cache-out start timestep 600 + interval 50 | 34.9% | 9.18 it/s |
@gel-crabs I think we can use 0, 0, 8 for most cases.
Very interesting results. Thanks for your effort @aria1th! If you need any assistance, please feel free to reach out to us at any time.
The cache looks like it degrades quality significantly? @aria1th
Also, HyperTile doesn't look like it degrades quality, right?
@FurkanGozukara Yes, quality is degraded in XL-type models; it requires more experiments or maybe a re-implementation. This did not happen with 1.5-type models though.
I have a feeling it has something to do with the extra IN/MID/OUT blocks in SDXL. For instance in SD 1.5 IN710 corresponds to a layer, while in SDXL the equivalent is IN710-719 (so 10 blocks compared to 1).
The Elements tab in the SuperMerger extension is really good for showing this information. The middle block has 9 extra blocks in SDXL as well, so I'm betting it has something to do with that.
Oops, didn't see the new update. MUCH less quality loss than before. I'm gonna keep testing and see what I can find.
So the settings are this, right?
- In block index: 0
- In block index 2: 0
- Out block index: 8
Sorry for the spam, results and another question:
So with these settings on SDXL:
- In block index: 8
- In block index 2: 8
- Out block index: 0
- All starts set to 800, plus timestep refresh set to 50
I get next to no quality loss (even an upgrade!); however, the speedup is smaller, pretty much equivalent to a second HyperTile. So my question is: does the block cache index have any effect on the blocks before or after it? For instance, if the out block index is set to 8, does it cache the ones before it as well?
I ask this because there is another output block with the same resolution, which could be cached in addition to the output block cached already. I've gotten similarly high quality (and faster) results with in-blocks set to 7 and 8, which are the same resolution on SDXL.
If it gels with DeepCache I think a second Cache Out Block Index could result in a further speedup.
@gel-crabs I fixed some of the explanations: for the in types, the setting applies to blocks after the index, so -1 means cache everything; for the out types, it applies to blocks before the index, so 9 means cache everything.
The timestep is a fairly important setting: if we use 1000, we never refresh the cache once we have it.
This holds for 1.5-type models, which means they seem to already know what to draw at the first cache point (!). This somehow explains a few more things too... anyway.
However, XL models seem to have a problem with this: they have to refresh the cache frequently; they are very dynamic.
Unfortunately, refreshing the cache more often directly increases the cache failure rate, and thus reduces the performance gain...
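A rough sketch of how these settings could gate cache reuse; the parameter names and boundary details are my assumptions, and diffusion timesteps are taken to count down from roughly 1000 to 0 during sampling:

```python
def use_cached_features(timestep, last_refresh_timestep,
                        cache_start_timestep=800, refresh_interval=50):
    # Hypothetical helper, not the PR's actual code.
    if timestep > cache_start_timestep:
        return False  # early, high-noise steps: always run the full U-Net
    if last_refresh_timestep is None:
        return False  # nothing cached yet: compute and store
    if last_refresh_timestep - timestep >= refresh_interval:
        return False  # cache is stale: recompute (an interval of 1000 effectively means "never refresh")
    return True       # nearby step: reuse the cached deep features
```

Under this reading, a larger refresh interval trades accuracy for a higher cache hit rate, which matches the failure-rate note above.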
I'll test with mid blocks too.
I should also explain why quality gets degraded even when we cache less than everything: it's about input-output mismatching.
To summarize, each cache has a corresponding pair (as the U-Net blocks are connected by skip connections).
In other words, if we increase the input block index level, we have to decrease the output block index level.
(Images will be attached for further reference.)
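Purely illustrative, with a guessed boundary convention taken from the "-1 caches all" / "9 caches all" rule above (the helper and the block count are hypothetical):

```python
def blocks_to_cache(num_blocks, in_index, out_index):
    # In-blocks strictly *after* the in-index are cached (-1 -> all of them);
    # out-blocks strictly *before* the out-index are cached (num_blocks -> all of them).
    cached_in = [i for i in range(num_blocks) if i > in_index]
    cached_out = [i for i in range(num_blocks) if i < out_index]
    return cached_in, cached_out

# U-Net skip connections pair in-block i with out-block (num_blocks - 1 - i); keeping the
# cached in-blocks and cached out-blocks matched to the same pairs is the point of the
# rule above, so raising the in-index level should go with lowering the out-index level.
print(blocks_to_cache(9, in_index=0, out_index=8))
```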
However, I guess I should use a more recent implementation, or convert it from the diffusers pipeline... I'll be able to do this in about 12-24 hours. https://gist.github.com/laksjdjf/435c512bc19636e9c9af4ee7bea9eb86
New implementation (should be tested though): https://github.com/aria1th/sd-webui-deepcache-standalone
SD 1.5
512x704 test, with caching disabled for the first 40% of steps
Steps: 23, Sampler: DPM++ SDE Karras, CFG scale: 8, Seed: 3335110679, Size: 512x704, Model hash: 8c838299ab, VAE hash: 79e225b92f, VAE: blessed2.vae.pt, Denoising strength: 0.5, Hypertile U-Net: True, Hypertile U-Net max depth: 2, Hypertile U-Net max tile size: 64, Hypertile U-Net swap size: 12, Hypertile VAE: True, Hypertile VAE swap size: 2, Hires upscale: 2, Hires upscaler: R-ESRGAN 4x+ Anime6B, Version: v1.7.0-RC-16-geb2b1679
| Configuration | Speed |
| --- | --- |
| Enabled, reusing cache for HR steps | 5.68 it/s |
| Enabled | 4.66 it/s |
| Vanilla with HyperTile | 2.21 it/s |
| Vanilla without HyperTile | 1.21 it/s |
| Vanilla with DeepCache only | 2.83 it/s |
SDXL:
1girl
Negative prompt: easynegative, nsfw
Steps: 23, Sampler: DPM++ SDE Karras, CFG scale: 8, Seed: 3335110679, Size: 768x768, Model hash: 9a0157cad2, VAE hash: 235745af8d, VAE: sdxl_vae(1).safetensors, Denoising strength: 0.5, Hypertile U-Net: True, Hypertile U-Net max depth: 2, Hypertile U-Net max tile size: 64, Hypertile U-Net swap size: 12, Hypertile VAE: True, Hypertile VAE swap size: 2, Hires upscale: 2, Hires upscaler: R-ESRGAN 4x+ Anime6B, Version: v1.7.0-RC-16-geb2b1679
| Configuration | Speed |
| --- | --- |
| DeepCache + HR + HyperTile | 2.65 it/s (16.41 GB VRAM, fp16) |
| Without optimization | 1.47 it/s |
@gel-crabs Now it should work for both models!
Yeah, it works great! What Cache Resnet level did you use for SDXL?
(Also, what is your Hypertile VAE max tile size?)
Oh yeah, and another thing: I'm getting this in the console.
But yeah, the speedup here is absolutely immense. Do not miss out on this.
@gel-crabs Resnet level 0, which is the max, as it's supposed to be. The VAE max tile size was set to 128, swap size 6. The logs are removed!
Ahh, thank you! One more thing, perhaps another step percentage for HR fix?
Also, this literally halves the time it takes to generate an image. And it barely even changes the image at all. Thank you so much for your work.
@gel-crabs HR fix will use 100% of the cache (if the option is enabled; also, the success/failure rate reporting now requires rework, since some counts are per step and some are per function call...). But I guess it has to be checked with ControlNet / other extensions too.
Dang, I just checked with ControlNet and it makes the image go full orange. Dynamic Thresholding works perfectly though.
https://github.com/Mikubill/sd-webui-controlnet/blob/main/scripts/hook.py#L425 Okay, this explains why there is a bunch more code to deal with...
https://github.com/aria1th/sd-webui-controlnet/tree/maybe-deepcache-wont-work
I was trying various implementations, including the diffusers pipeline, and I guess it does not work well with ControlNet...
https://github.com/horseee/DeepCache/issues/4
ControlNet obviously applies timestep-dependent embeddings, which change the output of the U-Net drastically.
Thus, this is the expected output.
Compared to this:
Also, I had to patch the ControlNet extension; somehow the hook override was not working when I supplied the patched function in place. Even though it executed correctly, it completely ignored ControlNet.
So at this point I will just continue to release this as an extension. Unless someone comes up with properly compatible code, you should only use it without ControlNet 😢
Aww man, that sucks. This is seriously a game changer. :(
Also, it doesn't appear to work with FreeU. The HR fix only speeds up after the original step percentage, I assume because it doesn't cache the steps before the step percentage.
@gel-crabs Yeah, most of the U-Net forward hijacking functions won't work with this; it assumes the effects of nearby steps are similar.
Some more academic stuff:
DDIM works well with this: its hidden states change smoothly, so we can reuse nearby values. LCM won't work with this at all. Some schedulers move drastically fast in the initial steps, so we can safely disable caching for those steps - yes, that's what the parameter you see is for.
It means that whenever the U-Net outputs have to change quickly, the caching will mess things up.
But I guess this could be somewhat useful for training: we could force the model to denoise under the cache assumption? (Meanwhile, HyperTile is already useful for training.)
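As a small hedged sketch of that "disable for the initial steps" parameter (the fraction and names are assumptions, not the actual option):

```python
def caching_enabled(step_index, total_steps, disable_initial_fraction=0.4):
    # Skip caching during the fast-moving initial steps, where reusing features
    # from a neighbouring step would visibly change the result.
    return step_index >= int(total_steps * disable_initial_fraction)

# e.g. 23 steps with a 40% warm-up: caching only kicks in from step 9 onwards
assert caching_enabled(9, 23) and not caching_enabled(8, 23)
```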
Is DeepCache available in the WebUI now? How can we use it?
@bigmover https://github.com/aria1th/sd-webui-deepcache-standalone Please use the extension, and note that it can't be used with ControlNet or other extensions that hijack the U-Net in specific ways.