stable-diffusion-webui
Implement DeepCache Optimization
Description
DeepCache: yet another optimization.
For adjacent timesteps, the output of each layer can be considered almost the same in some cases. We can simply cache it and reuse it.
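A minimal sketch of the idea (the names, block split, and interval below are illustrative assumptions, not this PR's actual code): run the full U-Net only on occasional "refresh" steps, and reuse the cached deep features for the adjacent steps in between.

```python
cache = {"deep": None}
CACHE_INTERVAL = 3  # recompute the deep blocks only every few sampling steps

def unet_forward(x, t, step_index, in_blocks, deep_blocks, out_blocks):
    skips = []
    h = x
    for block in in_blocks:               # shallow blocks: cheap, always recomputed
        h = block(h, t)
        skips.append(h)

    if cache["deep"] is None or step_index % CACHE_INTERVAL == 0:
        d = h
        for block in deep_blocks:         # expensive mid/deep blocks
            d = block(d, t)
        cache["deep"] = d                 # cache the result for the next few steps
    h = cache["deep"]                     # adjacent steps: skip the deep blocks entirely

    for block in out_blocks:              # decoder blocks consume the skip connections
        h = block(h, skips.pop(), t)
    return h
```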
Note: this is most beneficial when there are many steps, such as in a DDIM setup. It won't produce a dramatic improvement for few-step inference, especially LCM.
The implementation was adapted from a gist and patched for compatibility.
Speed benchmark with SD 1.5 models (512x704, 23 steps, DPM++ SDE Karras sampler, 2x hires with Anime6B, 5-sample inference):

| Configuration | Speed |
| --- | --- |
| Vanilla | 2.67 it/s |
| HyperTile (all) | 3.74 it/s |
| DeepCache | 3.02 it/s |
| DeepCache + HyperTile | 4.59 it/s |
Compatibility
The optimization is compatible with ControlNet, at least (2.6 it/s at 512x680 with 2x hires, vs. 2.0 it/s without). With both DeepCache and HyperTile we can reach 4.7 it/s; yes, it is faster because the whole cache is reused in the hires pass.
Should be tested
We can currently change the checkpoint with Refiner / Hires. fix. Should the cache be invalidated then, or should we just reuse it?
Screenshots/videos:
Works with HyperTile too.
Checklist:
- [x] I have read contributing wiki page
- [x] I have performed a self-review of my own code
- [x] My code follows the style guidelines
- [x] My code passes tests
To test this on SDXL, go to forward_timestep_embed_patch.py and replace "ldm" with "sgm".
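For illustration only; the exact import inside forward_timestep_embed_patch.py may differ, but the swap amounts to pointing the patch at the SGM package instead of LDM, e.g.:

```python
# SD 1.5 / 2.x U-Net classes live under the ldm package:
# from ldm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential
# SDXL uses the sgm package instead:
from sgm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential
```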
Sounding nice. HyperTile didn't bring any speedup on SDXL; how about this one?
Enormous speed boost. Around 2-3x faster when it kicks in. However, I'm currently unable to get good quality results with it; I think the forward timestep embed patch might need to be further adapted to the SGM version, I'm not sure though.
@gel-crabs I will do more tests within 18 hours, but I guess this should work (as they share the same structure). @FurkanGozukara The XL code was released 5 hours ago, but I will only have a chance to implement this within a day, not immediately. The code seems to be very large...
@gel-crabs I guess we might have to adjust the indexes of the in/out blocks. The XL U-Net is deeper, so using shallow parts earlier would lead to caching "noisy" semantic information.
Note: the current implementation is quite different from the original paper; it follows the gist snippet and is more suitable for the frequently used samplers.
I adapted it to use the SGM code and the results are the exact same, so it doesn't need to be further adapted to SGM. I'm gonna do some testing with the in-out blocks and see how it goes.
Temporary update: I think the implementation should be modified to follow the original paper again.
The original paper says we should sample the values for nearby steps, not on a duration basis.
Although we can then only optimize the final steps, for SDXL I don't think the current approach is accurate... so this should be fixed again.
Block indexes: 0, 0, 0
Alright, I think I've gotten the correct blocks for SDXL:
So pretty much just the Cache In Block Indexes changed to 8 and 7.
Still some quality loss; the contrast is noticeably higher, which I've found is caused by caching the mid block.
768x768 test:

| Settings | Cache rate | Speed |
| --- | --- | --- |
| HyperTile only | - | 7.86 it/s |
| Index 0, 0, 0 | 27.23% | 8.03 it/s |
| Index 8, 8, 8 | 27.23% | 8.61 it/s |
| Index 0, 0, 5 | 42.37% | 10.8 it/s |
| Index 0, 0, 6 | 45.4% | 11.1 it/s |
| Index 0, 0, 8 | 51.45% | 11.51 it/s |
| Index 0, 0, 8 + cache-out start timestep 600 | 46.2% | 10.42 it/s |
| Index 0, 0, 8 + cache-out start timestep 600 + interval 50 | 34.9% | 9.18 it/s |
@gel-crabs I think we can use 0, 0, 8 for most cases.
Very interesting results. Thanks for your effort @aria1th! If you need any assistance, please feel free to reach out to us at any time.
The cache looks like it degrades quality significantly? @aria1th
Also, HyperTile doesn't look like it degrades quality, right?
@FurkanGozukara Yes, quality is degraded in XL-type models; it requires more experiments or maybe a re-implementation. This did not happen with 1.5-type models though.
I have a feeling it has something to do with the extra IN/MID/OUT blocks in SDXL. For instance in SD 1.5 IN710 corresponds to a layer, while in SDXL the equivalent is IN710-719 (so 10 blocks compared to 1).
The Elements tab in the SuperMerger extension is really good for showing this information. The middle block has 9 extra blocks in SDXL as well, so I'm betting it has something to do with that.
Oops, didn't see the new update. MUCH less quality loss than before. I'm gonna keep testing and see what I can find.
So the settings are this, right?
- In block index: 0
- In block index 2: 0
- Out block index: 8
Sorry for the spam, results and another question:
So with these settings on SDXL:
- In block index: 8
- In block index 2: 8
- Out block index: 0
- All starts set to 800, plus timestep refresh set to 50
I get next to no quality loss (even an upgrade!); however, the speedup is smaller, pretty much equivalent to a second HyperTile. So my question is: does the block cache index have any effect on the blocks before or after it? For instance, if the out block index is set to 8, does it cache the ones before it as well?
I ask this because there is another output block with the same resolution, which could be cached in addition to the output block cached already. I've gotten similarly high quality (and faster) results with in-blocks set to 7 and 8, which are the same resolution on SDXL.
If it gels with DeepCache I think a second Cache Out Block Index could result in a further speedup.
@gel-crabs I fixed some of the explanations: for the in types, the setting applies to blocks after the index, so -1 means cache everything; for the out types, it applies to blocks before the index, so 9 means cache everything.
The timestep is a fairly important setting: if we use 1000, we never refresh the cache once we have it.
This holds for 1.5-type models, which means they seem to already know what to draw at the first cache point (!). This somehow explains a few more things too... anyway.
However, XL models seem to have a problem with this: they have to refresh the cache frequently; they are very dynamic.
Unfortunately, refreshing the cache more often directly increases the cache failure rate, and thus reduces the performance gain...
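A rough sketch of how these settings could gate cache reuse; the parameter names and boundary details are my assumptions, and diffusion timesteps are taken to count down from roughly 1000 to 0 during sampling:

```python
def use_cached_features(timestep, last_refresh_timestep,
                        cache_start_timestep=800, refresh_interval=50):
    # Hypothetical helper, not the PR's actual code.
    if timestep > cache_start_timestep:
        return False  # early, high-noise steps: always run the full U-Net
    if last_refresh_timestep is None:
        return False  # nothing cached yet: compute and store
    if last_refresh_timestep - timestep >= refresh_interval:
        return False  # cache is stale: recompute (an interval of 1000 effectively means "never refresh")
    return True       # nearby step: reuse the cached deep features
```

Under this reading, a larger refresh interval trades accuracy for a higher cache hit rate, which matches the failure-rate note above.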
I'll test with mid blocks too.
I should also explain why quality gets degraded even when we cache less than everything: it's about input-output mismatching.
To summarize, each cache has a corresponding pair (as the U-Net blocks are connected by skip connections).
In other words, if we increase the input block index level, we have to decrease the output block index level.
(Images will be attached for further reference.)
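Purely illustrative, with a guessed boundary convention taken from the "-1 caches all" / "9 caches all" rule above (the helper and the block count are hypothetical):

```python
def blocks_to_cache(num_blocks, in_index, out_index):
    # In-blocks strictly *after* the in-index are cached (-1 -> all of them);
    # out-blocks strictly *before* the out-index are cached (num_blocks -> all of them).
    cached_in = [i for i in range(num_blocks) if i > in_index]
    cached_out = [i for i in range(num_blocks) if i < out_index]
    return cached_in, cached_out

# U-Net skip connections pair in-block i with out-block (num_blocks - 1 - i); keeping the
# cached in-blocks and cached out-blocks matched to the same pairs is the point of the
# rule above, so raising the in-index level should go with lowering the out-index level.
print(blocks_to_cache(9, in_index=0, out_index=8))
```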
However, I guess I should use a more recent implementation, or convert it from the diffusers pipeline... I'll be able to do this in about 12-24 hours. https://gist.github.com/laksjdjf/435c512bc19636e9c9af4ee7bea9eb86
New implementation (should be tested though): https://github.com/aria1th/sd-webui-deepcache-standalone
SD 1.5
512x704 test, with caching disabled for the first 40% of steps
Steps: 23, Sampler: DPM++ SDE Karras, CFG scale: 8, Seed: 3335110679, Size: 512x704, Model hash: 8c838299ab, VAE hash: 79e225b92f, VAE: blessed2.vae.pt, Denoising strength: 0.5, Hypertile U-Net: True, Hypertile U-Net max depth: 2, Hypertile U-Net max tile size: 64, Hypertile U-Net swap size: 12, Hypertile VAE: True, Hypertile VAE swap size: 2, Hires upscale: 2, Hires upscaler: R-ESRGAN 4x+ Anime6B, Version: v1.7.0-RC-16-geb2b1679
| Configuration | Speed |
| --- | --- |
| Enabled, reusing cache for HR steps | 5.68 it/s |
| Enabled | 4.66 it/s |
| Vanilla with HyperTile | 2.21 it/s |
| Vanilla without HyperTile | 1.21 it/s |
| Vanilla with DeepCache only | 2.83 it/s |
SDXL:
1girl
Negative prompt: easynegative, nsfw
Steps: 23, Sampler: DPM++ SDE Karras, CFG scale: 8, Seed: 3335110679, Size: 768x768, Model hash: 9a0157cad2, VAE hash: 235745af8d, VAE: sdxl_vae(1).safetensors, Denoising strength: 0.5, Hypertile U-Net: True, Hypertile U-Net max depth: 2, Hypertile U-Net max tile size: 64, Hypertile U-Net swap size: 12, Hypertile VAE: True, Hypertile VAE swap size: 2, Hires upscale: 2, Hires upscaler: R-ESRGAN 4x+ Anime6B, Version: v1.7.0-RC-16-geb2b1679
| Configuration | Speed |
| --- | --- |
| DeepCache + HR + HyperTile | 2.65 it/s (16.41 GB VRAM, fp16) |
| Without optimization | 1.47 it/s |
@gel-crabs Now it should work for both models!
Yeah, it works great! What Cache Resnet level did you use for SDXL?
(Also, what is your Hypertile VAE max tile size?)
Oh yeah, and another thing: I'm getting this in the console.
But yeah, the speedup here is absolutely immense. Do not miss out on this.
@gel-crabs Resnet level 0, which is the max, as it's supposed to be. The VAE max tile size was set to 128, swap size 6. The logs are removed!
Ahh, thank you! One more thing, perhaps another step percentage for HR fix?
Also, this literally halves the time it takes to generate an image. And it barely even changes the image at all. Thank you so much for your work.
@gel-crabs HR fix will use 100% of the cache (if the option is enabled; also, the success/failure rate reporting now requires rework, since some counts are per step and some are per function call...). But I guess it has to be checked with ControlNet / other extensions too.
Dang, I just checked with ControlNet and it makes the image go full orange. Dynamic Thresholding works perfectly though.
https://github.com/Mikubill/sd-webui-controlnet/blob/main/scripts/hook.py#L425 Okay, this explains why there is a bunch more code to deal with...
https://github.com/aria1th/sd-webui-controlnet/tree/maybe-deepcache-wont-work
I was trying various implementations, including the diffusers pipeline, and I guess it does not work well with ControlNet...
https://github.com/horseee/DeepCache/issues/4
ControlNet obviously applies timestep-dependent embeddings, which change the output of the U-Net drastically.
Thus, this is the expected output.
Compared to this:
Also, I had to patch the ControlNet extension; somehow the hook override was not working when I supplied the patched function in place. Even though it executed correctly, it completely ignored ControlNet.
So at this point I will just continue to release this as an extension. Unless someone comes up with properly compatible code, you should only use it without ControlNet 😢
Aww man, that sucks. This is seriously a game changer. :(
Also, it doesn't appear to work with FreeU. The HR fix only speeds up after the original step percentage, I assume because it doesn't cache the steps before the step percentage.
@gel-crabs Yeah, most of the U-Net forward hijacking functions won't work with this; it assumes the effects of nearby steps are similar.
Some more academic stuff:
DDIM works well with this: its hidden states change smoothly, so we can reuse nearby values. LCM won't work with this at all. Some schedulers move drastically fast in the initial steps, so we can safely disable caching for those steps - yes, that's what the parameter you see is for.
It means that whenever the U-Net outputs have to change quickly, the caching will mess things up.
But I guess this could be somewhat useful for training: we could force the model to denoise under the cache assumption? (Meanwhile, HyperTile is already useful for training.)
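As a small hedged sketch of that "disable for the initial steps" parameter (the fraction and names are assumptions, not the actual option):

```python
def caching_enabled(step_index, total_steps, disable_initial_fraction=0.4):
    # Skip caching during the fast-moving initial steps, where reusing features
    # from a neighbouring step would visibly change the result.
    return step_index >= int(total_steps * disable_initial_fraction)

# e.g. 23 steps with a 40% warm-up: caching only kicks in from step 9 onwards
assert caching_enabled(9, 23) and not caching_enabled(8, 23)
```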
Is DeepCache available in the WebUI now? How can we use it?
@bigmover https://github.com/aria1th/sd-webui-deepcache-standalone Please use the extension, and note that it can't be used with ControlNet or other extensions that hijack the U-Net in specific ways.