stable-diffusion-webui
Add Consistency Decoder to VAE options.
Description
Consistency Decoder: https://github.com/openai/consistencydecoder. I added it just like TAESD.
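For context, upstream exposes the decoder through a small Python API. This is a minimal sketch following the openai/consistencydecoder README (exact signatures may differ across versions, and the random latent here is just illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline
from consistencydecoder import ConsistencyDecoder

# Load an SD 1.x pipeline so we can reuse its VAE encoder.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The consistency model replaces only the decoder half of the VAE;
# encoding still goes through the regular AutoencoderKL.
decoder = ConsistencyDecoder(device="cuda:0")

with torch.no_grad():
    # latent: [B, 4, H/8, W/8], e.g. sampled from the diffusion model
    # or from pipe.vae.encode(image).latent_dist.sample()
    latent = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
    image_gan = pipe.vae.decode(latent).sample  # regular VAE decode
    image_consistency = decoder(latent)         # consistency decode
```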
It requires a lot of resources to "decode" the image (it is actually a latent-guided consistency model that works directly in pixel space), so we may want to implement some tiling method for it. But naive tiling may not be a good idea, and a mathematically identical tiling algorithm requires a PyTorch implementation (I'm testing one, but it hasn't been successful so far).
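To make the tiling concern concrete, here is the naive approach I mean (a hypothetical sketch, not code from this PR): each tile is decoded independently, so the model has no context across tile borders and seams appear.

```python
import torch

def naive_tiled_decode(decode_fn, latent, tile=32, scale=8):
    """Decode a [B, 4, H, W] latent in independent tiles.

    Each tile is decoded with no knowledge of its neighbours, so the
    consistency model's outputs disagree at tile borders and visible
    seams appear. A mathematically identical tiled version would have
    to share intermediate state across tile boundaries inside the
    model itself, which is the part that needs a real PyTorch
    implementation.
    """
    b, c, h, w = latent.shape
    out = latent.new_zeros(b, 3, h * scale, w * scale)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = latent[:, :, y:y + tile, x:x + tile]
            out[:, :, y * scale:(y + tile) * scale,
                x * scale:(x + tile) * scale] = decode_fn(patch)
    return out
```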
This may require more development, but since it is working as it should for now, I'm opening the PR.
Checklist:
- [x] I have read contributing wiki page
- [x] I have performed a self-review of my own code
- [x] My code follows the style guidelines
- [x] My code passes tests
It looks pretty much useless. For decoding at 1024x1024 it consumes 26 GB of VRAM and is 10+ times slower than a regular VAE. In ClosedAI's examples they even use 256x256... At least for anime at 1024x1024, this produces images even slightly worse than 840k.
For anime models it seems like a clear downgrade:
VAE, took 8.9s to generate:
Consistency decoder, took 22.6s:
(although the bow on the girl in 4th pic seems more consistent)
Here's for a normal photo generation:
VAE, 13.9 sec. A: 5.60 GB, R: 6.90 GB, Sys: 9.4/24 GB (39.2%)
Consistency decoder, 46.5 sec. A: 13.79 GB, R: 22.14 GB, Sys: 24.0/24 GB (100.0%)
Photo with a skyscraper:
VAE, 13.8 sec. A: 5.60 GB, R: 6.90 GB, Sys: 7.4/24 GB (30.7%):
Consistency decoder, 34.8 sec. A: 13.80 GB, R: 22.14 GB, Sys: 24.0/24 GB (100.0%)
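For anyone wanting to reproduce these numbers: A/R look like PyTorch's peak allocated/reserved counters. A minimal measurement sketch (assuming a CUDA device and a `decode_fn` of your choice):

```python
import time
import torch

def benchmark_decode(decode_fn, latent):
    """Time one decode call and report peak allocated/reserved VRAM."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        image = decode_fn(latent)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    gib = 1024 ** 3
    print(f"{elapsed:.1f} sec. "
          f"A: {torch.cuda.max_memory_allocated() / gib:.2f} GB, "
          f"R: {torch.cuda.max_memory_reserved() / gib:.2f} GB")
    return image
```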
Do you have any examples with more distant photorealistic faces with the consistency decoder instead? I'd be willing to wait the extra time to finally get some decent mid-range faces without adetailer. I have to say also, in the anime example the colors look way better with the consistency decoder on an OLED screen here.
VAE:
Consistency decoder:
This is on one of the new cool-kids models; it works better at larger resolutions, but I generated at 768x512 to invoke weird faces.
As for the anime colors, this is a NAI VAE thing. Here's the same picture using vae-ft-mse-840000-ema-pruned.ckpt without the consistency decoder:
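If you want to reproduce the VAE swap outside the webui, a diffusers sketch looks like this (the 840k VAE is also published on the Hub as stabilityai/sd-vae-ft-mse; treat the rest as illustrative):

```python
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in the ft-MSE-840000 VAE instead of the checkpoint's baked-in one.
pipe.vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cake on a table").images[0]
image.save("840k_vae.png")
```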
Also, you can check out this PR and run your own comparisons for testing purposes.
Anime is going to look terrible with anything other than animevae, because NAI finetuned the VAE. Eyes in particular turn out worse with any other VAE, as the examples posted show. And the washed-out color is a simple postprocessing fix.
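By "simple postprocessing fix" I mean something like a contrast/saturation bump; a rough sketch with Pillow (the factors are arbitrary and just for illustration):

```python
from PIL import Image, ImageEnhance

def unwash(path, contrast=1.08, saturation=1.15):
    """Crude fix for the washed-out NAI-VAE look: bump contrast and color."""
    img = Image.open(path)
    img = ImageEnhance.Contrast(img).enhance(contrast)
    img = ImageEnhance.Color(img).enhance(saturation)
    return img

unwash("nai_vae_output.png").save("fixed.png")
```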
As for this VAE's intended purpose of replacing the stock SD VAE: it's certainly better than the one included in the base SD checkpoints (from CompVis), but whether it's better than the ones StabilityAI later finetuned I'd say is situational. The example posted here of the cake might be the only one that looks nicer with Consistency, imo. You have to A/B to make out any difference, but the details look more organic, whereas with a normal GAN VAE it looks more like noisy predictions.
I've tested this PR myself over the last couple of days, and the examples shown here align with my tests as well. I don't feel the need to post any more comparisons personally.
Thanks for those examples. Honestly, it kinda looks a bit worse compared to just the VAE. I was pretty hyped about the Consistency Decoder, but I guess my excitement's cooled off a bit now.
some updates here:
This PR will never be merged, but I will leave it open until we finish the changes to latent-related things.
The plan is that I will make a base class for latent decode/encode/processing, which will let extensions add their own latent processing easily. Then I will make an example extension which uses the Consistency Decoder for decoding. Once I'm done, I will close this PR.
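Roughly what I have in mind, as a hypothetical sketch (none of these names exist in the codebase yet):

```python
import torch

class LatentProcessor:
    """Hypothetical base class: extensions subclass this and register
    an instance to plug their own encode/decode into the pipeline."""

    name = "base"

    def encode(self, image: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    def decode(self, latent: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


_processors: dict[str, LatentProcessor] = {}

def register_processor(processor: LatentProcessor) -> None:
    """Extensions call this at load time to expose their processor."""
    _processors[processor.name] = processor


class ConsistencyDecoderProcessor(LatentProcessor):
    """Example extension: decode with the Consistency Decoder,
    keep the regular VAE for encoding."""

    name = "consistency-decoder"

    def __init__(self, vae, decoder):
        self.vae = vae          # regular AutoencoderKL (for encode)
        self.decoder = decoder  # consistencydecoder.ConsistencyDecoder

    def encode(self, image):
        return self.vae.encode(image).latent_dist.sample()

    def decode(self, latent):
        return self.decoder(latent)

# register_processor(ConsistencyDecoderProcessor(pipe.vae, decoder))
```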
closing; reopen if needed