
Image preview

stduhpf opened this issue 1 year ago

Forked off from https://github.com/leejet/stable-diffusion.cpp/pull/454. Would also probably replace https://github.com/leejet/stable-diffusion.cpp/pull/416.

  • Move the preview decoding logic from examples/cli/main.cpp to stable-diffusion.cpp
  • Image preview is disabled by default
  • Adds the possibility to choose between previewing the image with a latent projection (as demonstrated in #454), TAE, or the full VAE (see the sketch after this list)
  • Adds the possibility to load TAE for preview only (the final image is still decoded with the VAE)
  • The default image preview path is preview.png
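For illustration only, the option set described above could be sketched roughly like this; all the names here are hypothetical, not the PR's actual API:

#include <string>

// Hypothetical illustration of the options listed above; not the PR's actual API.
enum class PreviewMethod { None, LatentProj, TAE, VAE };

struct PreviewOptions {
    PreviewMethod method  = PreviewMethod::None; // preview disabled by default
    bool tae_preview_only = false;               // load TAE just for previews, still decode the final image with the VAE
    std::string path      = "preview.png";       // default image preview path
};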

Related to #354: if the user uses an image viewer that refreshes its render when the image file changes, it's possible to watch the progress in real time.
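For example, a viewer like feh can poll the file for changes: feh --reload 1 preview.png re-reads preview.png every second, so each freshly written preview appears as sampling progresses.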

stduhpf avatar Dec 13 '24 18:12 stduhpf

+1 for this in main, great work!

wandbrandon avatar Feb 17 '25 19:02 wandbrandon

I would really like to get this onto master. The problem is that I used it a lot, but it doesn't build against current master anymore.

phil2sat avatar Oct 13 '25 08:10 phil2sat

Thanks for the heads up, I'll try to look into resolving these conflicts this week.

stduhpf avatar Oct 13 '25 17:10 stduhpf

@phil2sat it builds now. I have only tried with sd1.5 so far, but it seems to work fine.

stduhpf avatar Oct 15 '25 08:10 stduhpf

Works with qwen-image (it looks like qwen-image's VAE is almost identical to wan2.1's: not just the architecture, but the latent space too).

stduhpf avatar Oct 15 '25 12:10 stduhpf

Can confirm, it works. Even with qwen-image.

The huge problem is that progress on master is so fast that I had to patch it manually to merge only a couple of hours later.

@leejet, please add this feature, it's so essential for testing and debugging. With new models, samplers, and schedulers I always watch the generation live, so I can see exactly at which step something goes wrong.

As it happens, I have a problem with qwen-image-edit on ComfyUI: at step 5 something goes weird in my workflow, and without a preview this is impossible to track.

I know that with a $5K GPU you can just wait a few seconds, but if every step takes 30 s, the preview shows when to abort a generation to save power and time.

phil2sat avatar Oct 16 '25 05:10 phil2sat

When the pending issue is resolved, I will merge this pull request.

leejet avatar Oct 16 '25 13:10 leejet

@leejet, please...

phil2sat avatar Oct 19 '25 18:10 phil2sat

Before this is merged, should I rename the "proj" preview method to "latent2rgb", as it's called in ComfyUI?

stduhpf avatar Oct 25 '25 19:10 stduhpf

I think the naming doesn’t really matter. Once the potential license issue I mentioned in the review comments is resolved, this PR can be merged.

leejet avatar Oct 26 '25 03:10 leejet

> I think the naming doesn’t really matter. Once the potential license issue I mentioned in the review comments is resolved, this PR can be merged.

@leejet , your comments aren't showing up for me. But I guess you could be referring to where the projection matrices come from?

wbruna avatar Oct 26 '25 11:10 wbruna

> > I think the naming doesn’t really matter. Once the potential license issue I mentioned in the review comments is resolved, this PR can be merged.
>
> @leejet , your comments aren't showing up for me. But I guess you could be referring to where the projection matrices come from?

I'm not seeing them either; I was very confused.

stduhpf avatar Oct 26 '25 11:10 stduhpf

But I guess that to avoid any licensing issues, I could just train the projection matrices myself. It kinda feels like reinventing a perfectly working wheel, though.

stduhpf avatar Oct 26 '25 13:10 stduhpf

It could be argued that the matrices are just the product of an algorithm (training, a simple least-squares approximation, etc), and thus not restricted by copyright.

The problem is the "arguing" part 😕 Even if that argument is sound (and I personally believe it is), sidestepping the issue through an independent implementation would completely avoid that kind of headache.

wbruna avatar Oct 26 '25 13:10 wbruna

My original review comment:

There may be potential licensing issues here: ComfyUI is under the GPL license, while sd.cpp is under the MIT license. Unless it can be proven that this data is not exclusive to ComfyUI and instead comes from a permissively licensed source, there could be conflicts. For example, the mean/std values in Wan2.2 come directly from the official Wan2.2 repository (https://github.com/Wan-Video/Wan2.2/blob/main/wan/modules/vae2_2.py#L904), which is licensed under the Apache License 2.0.

leejet avatar Oct 26 '25 14:10 leejet

As far as I know, algorithms themselves are not protected by copyright law — only the specific source code implementations are. Therefore, rewriting the Python code in C++ does not trigger the GPL restrictions. However, directly copying data embedded in the original code may fall under the GPL if that data is original or creative in nature. If the data consists solely of factual or non-creative information, then it is generally not subject to copyright protection and thus not restricted by the GPL.

leejet avatar Oct 26 '25 14:10 leejet

SD3's projection was taken directly from the official inference code (MIT). For the others I'm pretty sure the data is distilled from the VAEs. I don't think it counts as "creative", but if we really want to be extra safe, we could re-train them. As far as I know, ComfyUI doesn't say where these weights come from.

stduhpf avatar Oct 26 '25 15:10 stduhpf

Ok, I'm doing it. It will take some time to get them for all supported VAEs because my process might not be the most efficient, but here are the weights I came up with for the sd1 VAE already:

// Affine latent->RGB projection for the sd1 VAE: 4 latent channels in, 3 RGB channels out.
const float sd_latent_rgb_proj[4][3] = {
    {0.303418f, 0.205030f, 0.223200f},
    {0.158560f, 0.272113f, 0.092085f},
    {-0.229890f, 0.170979f, 0.213735f},
    {-0.155664f, -0.226876f, -0.498111f}};
const float sd_latent_rgb_bias[3] = {-0.054481f, -0.125704f, -0.211548f};
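To show how such a projection gets used, here is a minimal sketch of turning a 4-channel latent into an 8-bit RGB preview; the function name, memory layout, and clamping are my assumptions, not the PR's actual code:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: apply an affine latent->RGB projection like the one above.
// Assumes a channel-first latent (4 planes of w*h floats) and that the fitted
// projection already produces values in [0, 1].
std::vector<uint8_t> latent_to_rgb(const float* latent, int w, int h,
                                   const float proj[4][3], const float bias[3]) {
    std::vector<uint8_t> rgb(w * h * 3);
    for (int i = 0; i < w * h; i++) {
        for (int c = 0; c < 3; c++) {
            float v = bias[c];
            for (int k = 0; k < 4; k++) {
                v += latent[k * w * h + i] * proj[k][c];
            }
            v = std::clamp(v, 0.0f, 1.0f);                  // clamp to the displayable range
            rgb[i * 3 + c] = (uint8_t)(v * 255.0f + 0.5f);  // quantize to a byte
        }
    }
    return rgb;
}

Each latent pixel becomes one preview pixel, so for these 8x-downscaling VAEs the preview is an 8x-smaller image, which is what makes it cheap enough to write out at every step.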

My way of training it is to take a very large set of varied images (my output folder), encode them with the VAE to get the latents while also downscaling the images (using the RMS value of the 8x8 patches for gamma-correct-ish downscaling), and put the average RGB and latent channels into a very big CSV (>900 MB). Then I take a large random sample of these rows, biased towards the more saturated colors (otherwise the previews come out washed out), and do a least-squares regression with a bias term.
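The final fitting step of that process could look something like the following minimal sketch (plain normal-equations least squares, with the bias folded in as a constant column; the saturation-biased sampling is left out, and none of this is the actual training script):

#include <array>
#include <cmath>
#include <utility>
#include <vector>

// One CSV row: average latent channels and average RGB of a downscaled patch.
struct Row {
    std::array<float, 4> latent;
    std::array<float, 3> rgb;
};

// Fit rgb ~= proj^T * latent + bias by solving the normal equations
// (A^T A) x = A^T b, where each row of A is [latent0..latent3, 1]
// (the trailing 1 absorbs the bias term).
void fit_projection(const std::vector<Row>& rows, float proj[4][3], float bias[3]) {
    double ata[5][5] = {};
    double atb[5][3] = {};
    for (const Row& r : rows) {
        const double a[5] = {r.latent[0], r.latent[1], r.latent[2], r.latent[3], 1.0};
        for (int i = 0; i < 5; i++) {
            for (int j = 0; j < 5; j++) ata[i][j] += a[i] * a[j];
            for (int c = 0; c < 3; c++) atb[i][c] += a[i] * r.rgb[c];
        }
    }
    // Gauss-Jordan elimination with partial pivoting on the 5x5 system.
    for (int col = 0; col < 5; col++) {
        int piv = col;
        for (int r = col + 1; r < 5; r++)
            if (std::fabs(ata[r][col]) > std::fabs(ata[piv][col])) piv = r;
        std::swap(ata[col], ata[piv]);
        std::swap(atb[col], atb[piv]);
        for (int r = 0; r < 5; r++) {
            if (r == col) continue;
            double f = ata[r][col] / ata[col][col];
            for (int j = 0; j < 5; j++) ata[r][j] -= f * ata[col][j];
            for (int c = 0; c < 3; c++) atb[r][c] -= f * atb[col][c];
        }
    }
    // The system is now diagonal; read off the solution.
    for (int c = 0; c < 3; c++) {
        for (int k = 0; k < 4; k++) proj[k][c] = (float)(atb[k][c] / ata[k][k]);
        bias[c] = (float)(atb[4][c] / ata[4][4]);
    }
}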

Here it is compared to the one from ComfyUI (labeled as "old") and to a "ground truth" downscaling of the decoded image (which is probably not achievable with a simple affine regression like this one):

[Images: old projection | downscaled original | new projection, shown at 1x and 8x, plus the full-res original.]

Edit: the results I'm getting with the SDXL VAE aren't as good for some reason (visibly worse than ComfyUI's, but still usable). The Flux one seems good.

stduhpf avatar Oct 27 '25 00:10 stduhpf

Ok, I updated all the latent-to-RGB projections except for sd3.x.

Only the SDXL projection feels like a small downgrade; everything else seems about on par with or better than the previous version.

I trained Wan 2.1's and 2.2's projections on still images only, but they seem to handle motion fine (not perfect, but good enough for now).

stduhpf avatar Oct 28 '25 14:10 stduhpf

@leejet Is there anything left you'd like me to change? I feel like it's pretty much ready.

stduhpf avatar Oct 30 '25 15:10 stduhpf

Thank you for your contribution. I will find time to review and merge this PR.

leejet avatar Nov 03 '25 13:11 leejet