Latent downscaling
I'm working on a project that needs latent downscaling/resizing. If you have everything set up, could you quickly train a downscaler from 128x128x4 to 64x64x4 for SDXL's VAE? Otherwise, I'll do it myself.
Best, Chris
I don't really have a working version of this repo locally, but I think it should be simple enough to get it working again lol. I'll preprocess the latent dataset overnight and train one tomorrow if that works.
Any reason you'd want 1024->512 natively? Since it's a convnet, a 512->256 model should generalize to higher resolutions anyway, and it's probably way easier to train.
Haha amazing! Let me know how that goes :)
I trained CLIP models on latent images at SDXL-Turbo's latent resolution (paper here: https://arxiv.org/abs/2503.08455; we also have a demo repo). When trying to evaluate on latents generated by SDXL, we noticed that it is not straightforward to resize latents efficiently.
Ideally, I'd like to be able to resize all resolutions (or at least the most common latent resolutions of SDXL and SDXL-derived workflows) to 64x64x4, which is the native resolution of my latent CLIP models.
Well, it's not great. The current network really only seems decent for upscaling, probably because the nn.Upsample is right at the start, which in this case acts as a downscale and mangles the latent format before any learned layers see it. At least that's my best guess; it could also be that the training code is broken somewhere, since the version I got working is a weird mix between this repo and the latent interposer one.
There's also the issue of input size: the scale factor is a static x0.5, so 128x128 -> 64x64 works, but 160x96 would map to 80x48 rather than 64x64, unless you center crop/pad the input latent to 128x128 first (which in turn adds border artifacts).
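For reference, a rough sketch of what that crop/pad step could look like (hypothetical helper, assuming PyTorch; not code from this repo):

```python
import torch
import torch.nn.functional as F

def center_pad_or_crop(latent: torch.Tensor, size: int = 128) -> torch.Tensor:
    """Center-crop or zero-pad a (B, 4, H, W) latent to (B, 4, size, size).

    Note: zero-padding is what introduces the border artifacts mentioned
    above, since zeros aren't 'neutral' in SDXL's latent space.
    """
    _, _, h, w = latent.shape
    # Crop each spatial dim if it's too large
    if h > size:
        top = (h - size) // 2
        latent = latent[:, :, top:top + size, :]
    if w > size:
        left = (w - size) // 2
        latent = latent[:, :, :, left:left + size]
    # Pad each spatial dim if it's too small
    _, _, h, w = latent.shape
    pad_h, pad_w = size - h, size - w
    if pad_h > 0 or pad_w > 0:
        latent = F.pad(latent, (pad_w // 2, pad_w - pad_w // 2,
                                pad_h // 2, pad_h - pad_h // 2))
    return latent
```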
Almost wonder if it'd be worth redoing this repo/project with a modern-ish model arch and seeing if it can be improved.
Could you share this model so we can evaluate it? I guess it doesn't look great yet, but it's a good start!
As a reference, here are the kinds of results we got by applying image resizing techniques directly in the latent domain.
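By that I mean directly interpolating the 4-channel latent tensor, roughly like this (sketch, assuming PyTorch; the mode is the main knob):

```python
import torch
import torch.nn.functional as F

latent = torch.randn(1, 4, 128, 128)  # stand-in for an SDXL latent

# Naive baseline: treat the latent like a 4-channel image and interpolate.
# 'bilinear'/'bicubic' (optionally antialiased) and 'area' are the usual
# choices; all of them distort the latent statistics to some degree.
small = F.interpolate(latent, size=(64, 64), mode="bicubic", antialias=True)
```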
Yeah, maybe we'd want to change the architecture a bit: do the downscaling after increasing the channel dimension and running a few layers at that increased width. But I'm also not sure what the right choice is.
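Something like this, maybe (just a sketch of the idea, not a tested arch; layer counts and widths are placeholders):

```python
import torch.nn as nn

class LatentDownscaler(nn.Module):
    """Sketch: lift to a wider channel dim first, process at full
    resolution, then downscale with a strided conv, instead of
    resampling the raw 4-channel latent up front."""
    def __init__(self, ch: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, padding=1),             # expand channels first
            nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1),            # a few layers at full res
            nn.SiLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1),  # learned x0.5 downscale
            nn.SiLU(),
            nn.Conv2d(ch, 4, 3, padding=1),             # project back to latent
        )

    def forward(self, x):
        return self.net(x)
```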
> Could you share this model so we can evaluate it?
Sure, uploaded it here - this is the one that matches the original arch mentioned above, so it should be loadable with the current code, or with the comfy node if you modify where it looks for the model.
> Yeah, maybe we'd want to change the architecture a bit: do the downscaling after increasing the channel dimension and running a few layers at that increased width.
That's what I was thinking. I'll probably have to rewrite the training code because I'm not even super convinced the current one is calculating the eval loss correctly lol.
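For reference, the kind of eval loop I'd expect is roughly this (sketch only; `model` and `val_loader` are hypothetical stand-ins, with the loader yielding (128x128x4, 64x64x4) latent pairs):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_loss(model, val_loader, device="cuda"):
    model.eval()  # easy to forget; dropout/BN behave differently otherwise
    total, n = 0.0, 0
    for big, small in val_loader:
        pred = model(big.to(device))
        total += F.mse_loss(pred, small.to(device), reduction="sum").item()
        n += small.numel()
    model.train()
    return total / n  # mean MSE per latent element over the whole val set
```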
I also trained one of these resizers, but somehow my VAE-decoded small latents (including the ground truth) look suspicious.
It's here: https://github.com/wendlerc/latent_downscaling/tree/main
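(One common culprit for suspicious-looking decodes is the VAE scaling factor: SDXL latents are usually stored as z * 0.13025 and need to be unscaled before decoding. A minimal check, assuming diffusers and a scaled latent:)

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")

latent = torch.randn(1, 4, 64, 64)  # stand-in for a scaled SDXL latent

with torch.no_grad():
    # undo the scaling factor (0.13025 for SDXL) before decoding
    image = vae.decode(latent / vae.config.scaling_factor).sample
```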