Latent upscaler
WIP.
The upscaler model does not predict the noise, but the denoised image. Therefore it needs to use schedulers that support `predict_epsilon=False`. In addition, the noise schedule is configured from log-linear intervals on the sigmas, not the betas. I'm currently using `EulerDiscreteScheduler` for testing: I added support for `predict_epsilon` to it, and it has a `sigmas` property that the code sets. We need to make sure we can instantiate schedulers appropriately (with a custom beta schedule, not by setting the sigmas).
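For reference, here is a minimal sketch of how an Euler step changes depending on the prediction type; this is just the underlying math, not the actual `EulerDiscreteScheduler` implementation:

```python
# Sketch only: epsilon prediction vs. denoised-sample prediction in a sigma-based Euler step.
def euler_step(sample, model_output, sigma, sigma_next, predict_epsilon=True):
    if predict_epsilon:
        # the model predicted the noise: recover the denoised estimate from it
        pred_original_sample = sample - sigma * model_output
    else:
        # upscaler case: the model output already is the denoised estimate
        pred_original_sample = model_output
    derivative = (sample - pred_original_sample) / sigma
    return sample + derivative * (sigma_next - sigma)
```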
I haven't converted the upscaler model and weights to diffusers yet. The current code downloads the weights in their original format from a temporary repo at https://huggingface.co/pcuenq/k-upscaler. That repo also includes one of the fine-tuned VAEs that were released recently. This makes it easier to compare with the original code. The repo also has copies of the Stable Diffusion text encoder, tokenizer and safety checker.
TODO:
- [ ] Convert upscaler model to diffusers.
- [ ] Come up with a sensible way to initialize the schedulers.
- [ ] Simplify the code and make it follow our style. It currently uses some helper classes copied from the original.
- [ ] Make other pipelines support `latents` output. It only works for `StableDiffusionPipeline` right now.
- [ ] End-to-end community pipeline that composes `StableDiffusionPipeline` + `StableDiffusionUpscalerPipeline`.
How to test:
```python
from diffusers import StableDiffusionPipeline, StableDiffusionUpscalerPipeline

# Temporary repo. It differs from Stable Diffusion in just a few things:
# - It uses a different scheduler (EulerDiscreteScheduler for now)
# - It uses the fine-tuned VAE
# - It uses a custom upscaler (not yet converted to Hugging Face diffusers)
upscaler_id = "pcuenq/k-upscaler"
device = "cuda"

text2img = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
upscaler = StableDiffusionUpscalerPipeline.from_pretrained(upscaler_id).to(device)

prompt = "Labrador in the style of Vermeer"
latents = text2img(prompt, output_type="latents").images
image_1024 = upscaler(latents, prompt).images[0]
image_1024.save("test_1024.png")
```
Questions
- Should we create a separate model repo for this pipeline, or push the new components to the Stable Diffusion one(s)? The new components are:
  - The upscaler. It's another `unet`, but we need to call it something else, like `upscaler`.
  - The upscaler scheduler.

I think it's cleaner to use a new model repo, not sure if under stabilityai or Katherine's personal profile (if she has one).
I've started looking at the UNet conversion and I think it requires some code changes in diffusers, unless I'm missing something. These are my notes so far:
- The upscaler uses the `gelu` activation function. This is easy enough to add here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py#LL416-L421
- One of the blocks in the UNet uses self attention and cross attention. See `self_attn_depths` and `cross_attn_depths` in https://models.rivershavewings.workers.dev/config_laion_text_cond_latent_upscaler_2.json, and the code here: https://github.com/crowsonkb/k-diffusion/blob/f4e99857772fc3a126ba886aadf795a332774878/k_diffusion/models/image_v1.py#LL37-L42. Is this something we support?
- There are several types of conditioning. Following the terminology in k-diffusion:
  - `unet_cond` are the low resolution latents. They are fed into the network as 4 additional channels. This is doable, as in the in-painting pipeline (see the sketch below).
  - `cross_cond` is the usual conditioning on text embeddings.
  - `mapping_cond` is special. It comes from the pooler output of CLIP, with the `low_res_noise_embed` concatenated.
  - `low_res_noise_embed` is intended to be used when we add noise to the downscaled latents to make them more in-distribution, in theory. But according to the original code it doesn't work too well, so it's usually set to zero.
  - `timestep_embed` is used in this model to carry the sigma of the noise added at each step of the process.
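For the `unet_cond` case, a minimal sketch of the channel concatenation (shapes and variable names are placeholders, not the pipeline's actual code):

```python
import torch

# Noisy latents being denoised and the encoded low-resolution image (`unet_cond`).
noisy_latents = torch.randn(1, 4, 128, 128)    # placeholder shape
low_res_latents = torch.randn(1, 4, 128, 128)  # placeholder shape

# Concatenate along the channel dimension, as in the in-painting pipeline,
# so the UNet sees 4 + 4 = 8 input channels.
unet_input = torch.cat([noisy_latents, low_res_latents], dim=1)
```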
A dictionary is prepared with those conditioning signals and the following keys: `cond` (contains a projection of `timestep_embed` + `mapping_cond_embed`), `cross` (the text embeddings), and `cross_padding` (the text attention masks; actually `1 - masks`, I think).
The interesting thing is that this dictionary is passed to the forward method of all the modules in the UNet. Some of them use it, others don't. The ones that use it seem to be:
- `AdaGN`, a special group normalization layer that uses key `cond`: https://github.com/crowsonkb/k-diffusion/blob/f4e99857772fc3a126ba886aadf795a332774878/k_diffusion/layers.py#L97 (sketched below).
- `SelfAttention2d` uses `cond` in its own norm layers too.
- `CrossAttention2d` uses `cross` and `cross_padding`. We only use the equivalent of `cross`, I think?
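For context, here is a rough sketch of what the `AdaGN` layer does, as I read the linked code (this is a paraphrase with placeholder names, not a verbatim copy): group-normalize the features, then apply a per-channel scale and shift computed from `cond`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaGNSketch(nn.Module):  # illustrative re-implementation; see the link above for the original
    def __init__(self, cond_dim, channels, num_groups, eps=1e-5):
        super().__init__()
        self.num_groups = num_groups
        self.eps = eps
        self.mapper = nn.Linear(cond_dim, channels * 2)

    def forward(self, x, cond):
        # per-sample scale and shift derived from the `cond` embedding
        scale, shift = self.mapper(cond)[..., None, None].chunk(2, dim=1)
        x = F.group_norm(x, self.num_groups, eps=self.eps)
        return x * (scale + 1) + shift
```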
In addition to that, I think we'll also need a new model class that prepares the inputs and applies some projections. Not sure if we'll need to create a new version of `UNet2DConditionModel` or maybe just a wrapper.
TL;DR: is this something we usually encounter when converting models, or am I following the wrong direction here?
Hey @pcuenca,
Yes, I think we should adapt `UNet2DConditionModel` here to fit the use case. I think it'd make sense in this PR to first do a hacky adaptation of `UNet2DConditionModel` to make it work, and then we'll clean it up afterwards.
Regarding:
- `mapping_cond`: do we need this in a first iteration, or can we just leave it at 0?
- `low_res_noise_embed`: let's leave it at 0
Let me know once you'd like a final review :-)
@patrickvonplaten @pcuenca Can I work on this?
I went through this model during my break and want to give it a try. If it's ok with you I will open a new PR :)
@yiyixuxu That sounds great to me if @patrickvonplaten agrees! I wrote some notes above after I looked into this, but essentially these are, in my mind, the main parts:
- Special log-linear scheduler. I wrote code to generate the sigmas, but we could do without it by simply saving the generated sigmas in the config file as a numpy array (see the sketch after this list).
- UNet changes: we've made several modifications to the UNet since I looked into this, so it might be better prepared for this now.
- Conditioning channels.
- Return the output as raw latents using a special output type.
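Regarding the log-linear schedule, here is a minimal sketch of the sigma generation (the `sigma_min`/`sigma_max` defaults below are placeholders, not the values from the k-diffusion config); the resulting array could also simply be stored in the scheduler config, as suggested above:

```python
import numpy as np

def loglinear_sigmas(num_steps, sigma_min=0.03, sigma_max=10.0):  # placeholder bounds
    # Log-linearly spaced sigmas from sigma_max down to sigma_min,
    # with a final 0.0 appended for the last denoising step.
    sigmas = np.exp(np.linspace(np.log(sigma_max), np.log(sigma_min), num_steps))
    return np.append(sigmas, 0.0)
```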
I'm really sorry I didn't follow through with this issue, but I'm happy to help if you take it!
(Reopening for discussion, feel free to close if you open a new one).
It would be great if you could take this one @yiyixuxu !
Let @pcuenca or me know if you have any questions!
Thanks @pcuenca @patrickvonplaten! I'm excited to work on this.
I will start with the UNet. I've started to go through `UNet2DConditionModel` in detail and will compile a list of changes that need to be made as a starting point.
How can I learn more about converting checkpoints? Is this the script I should study and try to adapt? https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py
So far I've found 4 changes we need to make to `UNet2DConditionModel`. Listing them here; I will go over them in detail:

1. the process to create `emb`
2. the pre-processing step, i.e. `self.conv_in`
3. no middle block
4. the post-processing step
(1) The biggest change we need to make is the process to create `emb`.

As @pcuenca pointed out, the upscaler prepares a special conditional embedding `cond`, which contains a projection of `timestep_embed + mapping_cond`:

- `timestep_embed` is the conditioning on the sigma of the noise added in forward diffusion;
- `mapping_cond` is the upscaler's special conditioning on the text (but it uses the pooler output of CLIP, so it's different from `encoder_hidden_states`) and on the noise added to the low resolution latents.

I think `cond` is the equivalent of the timestep embedding `emb` in our `UNet2DConditionModel`, and we can use the same process to create it with a few modifications (note that `mapping_cond` will need to be created in the wrapper model and passed to the unet directly as `class_label`).
As a reference, in `UNet2DConditionModel` the process to create `emb` is:

```
timestep    -> self.time_proj() -> self.time_embedding() -> emb
class_label -> self.class_embedding() ----------------------➕⤴
```
- We should pass sigmas directly as `timestep` (this does not require code changes in the unet).
- Support `GaussianFourierProjection` for the `self.time_proj` layer.
  - Currently we use sinusoidal position embeddings (the `Timesteps` class) to encode timesteps; @crowsonkb uses a Fourier features layer for the upscaler, and I think it is already implemented in diffusers (the `GaussianFourierProjection` class), so we should support both in the unet.
- The `self.time_embedding` layer in our unet is `linear -> act -> linear`; its equivalent in the upscaler is `linear -> gelu -> linear -> gelu`. We need to make some adjustments in the `embeddings.TimestepEmbedding` class accordingly: support `gelu` as `act_fn`, and add an optional `act2` layer.
- We can pass `mapping_cond` as `class_labels`.
  - In the unet, we put `class_labels` through its own projection layer and then add it to `emb`, i.e. `time_embedding(t_emb) + class_embedding(class_label)`.
  - In the upscaler, the embeddings are added together before passing through the projection layer; some pseudo-code would be `time_embedding(t_emb + class_label)`.
  - For now, to make it work, we can just set the `self.class_embedding` layer to be the same as `self.time_embedding` in the unet; it is not efficient, but mathematically equivalent.

A rough sketch of this combined embedding path follows below.
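For illustration, here is a minimal sketch of what this embedding path could look like; the class name, dimensions, and the extra `cond_proj` projection are my own placeholders, not an existing diffusers API:

```python
import math
import torch
import torch.nn as nn

class UpscalerTimeEmbedding(nn.Module):  # hypothetical name, not an existing diffusers class
    """Fourier features on sigma, conditioning added *before* the shared projection,
    and a linear -> gelu -> linear -> gelu MLP, as described above."""
    def __init__(self, fourier_dim=256, cond_dim=896, emb_dim=1024):  # placeholder dims
        super().__init__()
        # analogous to diffusers' GaussianFourierProjection
        self.register_buffer("freqs", torch.randn(fourier_dim // 2))
        self.cond_proj = nn.Linear(cond_dim, fourier_dim)  # bring mapping_cond to the same width
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, emb_dim), nn.GELU(),
            nn.Linear(emb_dim, emb_dim), nn.GELU(),
        )

    def forward(self, sigma, mapping_cond):
        f = 2 * math.pi * sigma[:, None] * self.freqs[None, :]
        t_emb = torch.cat([f.cos(), f.sin()], dim=-1)
        # upscaler style: add the conditioning before the shared projection,
        # i.e. time_embedding(t_emb + class_label) rather than
        # time_embedding(t_emb) + class_embedding(class_label)
        return self.mlp(t_emb + self.cond_proj(mapping_cond))
```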
(2) The pre-processing step, i.e. `self.conv_in`

The `conv_in` in the upscaler is a 1x1 conv, but in the unet it's 3x3. I think the easiest way to address the difference is to allow the user to pass an already pre-processed sample and skip this layer, i.e. do something like this:

```python
if sample.shape[1] == self.in_channels:
    sample = self.conv_in(sample)
```
(3) There is no middle block in the upscaler, so we should allow that in the unet. We can just skip it if `mid_block_type = None`, for example:
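A minimal sketch of the idea, assuming the unet simply stores `None` for `self.mid_block` when `mid_block_type=None`:

```python
# Sketch: only run the mid block when one was configured.
if self.mid_block is not None:
    sample = self.mid_block(sample, emb, encoder_hidden_states=encoder_hidden_states)
```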
(4) The post-processing step

In the upscaler it is a 1x1 conv; in the unet it's:

```python
sample = self.conv_norm_out(sample)
sample = self.conv_act(sample)
sample = self.conv_out(sample)
```
We should allow skipping this step? We will have to create an input argument for this, though.
Let me know your thoughts! @patrickvonplaten @pcuenca
I'm going to compare the resnet, self-attention, and cross-attention blocks next. I think we probably need to create a special `KResnetBlock2D`, and we definitely need to create a new `down_block_type` and `up_block_type`, but from what I understand these won't require changes in `UNet2DConditionModel` itself.
Thanks a lot for the summary @yiyixuxu!
Regarding:
(1) - your plan sounds great
(2) - I see what your idea is here :-) I think that's a nice idea, but if the upscaler model also needs a convolution layer + weights, it should/has to be stored inside `UNet2DConditionModel`. But if just the kernel is different, it could be as simple as creating a different conv layer here: https://github.com/huggingface/diffusers/blob/7101c7316b6f6d3f4e578f29c108533cb678a304/src/diffusers/models/unet_2d_condition.py#L133 (e.g. setting the kernel to 1 instead of 3?)
(3) - yes perfect!
(4) - Here again I think we can work with if-else statements, e.g.:

```python
if self.conv_norm_out is not None:
    sample = self.conv_norm_out(sample)
    sample = self.conv_act(sample)
    sample = self.conv_out(sample)
```
does this make sense?
Yes, it makes sense, thanks @patrickvonplaten! I had this goal in mind that I wanted to change as little code as possible - I guess what we actually want here is to make the UNet more flexible, so it can be configured to adapt to a wider range of use cases?
@patrickvonplaten
One more summary/questions and I think I'm ready to start implementing this :)
Here are the 3 basic abstractions that compose the downsample and upsample blocks in the upscaler UNet, and their closest counterparts in diffusers (I'm using 🔶 to indicate conv2d, 🔴 for self-attention, and 🍎 for cross-attention):
- 🔶🔶 `ResConvBlock` ~ `ResnetBlock2D` (in 🧨)
- 🔴 `SelfAttention2d` ~ `CrossAttention` (in 🧨)
- 🍎 `CrossAttention2d` ~ `CrossAttention` (in 🧨)
The attention blocks are quite similar, but the resnet blocks are a little bit different. Do we want to adapt it, or do we want to create a new `ResnetBlock2D`?
Here is a comparison of the diffusers `ResnetBlock2D` and the upscaler `ResConvBlock`:

`ResnetBlock2D` 🧨

```
          ↗---------> skip (Identity or conv) ------------------------------------------↘
input -> norm -> act -> conv1🔶 -> norm -> scale_shift -> act -> dropout -> conv2🔶 -> ➕ -> input
                                               ⬆
                                     emb -> act -> linear
```
`ResConvBlock` (upscaler)

```
          ↗----------------------------------------------> skip ---------------------------------------------------↘
input -> norm -> scale_shift -> gelu -> conv1🔶 -> dropout -> norm -> scale_shift -> gelu -> conv2🔶 -> dropout -> ➕ -> input
                     ⬆                                                   ⬆
               emb -> linear                                       emb -> linear
```
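For reference, here is a rough, self-contained sketch of the upscaler-style block as I read the diagram above (the class name, group count, and dimensions are placeholders, not the k-diffusion implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResConvBlockSketch(nn.Module):  # illustrative only
    def __init__(self, cond_dim, c_in, c_mid, c_out, num_groups=32, dropout=0.0):
        super().__init__()
        self.norm1 = nn.GroupNorm(num_groups, c_in, affine=False)
        self.cond1 = nn.Linear(cond_dim, c_in * 2)   # AdaGN-style scale/shift for stage 1
        self.conv1 = nn.Conv2d(c_in, c_mid, 3, padding=1)
        self.norm2 = nn.GroupNorm(num_groups, c_mid, affine=False)
        self.cond2 = nn.Linear(cond_dim, c_mid * 2)  # AdaGN-style scale/shift for stage 2
        self.conv2 = nn.Conv2d(c_mid, c_out, 3, padding=1)
        self.dropout = nn.Dropout(dropout)
        self.skip = nn.Identity() if c_in == c_out else nn.Conv2d(c_in, c_out, 1, bias=False)

    def _ada_gn(self, x, cond, norm, mapper):
        scale, shift = mapper(cond)[..., None, None].chunk(2, dim=1)
        return norm(x) * (scale + 1) + shift

    def forward(self, x, cond):
        # norm -> scale_shift -> gelu -> conv -> dropout, twice, then add the skip connection
        h = self.dropout(self.conv1(F.gelu(self._ada_gn(x, cond, self.norm1, self.cond1))))
        h = self.dropout(self.conv2(F.gelu(self._ada_gn(h, cond, self.norm2, self.cond2))))
        return h + self.skip(x)
```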
Each block contains multiple layers of the same structure, just like in `UNet2DConditionModel`. We have 3 types of layers (I think their unet allows all possible combinations of resnet, self-attention, and cross-attention blocks, but only these 3 are used in the config for the pre-trained model):

- 🔶🔶 `ResConvBlock`
- 🔶🔶 `ResConvBlock` -> 🍎 `CrossAttention2d`
- 🔶🔶 `ResConvBlock` -> 🔴 `SelfAttention2d` -> 🍎 `CrossAttention2d`
I think we can use `DownBlock2D` for the first one, and `SimpleCrossAttnDownBlock2D` for the second one?

I don't think we have an existing `down_block_type`/`up_block_type` that includes resnet, self-attention, and cross-attention - should we adapt `SimpleCrossAttnDownBlock2D` or should we create a new block type?
Hi @yiyixuxu, great analysis!
I felt the same concern about modifying the UNet, but Patrick clarified that it's perfectly fine to make it more flexible as new needs arise :)
Regarding your last question about the down/up blocks, my initial instinct would be to make it work using the simplest code we can (even with hardcoded stuff), and then decide whether it makes sense to incorporate the logic inside the existing blocks. So personally I'd create separate blocks for now, check that the model outputs match for the same set of inputs, and then study the code differences. This would be just my personal approach to tackle this, you might prefer to follow a different path :)
Thanks @pcuenca, I think it's an excellent suggestion! I'd like to use the existing API for blocks, but I can wait to do that until after I get everything working :)
@yiyixuxu to begin with, I'd try to get everything working by adapting existing classes. If we then see in the PR that things are becoming too different, we could still change it afterward :-)
@patrickvonplaten I've been creating new blocks for the K-upscaler, but they are all adapted from existing classes, and I've been writing them in a way that should make it easy to merge them in if we decide to do so.
It stressed me a little bit to change the existing API because I don't know well enough how other models use these blocks, and I'm also worried that too many if-else statements would make the code too complex and hard to read - that's something I need more guidance on, so I would feel a lot more comfortable having a review first. The UNet should be ready for review early next week :)
Hey @yiyixuxu,
It's absolutely fine to adapt existing code and potentially break a use case. I would recommend first adapting existing code to whatever you need to make it work (without thinking much about potentially breaking something), and once you make it work, we can review and correct things.
Design questions really only start to come into play once we have a working (potentially hacky/dirty) reference implementation :-)