Latent upscaler
WIP.
The upscaler model does not predict the noise, but the denoised image. Therefore it needs to use schedulers that support `predict_epsilon=False`. In addition, the noise schedule is configured from log-linear intervals on the sigmas, not the betas. I'm currently using `EulerDiscreteScheduler` for testing: I added support for `predict_epsilon` to it, and it has a `sigmas` property that the code sets. We need to make sure we can instantiate schedulers appropriately (with a custom beta schedule, not by setting the sigmas).
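For reference, here is a minimal sketch of how an Euler step changes depending on the prediction type; this is just the underlying math, not the actual `EulerDiscreteScheduler` implementation:

```python
# Sketch only: epsilon prediction vs. denoised-sample prediction in a sigma-based Euler step.
def euler_step(sample, model_output, sigma, sigma_next, predict_epsilon=True):
    if predict_epsilon:
        # the model predicted the noise: recover the denoised estimate from it
        pred_original_sample = sample - sigma * model_output
    else:
        # upscaler case: the model output already is the denoised estimate
        pred_original_sample = model_output
    derivative = (sample - pred_original_sample) / sigma
    return sample + derivative * (sigma_next - sigma)
```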
I haven't converted the upscaler model and weights to diffusers yet. The current code downloads the weights in their original format from a temporary repo at https://huggingface.co/pcuenq/k-upscaler. That repo also includes one of the fine-tuned VAEs that were released recently. This makes it easier to compare with the original code. The repo also has copies of the Stable Diffusion text encoder, tokenizer and safety checker.
TODO:
- [ ] Convert upscaler model to diffusers.
- [ ] Come up with a sensible way to initialize the schedulers.
- [ ] Simplify the code and make it follow our style. It currently uses some helper classes copied from the original.
- [ ] Make other pipelines support `latents` output. It only works for `StableDiffusionPipeline` right now.
- [ ] End-to-end community pipeline that composes `StableDiffusionPipeline` + `StableDiffusionUpscalerPipeline`.
How to test:
```python
from diffusers import StableDiffusionPipeline, StableDiffusionUpscalerPipeline

# Temporary repo. It differs from Stable Diffusion in just a few things:
# - It uses a different scheduler (EulerDiscreteScheduler for now)
# - It uses the fine-tuned VAE
# - It uses a custom upscaler (not yet converted to Hugging Face diffusers)
upscaler_id = "pcuenq/k-upscaler"
device = "cuda"

text2img = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
upscaler = StableDiffusionUpscalerPipeline.from_pretrained(upscaler_id).to(device)

prompt = "Labrador in the style of Vermeer"
latents = text2img(prompt, output_type="latents").images
image_1024 = upscaler(latents, prompt).images[0]
image_1024.save("test_1024.png")
```
Questions
- Should we create a separate model repo for this pipeline, or push the new components to the Stable Diffusion one(s)? The new components are:
  - The upscaler. It's another `unet`, but we need to call it something else, like `upscaler`.
  - The upscaler scheduler.

I think it's cleaner to use a new model repo, not sure if under stabilityai or Katherine's personal profile (if she has one).
I've started looking at the UNet conversion and I think it requires some code changes in diffusers, unless I'm missing something. These are my notes so far:
- The upscaler uses the `gelu` activation function. This is easy enough to add here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py#LL416-L421
- One of the blocks in the UNet uses self attention and cross attention. See `self_attn_depths` and `cross_attn_depths` in https://models.rivershavewings.workers.dev/config_laion_text_cond_latent_upscaler_2.json, and the code here: https://github.com/crowsonkb/k-diffusion/blob/f4e99857772fc3a126ba886aadf795a332774878/k_diffusion/models/image_v1.py#LL37-L42. Is this something we support?
- There are several types of conditioning. Following the terminology in k-diffusion:
  - `unet_cond` are the low resolution latents. They are fed into the network as 4 additional channels. This is doable, as in the in-painting pipeline (see the sketch below).
  - `cross_cond` is the usual conditioning on text embeddings.
  - `mapping_cond` is special. It comes from the pooler output of CLIP, with the `low_res_noise_embed` concatenated.
  - `low_res_noise_embed` is intended to be used when we add noise to the downscaled latents to make them more in-distribution, in theory. But according to the original code it doesn't work too well, so it's usually set to zero.
  - `timestep_embed` is used in this model to carry the sigma of the noise added at each step of the process.
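For the `unet_cond` case, a minimal sketch of the channel concatenation (shapes and variable names are placeholders, not the pipeline's actual code):

```python
import torch

# Noisy latents being denoised and the encoded low-resolution image (`unet_cond`).
noisy_latents = torch.randn(1, 4, 128, 128)    # placeholder shape
low_res_latents = torch.randn(1, 4, 128, 128)  # placeholder shape

# Concatenate along the channel dimension, as in the in-painting pipeline,
# so the UNet sees 4 + 4 = 8 input channels.
unet_input = torch.cat([noisy_latents, low_res_latents], dim=1)
```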
A dictionary is prepared with those conditioning signals and the following keys: `cond` (contains a projection of `timestep_embed` + `mapping_cond_embed`), `cross` (the text embeddings), and `cross_padding` (the text attention masks; actually `1 - masks`, I think).
The interesting thing is that this dictionary is passed to the forward method of all the modules in the UNet. Some of them use it, others don't. The ones that use it seem to be:
- `AdaGN`, a special group normalization layer that uses key `cond`: https://github.com/crowsonkb/k-diffusion/blob/f4e99857772fc3a126ba886aadf795a332774878/k_diffusion/layers.py#L97 (sketched below).
- `SelfAttention2d` uses `cond` in its own norm layers too.
- `CrossAttention2d` uses `cross` and `cross_padding`. We only use the equivalent of `cross`, I think?
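For context, here is a rough sketch of what the `AdaGN` layer does, as I read the linked code (this is a paraphrase with placeholder names, not a verbatim copy): group-normalize the features, then apply a per-channel scale and shift computed from `cond`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaGNSketch(nn.Module):  # illustrative re-implementation; see the link above for the original
    def __init__(self, cond_dim, channels, num_groups, eps=1e-5):
        super().__init__()
        self.num_groups = num_groups
        self.eps = eps
        self.mapper = nn.Linear(cond_dim, channels * 2)

    def forward(self, x, cond):
        # per-sample scale and shift derived from the `cond` embedding
        scale, shift = self.mapper(cond)[..., None, None].chunk(2, dim=1)
        x = F.group_norm(x, self.num_groups, eps=self.eps)
        return x * (scale + 1) + shift
```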
In addition to that, I think we'll also need a new model class that prepares the inputs and applies some projections. Not sure if we'll need to create a new version of `UNet2DConditionModel` or maybe just a wrapper.
TL;DR: is this something we usually encounter when converting models, or am I following the wrong direction here?
Hey @pcuenca,
Yes, I think we should adapt `UNet2DConditionModel` here to fit the use case. I think it'd make sense in this PR to first do a hacky adaptation of `UNet2DConditionModel` to make it work, and then we'll clean it up afterwards.
Regarding:
- `mapping_cond`: do we need this in a first iteration, or can we just leave it at 0?
- `low_res_noise_embed`: let's leave it at 0
Let me know once you'd like a final review :-)
@patrickvonplaten @pcuenca Can I work on this?
I went through this model during my break and want to give it a try. If it's ok with you I will open a new PR :)
@yiyixuxu That sounds great to me if @patrickvonplaten agrees! I wrote some notes above after I looked into this, but essentially these are, in my mind, the main parts:
- Special log-linear scheduler. I wrote code to generate the sigmas, but we could do without it by simply saving the generated sigmas in the config file as a numpy array (see the sketch after this list).
- UNet changes: we've made several modifications to the UNet since I looked into this, so it might be better prepared for this now.
- Conditioning channels.
- Return the output as raw latents using a special output type.
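Regarding the log-linear schedule, here is a minimal sketch of the sigma generation (the `sigma_min`/`sigma_max` defaults below are placeholders, not the values from the k-diffusion config); the resulting array could also simply be stored in the scheduler config, as suggested above:

```python
import numpy as np

def loglinear_sigmas(num_steps, sigma_min=0.03, sigma_max=10.0):  # placeholder bounds
    # Log-linearly spaced sigmas from sigma_max down to sigma_min,
    # with a final 0.0 appended for the last denoising step.
    sigmas = np.exp(np.linspace(np.log(sigma_max), np.log(sigma_min), num_steps))
    return np.append(sigmas, 0.0)
```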
I'm really sorry I didn't follow through with this issue, but I'm happy to help if you take it!
(Reopening for discussion, feel free to close if you open a new one).
It would be great if you could take this one @yiyixuxu !
Let @pcuenca or me know if you have any questions!
Thanks @pcuenca @patrickvonplaten! I'm excited to work on this.
I will start with the UNet. I've started to go through `UNet2DConditionModel` in detail and will compile a list of changes that need to be made as a starting point.
How can I learn more about converting checkpoints? Is this the script I should study and try to adapt? https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py
So far I've found 4 changes we need to make to `UNet2DConditionModel`. Listing them here; I will go over them in detail:

1. the process to create `emb`
2. the pre-processing step, i.e. `self.conv_in`
3. no middle block
4. the post-processing step
(1) The biggest change we need to make is the process to create `emb`.

As @pcuenca pointed out, the upscaler prepares a special conditional embedding `cond`, which contains a projection of `timestep_embed + mapping_cond`:

- `timestep_embed` is the conditioning on the sigma of the noise added in forward diffusion;
- `mapping_cond` is the upscaler's special conditioning on the text (but it uses the pooler output of CLIP, so it's different from `encoder_hidden_states`) and on the noise added to the low resolution latents.

I think `cond` is the equivalent of the timestep embedding `emb` in our `UNet2DConditionModel`, and we can use the same process to create it with a few modifications (note that `mapping_cond` will need to be created in the wrapper model and passed to the unet directly as `class_label`).
As a reference, in `UNet2DConditionModel` the process to create `emb` is:

```
timestep    -> self.time_proj() -> self.time_embedding() -> emb
class_label -> self.class_embedding() ----------------------➕⤴
```
- We should pass sigmas directly as `timestep` (this does not require code changes in the unet).
- Support `GaussianFourierProjection` for the `self.time_proj` layer.
  - Currently we use sinusoidal position embeddings (the `Timesteps` class) to encode timesteps; @crowsonkb uses a Fourier features layer for the upscaler, and I think it is already implemented in diffusers (the `GaussianFourierProjection` class), so we should support both in the unet.
- The `self.time_embedding` layer in our unet is `linear -> act -> linear`; its equivalent in the upscaler is `linear -> gelu -> linear -> gelu`. We need to make some adjustments in the `embeddings.TimestepEmbedding` class accordingly: support `gelu` as `act_fn`, and add an optional `act2` layer.
- We can pass `mapping_cond` as `class_labels`.
  - In the unet, we put `class_labels` through its own projection layer and then add it to `emb`, i.e. `time_embedding(t_emb) + class_embedding(class_label)`.
  - In the upscaler, the embeddings are added together before passing through the projection layer; some pseudo-code would be `time_embedding(t_emb + class_label)`.
  - For now, to make it work, we can just set the `self.class_embedding` layer to be the same as `self.time_embedding` in the unet; it is not efficient, but mathematically equivalent.

A rough sketch of this combined embedding path follows below.
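For illustration, here is a minimal sketch of what this embedding path could look like; the class name, dimensions, and the extra `cond_proj` projection are my own placeholders, not an existing diffusers API:

```python
import math
import torch
import torch.nn as nn

class UpscalerTimeEmbedding(nn.Module):  # hypothetical name, not an existing diffusers class
    """Fourier features on sigma, conditioning added *before* the shared projection,
    and a linear -> gelu -> linear -> gelu MLP, as described above."""
    def __init__(self, fourier_dim=256, cond_dim=896, emb_dim=1024):  # placeholder dims
        super().__init__()
        # analogous to diffusers' GaussianFourierProjection
        self.register_buffer("freqs", torch.randn(fourier_dim // 2))
        self.cond_proj = nn.Linear(cond_dim, fourier_dim)  # bring mapping_cond to the same width
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, emb_dim), nn.GELU(),
            nn.Linear(emb_dim, emb_dim), nn.GELU(),
        )

    def forward(self, sigma, mapping_cond):
        f = 2 * math.pi * sigma[:, None] * self.freqs[None, :]
        t_emb = torch.cat([f.cos(), f.sin()], dim=-1)
        # upscaler style: add the conditioning before the shared projection,
        # i.e. time_embedding(t_emb + class_label) rather than
        # time_embedding(t_emb) + class_embedding(class_label)
        return self.mlp(t_emb + self.cond_proj(mapping_cond))
```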
(2) The pre-processing step, i.e. `self.conv_in`

The `conv_in` in the upscaler is a 1x1 conv, but in the unet it's 3x3. I think the easiest way to address the difference is to allow the user to pass an already pre-processed sample and skip this layer, i.e. do something like this:

```python
if sample.shape[1] == self.in_channels:
    sample = self.conv_in(sample)
```
(3) There is no middle block in the upscaler, so we should allow that in the unet. We can just skip it if `mid_block_type = None`, for example:
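A minimal sketch of the idea, assuming the unet simply stores `None` for `self.mid_block` when `mid_block_type=None`:

```python
# Sketch: only run the mid block when one was configured.
if self.mid_block is not None:
    sample = self.mid_block(sample, emb, encoder_hidden_states=encoder_hidden_states)
```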
(4) The post-processing step

In the upscaler it is a 1x1 conv; in the unet it's:

```python
sample = self.conv_norm_out(sample)
sample = self.conv_act(sample)
sample = self.conv_out(sample)
```
We should allow skipping this step? We will have to create an input argument for this, though.
Let me know your thoughts! @patrickvonplaten @pcuenca
I'm going to compare the resnet, self-attention, and cross-attention blocks next. I think we probably need to create a special `KResnetBlock2D`, and we definitely need to create a new `down_block_type` and `up_block_type`, but from what I understand these won't require changes in `UNet2DConditionModel` itself.
Thanks a lot for the summary @yiyixuxu!
Regarding:
(1) - your plan sounds great
(2) - I see what your idea is here :-) I think that's a nice idea, but if the upscaler model also needs a convolution layer + weights, it should/has to be stored inside `UNet2DConditionModel`. But if just the kernel is different, it could be as simple as creating a different conv layer here: https://github.com/huggingface/diffusers/blob/7101c7316b6f6d3f4e578f29c108533cb678a304/src/diffusers/models/unet_2d_condition.py#L133 (e.g. setting the kernel to 1 instead of 3?)
(3) - yes perfect!
(4) - Here again I think we can work with if-else statements, e.g.:

```python
if self.conv_norm_out is not None:
    sample = self.conv_norm_out(sample)
    sample = self.conv_act(sample)
    sample = self.conv_out(sample)
```
does this make sense?
Yes, it makes sense, thanks @patrickvonplaten! I had this goal in mind that I wanted to change as little code as possible - I guess what we actually want here is to make the UNet more flexible, so it can be configured to adapt to a wider range of use cases?
@patrickvonplaten
One more summary/questions and I think I'm ready to start implementing this :)
Here are the 3 basic abstractions that compose the downsample and upsample blocks in the upscaler UNet, and their closest counterparts in diffusers (I'm using 🔶 to indicate conv2d, 🔴 for self-attention, and 🍎 for cross-attention):
- 🔶🔶 `ResConvBlock` ~ `ResnetBlock2D` (in 🧨)
- 🔴 `SelfAttention2d` ~ `CrossAttention` (in 🧨)
- 🍎 `CrossAttention2d` ~ `CrossAttention` (in 🧨)
The attention blocks are quite similar, but the resnet blocks are a little bit different. Do we want to adapt it, or do we want to create a new `ResnetBlock2D`?
Here is a comparison of the diffusers `ResnetBlock2D` and the upscaler `ResConvBlock`:

`ResnetBlock2D` 🧨

```
          ↗---------> skip (Identity or conv) ------------------------------------------↘
input -> norm -> act -> conv1🔶 -> norm -> scale_shift -> act -> dropout -> conv2🔶 -> ➕ -> input
                                               ⬆
                                     emb -> act -> linear
```
`ResConvBlock` (upscaler)

```
          ↗----------------------------------------------> skip ---------------------------------------------------↘
input -> norm -> scale_shift -> gelu -> conv1🔶 -> dropout -> norm -> scale_shift -> gelu -> conv2🔶 -> dropout -> ➕ -> input
                     ⬆                                                   ⬆
               emb -> linear                                       emb -> linear
```
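For reference, here is a rough, self-contained sketch of the upscaler-style block as I read the diagram above (the class name, group count, and dimensions are placeholders, not the k-diffusion implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResConvBlockSketch(nn.Module):  # illustrative only
    def __init__(self, cond_dim, c_in, c_mid, c_out, num_groups=32, dropout=0.0):
        super().__init__()
        self.norm1 = nn.GroupNorm(num_groups, c_in, affine=False)
        self.cond1 = nn.Linear(cond_dim, c_in * 2)   # AdaGN-style scale/shift for stage 1
        self.conv1 = nn.Conv2d(c_in, c_mid, 3, padding=1)
        self.norm2 = nn.GroupNorm(num_groups, c_mid, affine=False)
        self.cond2 = nn.Linear(cond_dim, c_mid * 2)  # AdaGN-style scale/shift for stage 2
        self.conv2 = nn.Conv2d(c_mid, c_out, 3, padding=1)
        self.dropout = nn.Dropout(dropout)
        self.skip = nn.Identity() if c_in == c_out else nn.Conv2d(c_in, c_out, 1, bias=False)

    def _ada_gn(self, x, cond, norm, mapper):
        scale, shift = mapper(cond)[..., None, None].chunk(2, dim=1)
        return norm(x) * (scale + 1) + shift

    def forward(self, x, cond):
        # norm -> scale_shift -> gelu -> conv -> dropout, twice, then add the skip connection
        h = self.dropout(self.conv1(F.gelu(self._ada_gn(x, cond, self.norm1, self.cond1))))
        h = self.dropout(self.conv2(F.gelu(self._ada_gn(h, cond, self.norm2, self.cond2))))
        return h + self.skip(x)
```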
Each block contains multiple layers of the same structure, just like in `UNet2DConditionModel`. We have 3 types of layers (I think their unet allows all possible combinations of resnet, self-attention, and cross-attention blocks, but only these 3 are used in the config for the pre-trained model):

- 🔶🔶 `ResConvBlock`
- 🔶🔶 `ResConvBlock` -> 🍎 `CrossAttention2d`
- 🔶🔶 `ResConvBlock` -> 🔴 `SelfAttention2d` -> 🍎 `CrossAttention2d`
I think we can use `DownBlock2D` for the first one, and `SimpleCrossAttnDownBlock2D` for the second one?

I don't think we have an existing `down_block_type`/`up_block_type` that includes resnet, self-attention, and cross-attention - should we adapt `SimpleCrossAttnDownBlock2D` or should we create a new block type?
Hi @yiyixuxu, great analysis!
I felt the same concern about modifying the UNet, but Patrick clarified that it's perfectly fine to make it more flexible as new needs arise :)
Regarding your last question about the down/up blocks, my initial instinct would be to make it work using the simplest code we can (even with hardcoded stuff), and then decide whether it makes sense to incorporate the logic inside the existing blocks. So personally I'd create separate blocks for now, check that the model outputs match for the same set of inputs, and then study the code differences. This would be just my personal approach to tackle this, you might prefer to follow a different path :)
Thanks @pcuenca, I think it's an excellent suggestion! I'd like to use the existing API for blocks, but I can wait to do that until after I get everything working :)
@yiyixuxu to begin with, I'd try to get everything working by adapting existing classes. If we then see in the PR that things are becoming too different, we could still change it afterward :-)
@patrickvonplaten I've been creating new blocks for the K-upscaler, but they are all adapted from existing classes, and I've been writing them in a way that should make it easy to merge them in if we decide to do so.
It stressed me a little bit to change the existing API because I don't know well enough how other models use these blocks, and I'm also worried that too many if-else statements would make the code too complex and hard to read - that's something I need more guidance on, so I would feel a lot more comfortable having a review first. The UNet should be ready for review early next week :)
Hey @yiyixuxu,
It's absolutely fine to adapt existing code and potentially break a use case. I would recommend first adapting existing code to whatever you need to make it work (without thinking much about potentially breaking something), and once you make it work, we can review and correct things.
Design questions really only start to come into play once we have a working (potentially hacky/dirty) reference implementation :-)