diffusers icon indicating copy to clipboard operation
diffusers copied to clipboard

[Refactor] Stable Diffusion UNet Refactor Proposal

Open DN6 opened this issue 1 year ago • 6 comments

What does this PR do?

The The UNet2DConditionModel has become quite large is starting to become quite difficult to understand and maintain. In order to simplify this a bit, we can try to break it up into smaller, model specific UNets. This is a draft of the new proposed design.

Some design considerations:

  1. This UNet is designed to work with the following model types: Sd 1.5 and SDXL (including inpainting), Stable UnCLIP models.
  2. Parameters that are not relevant to these model types have been removed. We can further reduce the number of parameters of this UNet if we decide that SDXL and Stable UnCLIP should get their own UNets.

Known issues with this design:

  1. Any SD Pipeline using an Asymmetric UNet i.e. uses the reverse_transformer_layers_per_block parameter. Segmind UNets fall under this category and I think they should get their own UNet.
  2. UNet that doesn't use a Cross Attn Midblock or that does not have a mid block. IMO I think we can add the option to support multiple midblock types/no midblock.
  3. GLIGEN UNet that relies on additional model components. This should be it's own UNet.
  4. UNets with dual cross attention (these have almost no downloads)
  5. UNets that use class_embedding_concat . There isn't much usage for this parameter. And class embeddings are mainly used in the Stable UnCLIP UNet that just adds them to the time embedding

TODOs:

  1. I think the time embedding logic can be cleaned up a bit. It's a little confusing
  2. ControlNet and Adapter logic is still in this UNet. We have to decide how to handle them (whether to remove or not)
  3. PEFT backend checks are in there, but should be removed. Ideally this new design will be released after we deprecate the old lora backend.

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline?
  • [ ] Did you read our philosophy doc (important for complex PRs)?
  • [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

DN6 avatar Jan 04 '24 08:01 DN6

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Any SD Pipeline using an Asymmetric UNet i.e., uses the reverse_transformer_layers_per_block parameter. Segmind UNets fall under this category and I think they should get their own UNet.

I am not okay with this philosophy. SSD-1B is a direct distillation of SDXL. So, family-wise they share 99% of the similarities. I don't think SSD-1B should get its own UNet. It might create a bit too much of a modularity which we might not wat.

UNet that doesn't use a Cross Attn Midblock or that does not have a mid block. IMO I think we can add the option to support multiple midblock types/no midblock.

This works for me. But we have to do this in a new way that doesn't compromise with the linearity of how models are built in diffusers.

GLIGEN UNet that relies on additional model components. This should be it's own UNet.

100 percent. GLIGEN has got lots of different components and it deserves its own UNet.

For pt 4. and 5., we still need to find a way to support them even if via separate UNet, right? Perhaps, judging via the parameters that are less used and clubbing them under a particular config to have a dedicated class?

ControlNet and Adapter logic is still in this UNet. We have to decide how to handle them (whether to remove or not)

Since SD models support ControlNet, T2I Adapters, and IP Adapters, it might be difficult to fully remove them no?

sayakpaul avatar Jan 04 '24 09:01 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 03 '24 15:02 github-actions[bot]

Generally looks good to me. Some comments:

Any SD Pipeline using an Asymmetric UNet i.e. uses the reverse_transformer_layers_per_block parameter. Segmind UNets fall under this category and I think they should get their own UNet.

I don't think the asymmetric UNet should have it's own class. It should be supported here IMO. It a common use case by now and we have multiple powerful checkpoints by segmind that are asymmetric

  1. UNet that doesn't use a Cross Attn Midblock or that does not have a mid block. IMO I think we can add the option to support multiple midblock types/no midblock.

Agree - do we have a checkpoint of a model that doesn't have a midblock?

  1. GLIGEN UNet that relies on additional model components. This should be it's own UNet.
  2. UNets with dual cross attention (these have almost no downloads)

Agree.

  1. UNets that use class_embedding_concat . There isn't much usage for this parameter. And class embeddings are mainly used in the Stable UnCLIP UNet that just adds them to the time embedding

Is it just Stable UnCLIP? Did you also check all the reasonably often used upscaling unets (StableLatentUpscaling, StableUpscaling, ...) ?

patrickvonplaten avatar Feb 09 '24 14:02 patrickvonplaten

Is it just Stable UnCLIP? Did you also check all the reasonably often used upscaling unets (StableLatentUpscaling, StableUpscaling, ...) ?

The Latent Upscaler uses different block types and as well as a fourier time_embedding_type and uses arguments like resnet_time_scale_shift and timestep_post_act_fn. So maybe it should be its own UNet?

The StableDiffusionUpscaler does use num_class_embeds as an argument. So I will add that back to this UNet design.

DN6 avatar Feb 15 '24 09:02 DN6

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 10 '24 15:03 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 05 '24 15:04 github-actions[bot]