
add PAG support

Open yiyixuxu opened this issue 9 months ago • 10 comments

we are interested in integrating PAG into diffusers and want to use this PR to understand its compatibility and meaningful use cases. A few cases we know it should be good for include controlnet, inpainting, and the upscaler (@asomoza is testing these use cases). But there are many other pipelines and techniques that may or may not work well with PAG. For example, does it work with IP-Adapter? LCM? AnimateDiff?

I made a PAGMixin in this PR so PAG can be easily applied to any pipeline. Feel free to branch out from this PR so that you can play with it, and let us know your findings. I appreciate your help :)

testing script for StableDiffusionXLPipeline

from diffusers import StableDiffusionXLPipeline
import torch
from diffusers.utils import make_image_grid

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
# test1:
# base_cfg
generator = torch.Generator(device='cuda').manual_seed(1)
output_base_cfg = pipe(
        "an insect robot preparing a delicious meal, anime style",
        num_inference_steps=25,
        guidance_scale=7,
        generator=generator,
    ).images[0]

# base_uncond
generator = torch.Generator(device='cuda').manual_seed(1)
output_base_uncond = pipe(
        "an insect robot preparing a delicious meal, anime style",
        num_inference_steps=25,
        guidance_scale=0,
        generator=generator,
    ).images[0]

# test2: 
# pag_cfg

pipe.enable_pag(pag_scale=3.0, pag_applied_layers=['mid'])
generator = torch.Generator(device='cuda').manual_seed(1)

output_pag_cfg = pipe(
        "an insect robot preparing a delicious meal, anime style",
        num_inference_steps=25,
        guidance_scale=7,
        generator=generator,
    ).images[0]
# pag_uncond

pipe.disable_pag()
pipe.enable_pag(pag_scale=3.0, pag_applied_layers=['mid'], pag_cfg=False)
generator = torch.Generator(device='cuda').manual_seed(1)

output_pag_uncond = pipe(
        "an insect robot preparing a delicious meal, anime style",
        num_inference_steps=25,
        guidance_scale=0,
        generator=generator,
    ).images[0]

make_image_grid(
    [output_base_cfg, output_base_uncond, output_pag_cfg, output_pag_uncond],
    rows=2,
    cols=2,
).save("yiyi_test_11_out.png")

The first row is base (guidance_scale=7.0, guidance_scale=0); the second row is PAG (guidance_scale=7.0, guidance_scale=0). yiyi_test_11_out

yiyixuxu avatar May 14 '24 10:05 yiyixuxu

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@asomoza can you test it out? I tried to make it work with IP-Adapter but I don't think it works - do you know if PAG works with IP-Adapter? What other pipelines should I add this to for testing?

yiyixuxu avatar May 14 '24 10:05 yiyixuxu

cc @HyoungwonCho for awareness. Also a question: does PAG work with IP-Adapter?

yiyixuxu avatar May 14 '24 19:05 yiyixuxu

I've been doing some tests and I like it a lot.

no PAG PAG CFG
20240515013510_925590493 20240515013548_925590493

I think it makes the robot more coherent and it fixes some of the wrong details, but it makes it less "humanoid" and loses a bit of the cinematic look.

I'm still deciding whether I'd prefer a layer/block naming scheme like the one used with the LoRAs and ip_adapter, or whether pag_applied_layers and pag_applied_layers_index is better. I'll give some examples to evaluate this.

So let's say I want to test it with what I normally use for the pose in the LoRAs, which is all the layers in down block 2. With the current system I need to do this:

pag_applied_layers_index = ["d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23"]

the equivalent could be this:

pag_applied_layers = {"down": ["block_2"]}

or for example the last attention block which is what we can associate to the composition with IP Adapters:

pag_applied_layers_index = ["d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23"]

for this, an equivalent could be:

pag_applied_layers = {"down": {"block_2": "attentions_1"}}
down_block_2 down_block_2_attentions_1
20240515023111_925590493 20240515023839_925590493

I don't know if going as granular as each individual layer would bring a benefit; even someone like me who likes full control won't go as far as trying to steer an image with 70 different layers on top of everything else.

As an example, as an advanced user, I want to use PAG to make the image better, but without the robot losing its humanoid form and the cinematic look.

Doing some quick tests, I found that for this particular image, this works really well:

pipeline.enable_pag(
    pag_scale=3.0,
    pag_applied_layers=None,
    pag_applied_layers_index=[
        "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "u0", "u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8", "u9",
    ],
)

which in the lora format would be like this:

pag_applied_layers = {"down": {"block_2": None}, "up": {"block_1": "attentions_0"}}

20240515025029_925590493

I hope this example is somewhat clear. It also shows that the layer choice matters a lot - the image is much better with this.
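To make the two naming schemes above concrete, here is a tiny pure-Python sketch of how a nested spec could expand into the flat index list. The helper name and the lookup table are assumptions pulled from the examples in this comment, not diffusers API:

```python
# Index ranges taken from the examples above (SDXL): down block_2 spans
# d4-d23, its attentions_1 spans d14-d23, and up block_1 attentions_0
# spans u0-u9. All names here are illustrative only.
SDXL_PAG_LAYER_MAP = {
    ("down", "block_2", None): [f"d{i}" for i in range(4, 24)],
    ("down", "block_2", "attentions_1"): [f"d{i}" for i in range(14, 24)],
    ("up", "block_1", "attentions_0"): [f"u{i}" for i in range(10)],
}

def expand_pag_layers(spec):
    """Expand e.g. {"down": {"block_2": None}} into a flat index list."""
    indices = []
    for direction, blocks in spec.items():
        for block, attn in blocks.items():
            indices.extend(SDXL_PAG_LAYER_MAP[(direction, block, attn)])
    return indices

# the "pose" example: all attention layers in down block 2
print(expand_pag_layers({"down": {"block_2": None}}))
```

With a per-architecture table like this, the readable nested form and the flat index form stay interchangeable.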

I'll do tests with the other use cases later, especially the upscaler.

asomoza avatar May 15 '24 06:05 asomoza

@yiyixuxu @asomoza Hello, I was impressed by the various experiments you conducted using PAG! We are also discussing the use of PAG in various tasks, as well as layer/scale selection.

Since the guidance framework of PAG itself is simple, it seems quite possible to use it in conjunction with other modules like the IP-Adapter you mentioned. However, we have not yet implemented and experimented with it directly, so we have not confirmed whether there is a significant performance improvement when used together. If possible, we will conduct additional experiments in the future.

Thank you for your interest in our research.

HyoungwonCho avatar May 15 '24 13:05 HyoungwonCho

Thank you for the great work! However, I encountered the following issue when using StableDiffusionXLControlNetPipeline with CFG and PAG:

  File ".../.env/lib/python3.11/site-packages/diffusers/models/controlnet.py", line 798, in forward
    sample = sample + controlnet_cond
             ~~~~~~~^~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0

I solved it by adding a new parameter do_perturbed_attention_guidance and appending the following lines in the prepare_image method.

        if do_classifier_free_guidance and do_perturbed_attention_guidance and not guess_mode:
            image = torch.cat([image] * 3)
        elif do_classifier_free_guidance and not guess_mode:
            image = torch.cat([image] * 2)
        elif do_perturbed_attention_guidance and not guess_mode:
            image = torch.cat([image] * 2)
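The branches above boil down to a batch-multiplier rule: the conditioning image must be repeated once per guidance branch so its batch dimension matches the latents. A pure-Python sketch of that rule (function name hypothetical):

```python
def cond_image_repeats(do_classifier_free_guidance,
                       do_perturbed_attention_guidance,
                       guess_mode):
    """How many copies of the controlnet conditioning image are needed."""
    if guess_mode:
        # in guess mode only the conditional branch goes through the controlnet
        return 1
    if do_classifier_free_guidance and do_perturbed_attention_guidance:
        return 3  # uncond + text + perturbed batches
    if do_classifier_free_guidance or do_perturbed_attention_guidance:
        return 2
    return 1
```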

KKIEEK avatar May 15 '24 19:05 KKIEEK

@KKIEEK thanks! I added your change:)

yiyixuxu avatar May 15 '24 23:05 yiyixuxu

Just leaving a brief report of my findings with PAG and Diffusers (I already had it integrated in my pipelines before this PR):

  • It generally works very very well when properly tuned. Almost looks like a significant model upgrade.
  • I'm using it with models derived from SD2.1.
  • Implemented it successfully in text-to-image, image-to-image, controlnet, unclip, and inpainting pipelines.
  • I get the best results with values around guidance_scale=7 and pag_scale=3.
  • The layers it is applied to make a huge difference in the output - it's the difference between garbage and excellent. Adding or removing a single layer can make or break it.
  • For example, for SD2.1, I found that with just [m0] the effect was too subtle, [d4, d5, m0] was overcooked, and [d5, m0] seems to work best; adding any up layers (e.g. [d5, m0, u0]) typically screws up the results.
  • The applied layers will obviously change across model architectures, and I imagine the "optimal" layers might even change with fine-tunes. I couldn't replicate the optimal parameters described in the paper (for SD1.5) with SD2.1 (which has the same unet architecture).

jorgemcgomes avatar May 16 '24 10:05 jorgemcgomes

@jorgemcgomes thanks!

yiyixuxu avatar May 20 '24 17:05 yiyixuxu

Hello. I'm an author of PAG. Thank you for your insightful opinions and cool implementation. Is there anything currently in progress? We are excited to see that PAG is gaining popularity within the community and being utilized in various workflows. Especially in ComfyUI, PAG nodes are used in diverse workflows.

(Some workflows using PAG in ComfyUI: https://www.reddit.com/r/StableDiffusion/comments/1c68qao/perturbedattention_guidance_really_helps_with/ https://civitai.com/models/141592/pixelwave https://civitai.com/models/413564/cjs-super-simple-high-detail-cosxl-and-pag-workflow https://www.reddit.com/r/StableDiffusion/comments/1c4cb3l/improve_stable_diffusion_prompt_following_image/ https://www.reddit.com/r/StableDiffusion/comments/1ck69az/make_it_good_options_in_stable_diffusion/ https://stable-diffusion-art.com/perturbed-attention-guidance/)

However, in Diffusers, it seems somewhat challenging to try creative combinations as the pipelines are separated. ( a collection of PAG pipelines with Diffusers: https://x.com/multimodalart/status/1788844183760847106 )

Therefore, the Mixin approach taken in this PR appears to be a very effective solution. However, it seems a bit awkward to have to call enable_pag every time just to adjust the pag scale. Ideally, it would be more natural to set pag_scale when calling the pipeline after enable_pag (similar to passing ip_adapter_image=image at call time after load_ip_adapter). So, I'm exploring a better design for this.

Additionally, since there are many users who want compatibility with IP-adapter, now I have time and would like to work on making it compatible with IPAdapter. I'm curious if there's any related progress about component design or IP-adapter compatibility.

Thank you!

sunovivid avatar May 24 '24 02:05 sunovivid

@sunovivid thanks for the message! This is not the finalized design, just something we can use to test out PAG's compatibility - we will iterate on the final design.

for IP-Adapter, it would be super cool if we can make it work! I'm not aware of any related progress, so I would really appreciate it if you are able to find time to work on this! Maybe we can just pick one of the pipelines from this PR (with the mixin) and make it work with the ip_adapter_image input?

yiyixuxu avatar May 28 '24 22:05 yiyixuxu

@yiyixuxu Hi! I made a working version of PAG + IP-adapter. Can you check the PR?

sunovivid avatar Jun 02 '24 18:06 sunovivid

@sunovivid we will merge in and work on a new design for PAG once you upload the new change for ip-adapter :)

for pag_applied_layers:

  1. I think we should use the lora format, let me know what you think @sunovivid: see @asomoza 's comments and experiments here https://github.com/huggingface/diffusers/pull/7944#issuecomment-2111728298; you can also find more about the scale dict we support in ip-adapter and lora here and here
  2. is pag_applied_layers something we would want to change a lot for different generations? i.e. can we make it a pipeline config/attribute instead of a call argument? I think we will have to make pag_scale a call argument

yiyixuxu avatar Jun 03 '24 21:06 yiyixuxu

Hi @yiyixuxu,

Thank you for the feedback!

I might have misunderstood something. Should I upload the new changes for the ip-adapter in this PR? How can I upload the changes? Should I attach files or use another approach?

for pag_applied_layers:

  1. Completely agree! For user convenience, the overall code should consistently follow the conventions used in the Diffusers codebase.
  2. I believe once the best choice for pag_applied_layers is determined per model through experiments (like the great example you provided in @asomoza's comment), it likely won't need frequent changes. Users will likely follow the recommended approach for each model. I also agree that pag_scale should be a call argument.

sunovivid avatar Jun 04 '24 08:06 sunovivid

@HyoungwonCho @sunovivid this PR is ready for a final review now! I would appreciate it if you could also take a look! I updated the PR description https://github.com/huggingface/diffusers/pull/7944#issue-2295049124

yiyixuxu avatar Jun 10 '24 19:06 yiyixuxu

cc @apolinario and @vladmandic

we plan to support more popular features like PAG in diffusers, so design-wise, this PR sets the example for the future PRs. Would appreciate your inputs too:)

yiyixuxu avatar Jun 10 '24 19:06 yiyixuxu

thanks @yiyixuxu

from a quick glance, the new "magic" is mostly in src/diffusers/pipelines/auto_pipeline.py, triggered on kwargs.

PAG itself is still a separate pipeline and can be used as a separate pipeline; it's just that autopipeline will do automatic switching if enable_pag is in kwargs:

orig_class_name = orig_class_name.replace("Pipeline", "PAGPipeline")

i'm ok with that. One potential issue is propagation of future fixes - e.g. if a fix lands somewhere in StableDiffusionPipeline and autopipeline does a behind-the-scenes switch to StableDiffusionPAGPipeline, then we really need to ensure there are no regressions there, since the user is not even explicitly aware of the switch.

just not sure about the mappings using string replace - ok for PAG, but would this pattern apply universally?

text_2_image_cls.__name__.replace("PAG", "").replace("Pipeline", "PAGPipeline"),
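The rewrite can be exercised in isolation; a pure-Python sketch of the quoted mapping (helper name hypothetical):

```python
def to_pag_class_name(pipeline_class_name):
    # Strip any existing "PAG" tag first so the mapping is idempotent,
    # then rename "...Pipeline" to "...PAGPipeline".
    return pipeline_class_name.replace("PAG", "").replace("Pipeline", "PAGPipeline")

print(to_pag_class_name("StableDiffusionXLControlNetPipeline"))
# note: plain replace() is purely textual, so the pattern only stays correct
# as long as every PAG variant is named by inserting "PAG" right before "Pipeline"
```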

vladmandic avatar Jun 10 '24 20:06 vladmandic

Thanks for your hard work!

In my opinion, it looks good. One minor concern, similar to @vladmandic's opinion, is that propagating future changes and updates might be tedious work. It might be better to work like IP-Adapter, which is fully merged into the original pipeline. However, I also totally agree with your opinion that we should keep the codebase as compact as possible since it is already very complex and supports many papers. Compared to IP-Adapter, which is a relatively simple add-on, supporting PAG requires a batch size of 3, which breaks the common presumption of using a batch size of 2 for CFG. So this is a tradeoff, and I support both opinions from the diffusers team.

A minor suggestion: in line 185 of src/diffusers/pipelines/pag/pag_utils.py (I also wrote this as a comment), the noise_pred_uncond is actually conditional, so for clarity, I think it would be better to use noise_pred_text.
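For context, the combined update being discussed can be sketched on plain floats. This is the common CFG + PAG formulation and only an assumption about what pag_utils.py computes exactly; both guidance offsets are anchored on the conditional prediction, which is why the variable naming matters:

```python
def guided_noise_pred(noise_pred_uncond, noise_pred_text, noise_pred_perturb,
                      guidance_scale, pag_scale):
    # CFG: move from the unconditional toward the text-conditional estimate.
    # PAG: add a second offset away from the attention-perturbed estimate.
    return (noise_pred_uncond
            + guidance_scale * (noise_pred_text - noise_pred_uncond)
            + pag_scale * (noise_pred_text - noise_pred_perturb))
```

With pag_scale=0 this reduces to plain classifier-free guidance.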

Thank you again for your hard work.

sunovivid avatar Jun 12 '24 06:06 sunovivid

@yiyixuxu Hello,

The implementation of PAG seems flawless! Aside from the noise_pred_uncond part mentioned by sunovivid, it looks perfectly implemented. Thank you very much for implementing and merging our paper into the diffusers library. :)

I also share a similar opinion regarding the integration of the PAG pipeline with the basic stable diffusion pipeline. Since PAG can be widely used for sampling under various conditions and can be easily toggled on/off, it seems it would be useful if merged into the basic pipeline. However, when pag and cfg are used together, the input batch size changes from the usual situation, which could make the implementation of additional papers and elements relatively more complex. As @sunovivid mentioned, it seems we need to balance the convenience of using pag by adding it to the basic pipeline with the simplicity of the code. I will endorse the decision of the diffusers administrators on this matter.

HyoungwonCho avatar Jun 12 '24 09:06 HyoungwonCho

I'm testing it with controlnet, and if I enable guess_mode I get this error:

down_block_res_sample = down_block_res_sample + down_block_additional_residual
                            ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0

asomoza avatar Jun 13 '24 20:06 asomoza

PAG keeps surprising me. I tested it with ControlNet and even without guess_mode, the difference is really impressive.

preprocessed without pag with pag
20240613161704 20240613162602_1471724984 20240613163558_1471724984

@yiyixuxu can you add also the img2img sdxl pipeline? That's the one I need for the upscaler.

asomoza avatar Jun 13 '24 20:06 asomoza

@asomoza

added guess_mode support, but I don't think it works well

from left to right:

  • no pag & no CFG
  • CFG (guidance_scale=7.5)
  • pag(pag_scale=3.0)

yiyi_test_10_out_guess_mode_True

yiyixuxu avatar Jun 17 '24 03:06 yiyixuxu

@asomoza img2img added!

yiyixuxu avatar Jun 17 '24 07:06 yiyixuxu

I believe this is still missing StableDiffusionXLControlNetPAGImg2ImgPipeline?

dboshardy avatar Jun 17 '24 17:06 dboshardy

When I tried to use pag with guess mode I got this error now:

models/controlnet.py", line 791, in forward
    add_embeds = torch.concat([text_embeds, time_embeds], dim=-1)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

But I forced it to "cuda" and tested it.

pag - no guess mode no pag - guess_mode pag - guess_mode - no cfg
20240613163558_1471724984 20240617165000_1471724984 20240617174030_1471724984

So it's definitely not working. I'll try digging into it later, but this is not that important - the result without guess_mode is good enough as it is.

Also, since we now have the controlnet and img2img pipelines, I think it makes sense to also have StableDiffusionXLControlNetPAGImg2ImgPipeline.

asomoza avatar Jun 17 '24 21:06 asomoza

@asomoza Yeah, there is one version the author implemented for sd1.5 here: https://huggingface.co/hyoungwoncho/sd_perturbed_attention_guidance_controlnet

I tried it, and I think it did not work there either.

for StableDiffusionXLControlNetPAGImg2ImgPipeline, totally, but it can be added later - one advantage of implementing it this way is that it is easier for the community to contribute :)

yiyixuxu avatar Jun 17 '24 22:06 yiyixuxu

@asomoza should we remove the guess_mode if it's not working yet? cc @HyoungwonCho @KKIEEK here too

yiyixuxu avatar Jun 18 '24 17:06 yiyixuxu

should we remove the guess_mode

IMO yes; if not, it could give users the wrong impression that it works with PAG.

asomoza avatar Jun 18 '24 17:06 asomoza

cc @stevhliu it is pretty much ready to merge now - do you want to take a look at the doc? feel free to refactor later too

yiyixuxu avatar Jun 25 '24 05:06 yiyixuxu

@HyoungwonCho @sunovivid

thanks for all the support you provided throughout the PAG + diffusers integration! About your concerns regarding propagating future changes: we are very much aware of the additional maintenance burden - it is a trade-off we consciously made, and I think we have been managing it pretty well with help from our community :)

I will merge this PR soon, and we'll try our best to promote PAG moving forward. It is indeed an amazing technique :)

do you have any plans to add PAG support for SD3 and Hunyuan-DiT?

yiyixuxu avatar Jun 25 '24 08:06 yiyixuxu