More thorough guidance for multiple IP adapter images/masks and a single IP Adapter

Open · chrismaltais opened this issue 8 months ago · 9 comments

Describe the bug

I'm trying to use a single IP adapter with multiple IP adapter images and masks. This section of the docs gives an example of how I could do that: https://huggingface.co/docs/diffusers/v0.29.0/en/using-diffusers/ip_adapter#ip-adapter-masking

The docs provide the following code:

import torch
from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image

# (the docs snippet assumes an SDXL `pipeline` has already been created)

mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")

output_height = 1024
output_width = 1024

processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"])
pipeline.set_ip_adapter_scale([[0.7, 0.7]])  # one scale for each image-mask pair

face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")

ip_images = [[face_image1, face_image2]]

# Collapse the per-image masks into one tensor of shape (1, num_images, H, W),
# then wrap it in a list: one tensor per IP adapter
masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]

generator = torch.Generator(device="cpu").manual_seed(0)
num_images = 1

image = pipeline(
    prompt="2 girls",
    ip_adapter_image=ip_images,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20,
    num_images_per_prompt=num_images,
    generator=generator,
    cross_attention_kwargs={"ip_adapter_masks": masks}
).images[0]

One important point that should be highlighted is that images, scales, and masks must be lists of lists; otherwise we get the following error: Cannot assign 2 scale_configs to 1 IP-Adapter.
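To spell out the nesting (this is my reading of the docs snippet above; img1/img2 are placeholders): the outer list indexes IP adapters, and the inner list indexes the images for each adapter.

# One IP adapter, two images: outer list = adapters, inner list = images
ip_images = [[img1, img2]]                    # 1 adapter, 2 images
pipeline.set_ip_adapter_scale([[0.7, 0.7]])   # 1 adapter, 2 scales

# For comparison, two adapters with one image each would be:
# ip_images = [[img_a], [img_b]]
# pipeline.set_ip_adapter_scale([0.6, 0.5])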

That error message is intuitive enough; however, this gets confusing in other sections of the documentation, such as the examples for the set_ip_adapter_scale() function:

# To use original IP-Adapter
scale = 1.0
pipeline.set_ip_adapter_scale(scale)

# To use style block only
scale = {
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

# To use style+layout blocks
scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

# To use style and layout from 2 reference images
scales = [{"down": {"block_2": [0.0, 1.0]}}, {"up": {"block_0": [0.0, 1.0, 0.0]}}]
pipeline.set_ip_adapter_scale(scales)
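As an aside, the "down"/"up" block names in these dicts address the UNet's cross-attention layers; if it helps, they can be listed directly (a sketch, assuming the SDXL pipeline from the docs example above):

# Print the addressable cross-attention layers behind the "down"/"up" names
for name in pipeline.unet.attn_processors:
    if name.endswith("attn2.processor"):  # cross-attention layers only
        print(name)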

Is it possible to use the style and layout from 2 reference images with a single IP Adapter? I tried the following, which builds on the requirement that these arguments be lists of lists:

# List of lists to support multiple images/scales/masks with a single IP Adapter
scales = [[{"down": {"block_2": [0.0, 1.0]}}, {"up": {"block_0": [0.0, 1.0, 0.0]}}]]
pipeline.set_ip_adapter_scale(scales)

# OR

# Use layout and style from InstantStyle for one image, but also use a numerical scale value for the other
scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale([[0.5, scale]])

but I get the following error:

TypeError: unsupported operand type(s) for *: 'dict' and 'Tensor'

At:
  /usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py(2725): __call__
  /usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py(549): forward
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
  /usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py(366): forward
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
  /usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_2d.py(440): forward
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
  /usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_blocks.py(1288): forward
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
  /usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_condition.py(1220): forward
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
  /usr/local/lib/python3.10/dist-packages/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py(1510): __call__
  /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(115): decorate_context
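Based on the traceback, it looks like the masked attention path multiplies each per-image scale directly against a tensor, which only works for numeric scales. A tiny standalone illustration of the operand error (my assumption about the failure mode, not actual diffusers code):

import torch

scale = {"up": {"block_0": [0.0, 1.0, 0.0]}}  # InstantStyle-style scale dict
mask = torch.ones(1, 2, 32, 32)               # stand-in for the mask tensor

scale * mask  # TypeError: unsupported operand type(s) for *: 'dict' and 'Tensor'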

Reproduction

  1. Load a single IP Adapter into the pipeline
  2. Use two IP adapter images, two masks, and two scales
  3. Try to use an InstantStyle config to set the IP Adapter scale
from diffusers import AutoPipelineForText2Image
from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image
from PIL import ImageOps
import torch

# Subject/Foreground Style/Mask
subject_style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg")
subject_mask = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")

# Background Style/Mask
background_style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
background_mask = ImageOps.invert(subject_mask)

# Load pipeline + IP Adapter
pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
generator = torch.Generator(device="cpu").manual_seed(26)

# Structure of subject, style of background
layout = {"down": {"block_2": [0.0, 1.0]}}
style = {"up": {"block_0": [0.0, 1.0, 0.0]}}
pipeline.set_ip_adapter_scale([[layout, style]])

# Preprocess mask images
processor = IPAdapterMaskProcessor()
ip_adapter_masks = processor.preprocess(
    [subject_mask, background_mask], height=1024, width=1024  # match the SDXL output size
).cuda()
ip_adapter_masks = [
    ip_adapter_masks.reshape(
        1, ip_adapter_masks.shape[0], ip_adapter_masks.shape[2], ip_adapter_masks.shape[3]
    )
]

ip_adapter_images = [[subject_style_image, background_style_image]]

image = pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=ip_adapter_images,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
    num_inference_steps=30,
    generator=generator,
    cross_attention_kwargs={"ip_adapter_masks": ip_adapter_masks}
).images[0]
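For comparison, the only route I can think of for giving each reference image its own InstantStyle dict is to load the same weights twice, so that each image belongs to its own adapter entry. This is just a guess at a workaround, not something the docs confirm:

# Untested guess: load the same weights twice to get two adapter entries,
# each of which accepts one scale config and one image list
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder=["sdxl_models", "sdxl_models"],
    weight_name=["ip-adapter_sdxl.bin", "ip-adapter_sdxl.bin"],
)
pipeline.set_ip_adapter_scale([layout, style])  # one config per adapter
ip_adapter_images = [[subject_style_image], [background_style_image]]
# (each adapter would then presumably need its own (1, 1, H, W) mask tensor)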

Logs

No response

System Info

  • diffusers version: 0.27.2
  • Platform: Linux-6.5.0-1020-gcp-x86_64-with-glibc2.35
  • Python version: 3.10.1
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Huggingface_hub version: 0.21.1
  • Transformers version: 4.39.2
  • Accelerate version: 0.28.0
  • xFormers version: 0.0.23.post1
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sayakpaul @yiyixuxu

chrismaltais · Jun 18 '24