
IP-Adapter FaceID Plus: usage questions

Honey-666 opened this issue 1 year ago • 4 comments

https://github.com/huggingface/diffusers/blob/9ef43f38d43217f690e222a4ce0239c6a24af981/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L492

error msg:

pipe.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
AttributeError: 'list' object has no attribute 'to'

Hi! I'm having some problems using IP-Adapter FaceID Plus. Can you help me answer these questions? Thank you very much.

  1. First question: what should I pass as the ip_adapter_image parameter of the prepare_ip_adapter_image_embeds function?
  2. Second question: the following code differs between the merged PR (link below) and the example in the ip_adapter.md file; what problem does this cause? Merge link: https://github.com/huggingface/diffusers/pull/7186#issuecomment-1986961595. Differing code:
    ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
    neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
    id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda"))
    

@yiyixuxu @fabiorigano

Environment:

diffusers==0.28.0.dev0

This is my code:

# @FileName:StableDiffusionIpAdapterFaceIDTest.py
# @Description:
# @Author:dyh
# @Time:2024/4/24 11:45
# @Website:www.xxx.com
# @Version:V1.0
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from insightface.app import FaceAnalysis
from transformers import CLIPVisionModelWithProjection

model_path = '../../../aidazuo/models/Stable-diffusion/stable-diffusion-v1-5'
clip_path = '../../../aidazuo/models/CLIP-ViT-H-14-laion2B-s32B-b79K'
ip_adapter_path = '../../../aidazuo/models/IP-Adapter-FaceID'
ip_img_path = '../../../aidazuo/jupyter-script/test-img/vermeer.png'


def extract_face_features(image_lst: list, input_size: tuple):
    # Extract Face features using insightface
    ref_images = []
    app = FaceAnalysis(name="buffalo_l",
                       root=ip_adapter_path,
                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

    app.prepare(ctx_id=0, det_size=input_size)
    for img in image_lst:
        image = cv2.cvtColor(np.asarray(img), cv2.COLOR_BGR2RGB)
        faces = app.get(image)
        image = torch.from_numpy(faces[0].normed_embedding)
        ref_images.append(image.unsqueeze(0))
    ref_images = torch.cat(ref_images, dim=0)

    return ref_images


ip_adapter_img = Image.open(ip_img_path)

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    clip_path,
    torch_dtype=torch.float16,
    use_safetensors=True
)

pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    variant="fp16",
    safety_checker=None,
    image_encoder=image_encoder,
    torch_dtype=torch.float16).to("cuda")

adapter_file_lst = ["ip-adapter-faceid-plus_sd15.bin"]
adapter_weight_lst = [0.5]

pipe.load_ip_adapter(ip_adapter_path, subfolder=None, weight_name=adapter_file_lst)
pipe.set_ip_adapter_scale(adapter_weight_lst)

face_id_embeds = extract_face_features([ip_adapter_img], ip_adapter_img.size)

clip_embeds = pipe.prepare_ip_adapter_image_embeds(ip_adapter_image=[ip_adapter_img],
                                                   ip_adapter_image_embeds=None,
                                                   device='cuda',
                                                   num_images_per_prompt=1,
                                                   do_classifier_free_guidance=True)

pipe.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipe.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False  # True if Plus v2

generator = torch.manual_seed(33)
images = pipe(
    prompt='a beautiful girl',
    ip_adapter_image_embeds=clip_embeds,
    negative_prompt="",
    num_inference_steps=30,
    num_images_per_prompt=1,
    generator=generator,
    width=512,
    height=512).images

print(images)

Honey-666 · Apr 24 '24 07:04

hi,

  1. Please refer to the documentation; there you have the link to the face models. Can you try the following code?
clip_embeds = pipeline.prepare_ip_adapter_image_embeds(
                [ip_adapter_images], None, torch.device("cuda"), num_images, True)[0]
  2. If you use CFG (classifier-free guidance), you must provide both neg_ref_images_embeds and ref_images_embeds; in the original implementation this is the default behaviour (as sketched below).
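
A minimal sketch of what item 2 means in practice, assuming a single insightface embedding (faces[0].normed_embedding, as in the snippets below) is already available: with classifier-free guidance, zero embeddings are concatenated in front of the positive face embeddings before they are passed to the pipeline.

import torch

face_embedding = torch.from_numpy(faces[0].normed_embedding)                        # (512,), from insightface
ref_images_embeds = torch.stack([face_embedding.unsqueeze(0)], dim=0).unsqueeze(0)  # (1, 1, 1, 512)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)                         # negative branch for CFG
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(
    dtype=torch.float16, device="cuda")                                              # (2, 1, 1, 512): [negative, positive]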

fabiorigano · Apr 24 '24 08:04


1. OK! I successfully ran the test demo, but the example seems to have an extra closing parenthesis in this line of code: id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda"))

And when I modified the test code for the Plus version, it reported the following error:

  File "C:\work\pythonProject\demo01\venv\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [512]

This is my revised code:

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, DDIMScheduler
from insightface.app import FaceAnalysis
from transformers import CLIPVisionModelWithProjection

model_path = '../../../aidazuo/models/Stable-diffusion/stable-diffusion-v1-5'
clip_path = '../../../aidazuo/models/CLIP-ViT-H-14-laion2B-s32B-b79K'
ip_adapter_path = '../../../aidazuo/models/IP-Adapter-FaceID'
ip_img_path = '../../../aidazuo/jupyter-script/test-img/ip_mask_girl1.png'

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    clip_path,
    torch_dtype=torch.float16,
    use_safetensors=True
)

pipeline = StableDiffusionPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    image_encoder=image_encoder
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(ip_adapter_path, subfolder=None, weight_name="ip-adapter-faceid-plus_sd15.bin",
                         image_encoder_folder=None)
pipeline.set_ip_adapter_scale(0.6)

image = Image.open(ip_img_path)

ref_images_embeds = []
app = FaceAnalysis(name="buffalo_l", root=ip_adapter_path, providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
faces = app.get(image)
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

generator = torch.Generator(device="cpu").manual_seed(42)

clip_embeds = pipeline.prepare_ip_adapter_image_embeds([image], None, torch.device("cuda"), 1, True)[0]

pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False  # True if Plus v2

images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[id_embeds],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20, num_images_per_prompt=1,
    generator=generator
).images

2. Does CFG refer to the "guidance_scale" parameter? It always seems to have a value; if its value is 0, do we still need to add those two lines of code?

Honey-666 · Apr 24 '24 14:04

Thank you for spotting the error; it seems there is another one. I will fix the documentation in a future PR.

I forgot to include the correct preprocessing for the Face ID Plus model:

from insightface.utils import face_align

ref_images_embeds = []
ip_adapter_images = []
app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
faces = app.get(image)
ip_adapter_images.append(face_align.norm_crop(image, landmark=faces[0].kps, image_size=224))
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

generator = torch.Generator(device="cpu").manual_seed(42)

clip_embeds = pipeline.prepare_ip_adapter_image_embeds([ip_adapter_images], None, torch.device("cuda"), 1, True)[0]

pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False 
  2. For the Face ID models we have to prepare the inputs before passing them to the pipeline, so you have to create them as written in the example code.
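
For completeness, a usage sketch showing how these prepared embeddings feed the generation call (parameter values mirror the earlier snippet; generator is the one created above):

images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[id_embeds],  # insightface embeddings; the CLIP embeds were already set on the projection layer
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20,
    num_images_per_prompt=1,
    generator=generator,
).images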

fabiorigano · Apr 24 '24 20:04


With the new preprocessing method described above I have been able to pass the Plus test. Thank you very much for your answer!

Honey-666 · Apr 25 '24 02:04

@fabiorigano does this code work with loading multiple different ip adapters without restriction?

For instance, if I want to load both the Face Plus v1 and v2 adapters, is that possible? I would assume not, because how can I set

pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False 

per adapter.

Additionally, it is unclear to me how to use a mix of FaceID and non-FaceID adapters. Is that supported?

jfischoff · Jun 05 '24 17:06

Hi @jfischoff, you should be able to load both Face ID Plus models. Pass a list with their names to the load_ip_adapter method:

pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name=["ip-adapter-faceid-plus_sd15.bin", "ip-adapter-faceid-plusv2_sd15.bin"])

Then set the shortcut flag just on the second element of the projection layer list: pipeline.unet.encoder_hid_proj.image_projection_layers[1].shortcut = True
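
A hedged sketch of the full per-adapter setup under these assumptions (the scale values are illustrative, ip_adapter_images is the list of aligned face crops from the preprocessing snippet above, and when several adapters are loaded prepare_ip_adapter_image_embeds takes one image group per adapter and returns one embedding tensor per adapter):

pipeline.load_ip_adapter(
    "h94/IP-Adapter-FaceID",
    subfolder=None,
    weight_name=["ip-adapter-faceid-plus_sd15.bin", "ip-adapter-faceid-plusv2_sd15.bin"],
    image_encoder_folder=None,
)
pipeline.set_ip_adapter_scale([0.5, 0.5])  # illustrative, one scale per adapter

clip_embeds = pipeline.prepare_ip_adapter_image_embeds(
    [ip_adapter_images, ip_adapter_images], None, torch.device("cuda"), 1, True)

for i, is_v2 in enumerate([False, True]):  # index 0 -> Plus v1, index 1 -> Plus v2
    pipeline.unet.encoder_hid_proj.image_projection_layers[i].clip_embeds = clip_embeds[i].to(dtype=torch.float16)
    pipeline.unet.encoder_hid_proj.image_projection_layers[i].shortcut = is_v2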

fabiorigano · Jun 05 '24 18:06

Thanks for the response @fabiorigano.

So should I set

pipeline.unet.encoder_hid_proj.image_projection_layers[i].clip_embeds = faceid_clip_embeds[i]
pipeline.unet.encoder_hid_proj.image_projection_layers[i].shortcut = is_v2[i]

for each face ip adapter?

Is it a problem if I have loaded a mix of non-FaceID IP adapters and FaceID adapters? Does that affect the index I need to use in image_projection_layers, or is image_projection_layers only used by the FaceID adapters? Should I set clip_embeds for non-FaceID Plus models as well?

What about passing images/embeds to the pipeline when I have a mix of FaceID and non-FaceID adapters? If I'm using a FaceID model, should I include the embeddings in the same argument when calling the pipeline?

jfischoff · Jun 05 '24 19:06

yes, that's correct

Each ip adapter passed in the list to the load_ip_adapter method has its corresponding image_projection_layers module, so be sure to index the correct one :)

the clip_embeds attribute is only needed for Face ID Plus models, because these adapters (v1 and v2) were trained with both CLIP image embeddings and insightface embeddings.

You can combine different IP adapters; I have tested some combinations. As mentioned above, it is not necessary to set CLIP embeddings on the other image projection modules, and you would get an error if you tried, because the clip_embeds attribute doesn't exist in the other image projection classes.
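
A hedged sketch of one such combination, assuming the standard ip-adapter_sd15.bin weights from h94/IP-Adapter are loaded alongside FaceID Plus; style_image is a hypothetical PIL reference image for the standard adapter, while ip_adapter_images, id_embeds and generator come from the snippets above. Adapter order determines both the index into image_projection_layers and the order of the embeds list.

pipeline.load_ip_adapter(
    ["h94/IP-Adapter", "h94/IP-Adapter-FaceID"],
    subfolder=["models", ""],
    weight_name=["ip-adapter_sd15.bin", "ip-adapter-faceid-plus_sd15.bin"],
    image_encoder_folder=None,  # the CLIP image encoder is already attached to the pipeline
)
pipeline.set_ip_adapter_scale([0.6, 0.6])  # illustrative scales

image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    [style_image, ip_adapter_images], None, torch.device("cuda"), 1, True)

# clip_embeds/shortcut exist only on the FaceID Plus projection layer (index 1 here)
pipeline.unet.encoder_hid_proj.image_projection_layers[1].clip_embeds = image_embeds[1].to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[1].shortcut = False  # True if Plus v2

images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[image_embeds[0], id_embeds],  # standard adapter embeds + FaceID insightface embeds
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20,
    generator=generator,
).images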

fabiorigano · Jun 06 '24 06:06

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Sep 14 '24 15:09