IP-Adapter FaceID Plus: questions on how to use it
https://github.com/huggingface/diffusers/blob/9ef43f38d43217f690e222a4ce0239c6a24af981/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L492
error msg:
pipe.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
AttributeError: 'list' object has no attribute 'to'
hi! I'm having some problems using the ip adapter FaceID PLus. Can you help me answer these questions? Thank you very much
- First question: what should I pass in the ip_adapter_image parameter of the prepare_ip_adapter_image_embeds function?
- Second question: the code in the merge link below does not match the example in the ip_adapter.md file; what problem does this cause?
This is the merge link:
https://github.com/huggingface/diffusers/pull/7186#issuecomment-1986961595
The code that differs:
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda"))
@yiyixuxu @fabiorigano
Environment:
diffusers==0.28.0.dev0
This is my code:
# @FileName:StableDiffusionIpAdapterFaceIDTest.py
# @Description:
# @Author:dyh
# @Time:2024/4/24 11:45
# @Website:www.xxx.com
# @Version:V1.0
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from insightface.app import FaceAnalysis
from transformers import CLIPVisionModelWithProjection
model_path = '../../../aidazuo/models/Stable-diffusion/stable-diffusion-v1-5'
clip_path = '../../../aidazuo/models/CLIP-ViT-H-14-laion2B-s32B-b79K'
ip_adapter_path = '../../../aidazuo/models/IP-Adapter-FaceID'
ip_img_path = '../../../aidazuo/jupyter-script/test-img/vermeer.png'
def extract_face_features(image_lst: list, input_size: tuple):
    # Extract face features using insightface
    ref_images = []
    app = FaceAnalysis(name="buffalo_l",
                       root=ip_adapter_path,
                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
    app.prepare(ctx_id=0, det_size=input_size)
    for img in image_lst:
        image = cv2.cvtColor(np.asarray(img), cv2.COLOR_BGR2RGB)
        faces = app.get(image)
        image = torch.from_numpy(faces[0].normed_embedding)
        ref_images.append(image.unsqueeze(0))
    ref_images = torch.cat(ref_images, dim=0)
    return ref_images
ip_adapter_img = Image.open(ip_img_path)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
clip_path,
torch_dtype=torch.float16,
use_safetensors=True
)
pipe = StableDiffusionPipeline.from_pretrained(
model_path,
variant="fp16",
safety_checker=None,
image_encoder=image_encoder,
torch_dtype=torch.float16).to("cuda")
adapter_file_lst = ["ip-adapter-faceid-plus_sd15.bin"]
adapter_weight_lst = [0.5]
pipe.load_ip_adapter(ip_adapter_path, subfolder=None, weight_name=adapter_file_lst)
pipe.set_ip_adapter_scale(adapter_weight_lst)
face_id_embeds = extract_face_features([ip_adapter_img], ip_adapter_img.size)
clip_embeds = pipe.prepare_ip_adapter_image_embeds(ip_adapter_image=[ip_adapter_img],
                                                   ip_adapter_image_embeds=None,
                                                   device='cuda',
                                                   num_images_per_prompt=1,
                                                   do_classifier_free_guidance=True)
# this is the line that raises the AttributeError shown above: prepare_ip_adapter_image_embeds returns a list
pipe.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipe.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False # True if Plus v2
generator = torch.manual_seed(33)
images = pipe(
prompt='a beautiful girl',
ip_adapter_image_embeds=clip_embeds,
negative_prompt="",
num_inference_steps=30,
num_images_per_prompt=1,
generator=generator,
width=512,
height=512).images
print(images)
hi,
- please refer to the documentation; here you have the link to the face models. Can you try the following code?
clip_embeds = pipeline.prepare_ip_adapter_image_embeds(
    [ip_adapter_images], None, torch.device("cuda"), num_images, True)[0]
- if you use CFG (classifier-free guidance), you must provide both neg_ref_images_embeds and ref_images_embeds. In the original implementation this is the default behaviour.
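For reference, this is roughly how the positive and negative embeddings are combined for CFG in the FaceID examples (taken from the snippets later in this thread; ref_images_embeds is assumed to be the list of per-face insightface embeddings collected beforehand):
import torch

# stack the per-face embeddings collected from insightface into one batch
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
# all-zero embeddings act as the unconditional (negative) input for CFG
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
# concatenate [negative, positive] along the batch dimension before passing to the pipeline
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")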
1. OK!
I successfully ran the test demo, but the test case seems to have an extra parenthesis in this line of code:
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda"))
And when I modified this test code to the Plus version, it reported the following error:
File "C:\work\pythonProject\demo01\venv\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [512]
This is my revised code:
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, DDIMScheduler
from insightface.app import FaceAnalysis
from transformers import CLIPVisionModelWithProjection
model_path = '../../../aidazuo/models/Stable-diffusion/stable-diffusion-v1-5'
clip_path = '../../../aidazuo/models/CLIP-ViT-H-14-laion2B-s32B-b79K'
ip_adapter_path = '../../../aidazuo/models/IP-Adapter-FaceID'
ip_img_path = '../../../aidazuo/jupyter-script/test-img/ip_mask_girl1.png'
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
clip_path,
torch_dtype=torch.float16,
use_safetensors=True
)
pipeline = StableDiffusionPipeline.from_pretrained(
model_path,
torch_dtype=torch.float16,
image_encoder=image_encoder
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(ip_adapter_path, subfolder=None, weight_name="ip-adapter-faceid-plus_sd15.bin",
image_encoder_folder=None)
pipeline.set_ip_adapter_scale(0.6)
image = Image.open(ip_img_path)
ref_images_embeds = []
app = FaceAnalysis(name="buffalo_l", root=ip_adapter_path, providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
faces = app.get(image)
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")
generator = torch.Generator(device="cpu").manual_seed(42)
# note: at this point `image` holds the insightface embedding tensor, not a PIL image or an aligned face crop
clip_embeds = pipeline.prepare_ip_adapter_image_embeds([image], None, torch.device("cuda"), 1, True)[0]
pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False # True if Plus v2
images = pipeline(
prompt="A photo of a girl",
ip_adapter_image_embeds=[id_embeds],
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
num_inference_steps=20, num_images_per_prompt=1,
generator=generator
).images
2. Does CFG refer to the guidance_scale parameter? It always seems to have a value; if its value is 0, do we no longer need to add these two lines of code?
Thank you for spotting the error; it seems there is another one. I will fix the documentation in a future PR.
I forgot to upload the correct preprocessing for the Face ID Plus model:
import cv2
import numpy as np
import torch
from insightface.app import FaceAnalysis
from insightface.utils import face_align

ref_images_embeds = []
ip_adapter_images = []
app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
faces = app.get(image)
# keep an aligned 224x224 face crop: this is what the CLIP image encoder expects
ip_adapter_images.append(face_align.norm_crop(image, landmark=faces[0].kps, image_size=224))
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")
generator = torch.Generator(device="cpu").manual_seed(42)
clip_embeds = pipeline.prepare_ip_adapter_image_embeds([ip_adapter_images], None, torch.device("cuda"), 1, True)[0]
pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False  # True if Plus v2
- for the Face ID models we have to prepare the inputs before passing them to the pipeline, so you have to create them as written in the example code (a sketch of the final pipeline call follows below)
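For completeness, a minimal sketch of the generation call that consumes id_embeds, mirroring the pipeline call from the revised code earlier in this thread (prompt, negative prompt, and step count are just examples):
images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[id_embeds],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20,
    num_images_per_prompt=1,
    generator=generator,
).images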
With the new preprocessing method described above I have been able to pass the Plus test. Thank you very much for your answer!
@fabiorigano does this code work with loading multiple different IP adapters without restriction?
For instance, if I want to load a FaceID Plus v1 and a v2 adapter, is that possible? I would assume not, because how can I set
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False
per adapter?
Additionally, it is unclear to me how to have a collection of FaceID and non-FaceID adapters. Is that supported?
Hi @jfischoff, you should be able to load both Face ID Plus models. Pass a list with their names to the load_ip_adapter method:
pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name=["ip-adapter-faceid-plus_sd15.bin", "ip-adapter-faceid-plusv2_sd15.bin"])
Then set the shortcut only for the second element of the projection layer list: pipeline.unet.encoder_hid_proj.image_projection_layers[1].shortcut = True
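Putting the two steps together, a minimal sketch for loading both FaceID Plus adapters (file names as above; the scale values are only illustrative):
pipeline.load_ip_adapter(
    "h94/IP-Adapter-FaceID",
    subfolder=None,
    weight_name=["ip-adapter-faceid-plus_sd15.bin", "ip-adapter-faceid-plusv2_sd15.bin"],
    image_encoder_folder=None,
)
# one scale per loaded adapter (illustrative values)
pipeline.set_ip_adapter_scale([0.6, 0.6])
# index 0 -> Plus v1 (no shortcut), index 1 -> Plus v2 (shortcut enabled)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False
pipeline.unet.encoder_hid_proj.image_projection_layers[1].shortcut = True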
Thanks for the response @fabiorigano.
So should I set
pipeline.unet.encoder_hid_proj.image_projection_layers[i].clip_embeds = faceid_clip_embeds[i]
pipeline.unet.encoder_hid_proj.image_projection_layers[i].shortcut = is_v2[i]
for each face ip adapter?
Is it a problem if I have loaded a mix of non-FaceID IP adapters and FaceID adapters? Does that affect the index I need to use in image_projection_layers, or is image_projection_layers only used by the FaceID IP adapters? Should I set the clip_embeds for non-FaceID Plus models as well?
What about how I pass images/embeds to the pipeline when I have a mix of FaceID and non-FaceID adapters? If I'm using a FaceID model, should I include its embeddings in the same list when calling the pipeline?
yes, that's correct
Each ip adapter passed in the list to the load_ip_adapter method has its corresponding image_projection_layers module, so be sure to index the correct one :)
The clip_embeds attribute is only needed for Face ID Plus models, because these adapters (v1 and v2) were trained with both CLIP image embeddings and insightface embeddings.
You can combine different IP adapters; I have tested some combinations. As anticipated above, it is not necessary to set CLIP embeddings on the other image projection modules, and you would get an error because the clip_embeds attribute doesn't exist in the other image projection classes.
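As a rough, untested sketch of such a combination (the standard adapter's repo id and file name below are assumptions; style_embeds stands for the embeddings prepared for that adapter, while id_embeds and clip_embeds are the FaceID Plus ones prepared as above):
# hypothetical mix: one standard IP-Adapter plus one FaceID Plus v2 adapter
pipeline.load_ip_adapter(
    ["h94/IP-Adapter", "h94/IP-Adapter-FaceID"],  # assumed repo ids
    subfolder=["models", None],  # subfolder handling may need adjusting for your diffusers version
    weight_name=["ip-adapter_sd15.safetensors", "ip-adapter-faceid-plusv2_sd15.bin"],
    image_encoder_folder=None,
)
pipeline.set_ip_adapter_scale([0.5, 0.6])
# clip_embeds and shortcut only exist on the FaceID Plus projection layer (index 1 here)
pipeline.unet.encoder_hid_proj.image_projection_layers[1].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[1].shortcut = True  # Plus v2
# pass one embedding tensor per adapter, in the same order they were loaded
images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[style_embeds, id_embeds],
    num_inference_steps=20,
    generator=generator,
).images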