
High-level view of the procedure for multi-ID synthesis

yunbinmo opened this issue 1 year ago · 5 comments

Hi! This is amazing work and I would like to understand it better. I am new to ComfyUI and not very familiar with the source inference code released by InstantID, but I did read their paper.

So I was trying to read through your InstantID.py, and here's what I understand from it. Let's say we have input ID images A and B and a pose image P; to generate a multi-ID image, we do the following:

  1. Mask the right half of P, put A and P through the InstantID pipeline, and get output latent feature A'.
  2. Mask the left half of P, put B and the flipped P through the InstantID pipeline, and get output latent feature B'.
  3. We then combine A' and B' as if they were one complete latent feature, and send this to the next denoising step.

Is this understanding correct?
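
To make steps 1-3 concrete, here is how I currently picture them at the tensor level. This is only a rough sketch under my own assumptions (latent shapes, half-and-half masks), not the actual node code:

```python
# Purely illustrative sketch of steps 1-3 (not the actual node code):
# each face is run through the pipeline with its half of the pose masked,
# and the two resulting latents are stitched back into one latent.
import torch

h, w = 128, 128                        # latent spatial size (assumed)

left_mask = torch.zeros(1, 1, h, w)    # 1 where face A is allowed to act
left_mask[..., :, : w // 2] = 1.0
right_mask = 1.0 - left_mask           # complementary region for face B

# stand-ins for A' and B', the latents produced by the InstantID pipeline
# conditioned on face A (left half) and face B (right half)
latent_a = torch.randn(1, 4, h, w)
latent_b = torch.randn(1, 4, h, w)

# step 3: combine them as if they were one complete latent
combined = latent_a * left_mask + latent_b * right_mask
print(combined.shape)                  # torch.Size([1, 4, 128, 128])
```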

My questions are:

  1. Where do you inject the text prompt for each image? The original InstantID pipeline injects it through the IP-Adapter, but in the definition of ApplyInstantID in InstantID.py it looks like the positive and negative text prompts are injected via the ControlNet (called IdentityNet in InstantID's paper). In IdentityNet, however, the text prompt is replaced by the face embedding in the original InstantID pipeline, so I am confused here.
  2. Even with attention masking, the background is still smooth. Does this just come naturally from the diffusion process, or is there some optimization done?

Really appreciate any help because I want to modify the workflow to suit my own use case, but I need to understand it fully first. Thank you!

yunbinmo avatar Feb 27 '24 04:02 yunbinmo

Visual embeds are added to the text prompt, not replaced.

Sorry, I don't understand the second question.
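
Roughly, here is a minimal sketch of what "added, not replaced" means; the shapes and the token-wise concatenation are only for illustration, not the node's exact code:

```python
# Toy sketch of "added, not replaced" (illustration only): the face/visual
# tokens are concatenated onto the text tokens instead of substituting them.
import torch

text_tokens = torch.randn(1, 77, 2048)   # text conditioning (assumed shape)
face_tokens = torch.randn(1, 4, 2048)    # projected face embeds (assumed shape)

replaced = face_tokens                                # IdentityNet as described in the paper
added = torch.cat([text_tokens, face_tokens], dim=1)  # keep the text, append the face tokens
print(added.shape)                                    # torch.Size([1, 81, 2048])
```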

cubiq avatar Feb 27 '24 06:02 cubiq

> Visual embeds are added to the text prompt, not replaced.
>
> Sorry, I don't understand the second question.

  1. Regarding the first question, the following is quoted from the InstantID paper: [screenshot from the paper]

They mention that they use the face embedding instead of the text embedding in the ControlNet, which is why I am confused. So can I confirm that this workflow differs from the original InstantID paper, in that the text prompt is also used in the ControlNet and an individual ControlNet is used for each input image?

  2. For the second question, I am actually asking why we don't see any boundary (or unnatural transition) in the middle of the background of the generated image even when attention masking is used. If the two images each attend only to their own half of the image (see the sketch after this list), intuitively there should be some inconsistency in the background.

  3. And can I confirm that my understanding of the overall workflow is correct?
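
For reference, this is the kind of masked attention I have in mind for question 2. It's a toy version under my own assumptions, not the repo's actual attention patch:

```python
# Toy masked image-prompt attention (my assumption of how the masking works):
# the contribution of each face's tokens is zeroed outside that face's half
# of the latent, so face A only influences queries in its own region.
import torch
import torch.nn.functional as F

def masked_ip_attention(q, k, v, region_mask):
    # q: (B, Nq, C) latent queries; k, v: (B, Nid, C) one face's tokens
    # region_mask: (B, Nq, 1), 1 inside the face's half, 0 outside
    scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    return out * region_mask              # face only influences its own half

B, C, side = 1, 640, 32
q = torch.randn(B, side * side, C)
k = v = torch.randn(B, 4, C)

mask2d = torch.zeros(1, 1, side, side)
mask2d[..., :, : side // 2] = 1.0         # left half of the latent grid
left = mask2d.reshape(B, side * side, 1)

print(masked_ip_attention(q, k, v, left).shape)   # torch.Size([1, 1024, 640])
```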

Sorry, I have many questions, and thank you so much for taking the time to answer them!

yunbinmo avatar Feb 27 '24 07:02 yunbinmo

Hi, may I also ask: for multi-ID synthesis, why did you choose to inject the text prompt via the ControlNet instead of via the cross-attention in the IP-Adapter?

yunbinmo avatar Feb 29 '24 00:02 yunbinmo

ComfyUI supports a few ways of merging the embeds; this way we are compatible with all of them with very little effort. I agree it's a bit complicated as it is now, I'll see if I can find a better way.
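
As a toy illustration of what "a few ways of merging the embeds" can mean (not ComfyUI's actual implementation, just assumed shapes and the usual merge strategies):

```python
# Different merge strategies over two sets of conditioning tokens
# (illustration only; shapes are assumed).
import torch

a = torch.randn(1, 4, 2048)            # tokens from one source
b = torch.randn(1, 4, 2048)            # tokens from another source

concat  = torch.cat([a, b], dim=1)     # keep every token    -> (1, 8, 2048)
summed  = a + b                        # element-wise add    -> (1, 4, 2048)
average = (a + b) / 2                  # element-wise mean   -> (1, 4, 2048)
```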

cubiq avatar Feb 29 '24 08:02 cubiq

> ComfyUI supports a few ways of merging the embeds; this way we are compatible with all of them with very little effort. I agree it's a bit complicated as it is now, I'll see if I can find a better way.

I see. Thanks again for the amazing work, and I'm looking forward to a better workflow!

yunbinmo avatar Feb 29 '24 08:02 yunbinmo