CrossAttentionControl

An observation

Open sameerKgp opened this issue 2 years ago • 3 comments

Hi, thanks for the code. I have observed that in the examples you provided, even if I directly use the cross-attention from the edited prompt, by commenting out the line "attn_slice = attn_slice * (1 - self.last_attn_slice_mask) + new_attn_slice * self.last_attn_slice_mask", I get the same result in most cases. I checked cases where words are replaced or new phrases like " in winter" are added. So it seems like the cross-attention editing is not having any effect. Please comment on this. Thanks.
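A minimal sketch of what I mean (only the quoted blend line is from the repo; the surrounding function, argument names, and its placement inside the attention forward pass are just for illustration):

```python
import torch

def edited_cross_attention(attn_slice: torch.Tensor,
                           new_attn_slice: torch.Tensor,
                           last_attn_slice_mask: torch.Tensor,
                           apply_blend: bool = True) -> torch.Tensor:
    # attn_slice:           cross-attention computed for the edited prompt
    # new_attn_slice:       attention to inject (e.g. saved from the source prompt)
    # last_attn_slice_mask: 1 where the injected attention should replace attn_slice
    if apply_blend:
        # the line quoted above
        attn_slice = attn_slice * (1 - last_attn_slice_mask) + new_attn_slice * last_attn_slice_mask
    # with apply_blend=False (i.e. the line commented out) the edited-prompt
    # attention is used as-is, which is the experiment described above
    return attn_slice
```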

sameerKgp commented on Aug 22, 2023

I think there is a bug in the "stablediffusion" function of CrossAttention_Release_NoImages.py. The same latent is being used for both the noise_cond and noise_cond_edit predictions at every step, but these should be different. With this change, it gives the same results as the official code. Attaching a screenshot of the correction (corrected_sd_p2p).
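Roughly, the fix looks like this (a hedged sketch assuming a diffusers-style UNet/scheduler API; variable and function names are illustrative, not the repo's actual code, and classifier-free guidance is omitted):

```python
import torch

@torch.no_grad()
def denoise_with_edit(unet, scheduler, latents, cond_emb, cond_edit_emb):
    # keep TWO latent trajectories: one driven by the original prompt and one
    # by the edited prompt (the bug was sharing a single latent between them)
    latents_orig = latents.clone()
    latents_edit = latents.clone()

    # assumes scheduler.set_timesteps(...) was already called
    for t in scheduler.timesteps:
        # original-prompt branch: this pass also records the attention maps
        noise_cond = unet(latents_orig, t, encoder_hidden_states=cond_emb).sample

        # edited-prompt branch: reuses the recorded attention maps where masked
        noise_cond_edit = unet(latents_edit, t, encoder_hidden_states=cond_edit_emb).sample

        # each branch steps its OWN latent
        latents_orig = scheduler.step(noise_cond, t, latents_orig).prev_sample
        latents_edit = scheduler.step(noise_cond_edit, t, latents_edit).prev_sample

    return latents_orig, latents_edit
```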

sameerKgp commented on Aug 28, 2023

Hi, thanks for catching the mistake! The official repo code was released after mine, and I probably misunderstood this part of the algorithm from the paper... I haven't had time to revisit the algorithm since I originally wrote it.

Does this change improve the quality of the generations? If you don't mind, feel free to create a pull request or fork this repo...

Edit: Also, I'm just stunned that the method was working with this bug. I don't quite understand what you mean by "the cross attention editing is not having any effect": if you add "in winter" to an SD prompt without using this repo, the entire image changes, but with this repo it doesn't, so the cross-attention editing does seem to have an effect.

bloc97 commented on Aug 28, 2023

I meant that if you just replace the self-attention maps for the first 20 or so steps and use the cross-attention maps from the edit prompt only, it also gives similar results. But that is just an observation, not a problem with the code. Self-attention seems to be more important for preserving the scene layout in many cases.
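Something along these lines per denoising step (a rough sketch of that experiment; the module flags here are made up for illustration and are not the repo's actual API):

```python
def configure_attention_editing(attn_modules, step: int, self_attn_steps: int = 20):
    """Toggle attention injection for one denoising step.

    `attn_modules` is assumed to be an iterable of patched attention modules
    exposing `is_self_attention` and `use_saved_attn_slice` flags; these names
    are hypothetical, not the repo's actual interface.
    """
    for m in attn_modules:
        if m.is_self_attention:
            # inject the saved source-prompt self-attention only during the
            # first `self_attn_steps` steps; this is what preserves the layout
            m.use_saved_attn_slice = step < self_attn_steps
        else:
            # cross-attention: use whatever the edited prompt produces
            m.use_saved_attn_slice = False
```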

sameerKgp commented on Aug 30, 2023