An observation
Hi, thanks for the code. I have observed that in the examples you provided, even if I directly use the cross attention from the edited prompt by commenting out the line `attn_slice = attn_slice * (1 - self.last_attn_slice_mask) + new_attn_slice * self.last_attn_slice_mask`, I get the same result in most cases. I checked the cases where words are replaced and where new phrases like "in winter" are added. So it seems the cross attention editing is not having any effect. Please comment on this. Thanks.
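For context, that line just linearly blends the two attention maps according to the mask. A minimal sketch of the blend, with `current`, `saved`, and `mask` standing in for the repo's `attn_slice`, `new_attn_slice`, and `self.last_attn_slice_mask`:

```python
import torch

def blend_attention(current: torch.Tensor, saved: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    """Where mask == 1, take the saved attention (from the other prompt's
    pass); where mask == 0, keep the attention just computed. Commenting the
    blend out leaves `current` untouched, i.e. the edited prompt's own cross
    attention is used everywhere."""
    return current * (1 - mask) + saved * mask
```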
I think there is a bug in the `stablediffusion` function of CrossAttention_Release_NoImages.py. The same latent is being used for both the noise_cond and noise_cond_edit predictions at every step, but these should be different. With this change, it gives the same results as the official code. Attaching a screenshot of the correction.
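This is roughly the structure the loop ends up with after the fix (a sketch only, using diffusers-style UNet/scheduler calls; `save_attn`, `inject_attn`, and the argument names are placeholders, not the repo's actual functions):

```python
import torch

@torch.no_grad()
def denoise_with_separate_latents(unet, scheduler, init_latent,
                                  emb_uncond, emb_cond, emb_cond_edit,
                                  save_attn, inject_attn, guidance_scale=7.5):
    """Illustrative fix: the original prompt and the edited prompt each keep
    their OWN latent across steps, instead of both noise predictions being
    stepped from the same latent. save_attn/inject_attn stand in for whatever
    hooks store and reuse the attention maps."""
    latent = init_latent.clone()
    latent_edit = init_latent.clone()  # same starting noise, separate trajectory

    for t in scheduler.timesteps:
        # Unconditional predictions for classifier-free guidance
        noise_uncond = unet(latent, t, encoder_hidden_states=emb_uncond).sample
        noise_uncond_edit = unet(latent_edit, t, encoder_hidden_states=emb_uncond).sample

        # Original prompt: record its attention maps during this pass
        save_attn()
        noise_cond = unet(latent, t, encoder_hidden_states=emb_cond).sample

        # Edited prompt: reuse the saved maps, but denoise ITS OWN latent
        inject_attn()
        noise_cond_edit = unet(latent_edit, t, encoder_hidden_states=emb_cond_edit).sample

        # Guidance, then step each trajectory independently
        noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        noise_edit = noise_uncond_edit + guidance_scale * (noise_cond_edit - noise_uncond_edit)
        latent = scheduler.step(noise, t, latent).prev_sample
        latent_edit = scheduler.step(noise_edit, t, latent_edit).prev_sample

    return latent, latent_edit
```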
Hi, thanks for catching the mistake! The official repo code was released after mine, and I probably misunderstood this part of the algorithm from the paper... I haven't had time to revisit the algorithm since I originally wrote it.
Does this change improve the quality of the generations? If you don't mind, feel free to create a pull request or fork this repo...
Edit: Also, I'm just stunned that the method was working with this bug. I don't quite understand what you mean by "the cross attention editing is not having any effect": if you add "in winter" to an SD prompt without using this repo, the entire image changes, but with this repo the rest of the image is preserved, so cross attention editing does seem to have an effect.
I meant that if you just replace the self-attention maps for the first 20 or so steps and use only the cross-attention maps from the edit prompt, it also gives similar results. But that is just an observation, not a problem with the code. Self-attention seems to be more important for preserving the scene layout in many cases.
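Concretely, the variant I tried amounts to a per-step schedule like this (a rough sketch; the flag names and the ~40% threshold, i.e. about 20 of 50 steps, are mine, not the repo's):

```python
def attention_injection_flags(step_index: int, total_steps: int,
                              self_attn_fraction: float = 0.4):
    """Reuse the original prompt's SELF-attention maps only for the first
    ~40% of steps (roughly 20 of 50) to preserve the scene layout, and let
    the edited prompt compute its own CROSS-attention at every step."""
    inject_self_attention = step_index < int(self_attn_fraction * total_steps)
    inject_cross_attention = False
    return inject_self_attention, inject_cross_attention
```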