CrossAttentionControl
Question about original google implementation with stable diffusion
Hi bloc, firstly thank you for your great work! I've been spending a lot of time trying to implement Google's original release into a custom pipeline with diffusers. I figured it wouldn't be too difficult, since they have an example there running with SD that looks pretty good, but I'm getting very strange results even though everything seems to be in working order. I considered that it might be because I was using SD1.5 whereas they had been using 1.4, but I don't think there were any changes in architecture that would be causing that?
Could you elaborate a bit more on the changes you made to get it to work with stable?
Hi, I'm not too sure about the code differences between my implementation and the original, as this repo's code is not a modification of the authors' code but an independent implementation from scratch (there was no official implementation when this repo was made). However, I might be able to help spot common problems; what exactly are the "strange results" you are getting?
Could you elaborate a bit more on the changes you made to get it to work with stable?
The main difference between Imagen and Stable Diffusion is that Stable Diffusion has an additional attn1 self-attention layer that is very important for image generation, while in the paper they only modify the attn2 cross-attention layer. The modification in this case is simply to also edit and/or copy the attn1 layers alongside the attn2 layers at the same time.
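For illustration, here is a minimal sketch (not this repo's actual code) of one way to do that with a recent diffusers UNet: a custom attention processor recomputes the attention probabilities explicitly so a controller callback can edit them, and the same processor is registered on both attn1 and attn2. The `EditedAttnProcessor` name and the `controller` callback are placeholders, and details such as attention masks and the newer fused-attention code paths are skipped here.

```python
import torch
from diffusers.models.attention_processor import Attention


class EditedAttnProcessor:
    """Recomputes attention explicitly so the probabilities can be edited."""

    def __init__(self, controller, is_cross):
        self.controller = controller  # callable(probs, is_cross) -> probs
        self.is_cross = is_cross

    def __call__(self, attn: Attention, hidden_states,
                 encoder_hidden_states=None, attention_mask=None, **kwargs):
        # attn1 (self-attention): encoder_hidden_states is None.
        # attn2 (cross-attention): encoder_hidden_states holds the text embeddings.
        context = hidden_states if encoder_hidden_states is None else encoder_hidden_states
        query = attn.to_q(hidden_states)
        key = attn.to_k(context)
        value = attn.to_v(context)
        query, key, value = map(attn.head_to_batch_dim, (query, key, value))

        probs = attn.get_attention_scores(query, key, attention_mask)
        probs = self.controller(probs, self.is_cross)  # prompt-to-prompt edit happens here

        out = torch.bmm(probs, value)
        out = attn.batch_to_head_dim(out)
        out = attn.to_out[0](out)  # output projection
        out = attn.to_out[1](out)  # dropout
        return out


def register_controller(unet, controller):
    # Processor names end in "...attn1.processor" (self) or "...attn2.processor" (cross).
    procs = {}
    for name in unet.attn_processors.keys():
        procs[name] = EditedAttnProcessor(controller, is_cross="attn2" in name)
    unet.set_attn_processor(procs)
```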
The functions they use search through all named attn layers of the model and make the modifications as needed for self-attention and cross-attention, so I would think that shouldn't be too much of a problem? https://github.com/google/prompt-to-prompt/blob/main/prompt-to-prompt_stable.ipynb here is the link to their demo with SD.
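For reference, a quick sanity check of which layers that traversal should be touching (just a throwaway helper I'm assuming here, with `pipe` being a standard diffusers StableDiffusionPipeline) would look something like:

```python
def list_attention_layers(unet):
    # Print every attention layer the traversal would modify and whether it is
    # self-attention (attn1) or cross-attention (attn2).
    for name, module in unet.named_modules():
        if name.endswith("attn1") or name.endswith("attn2"):
            kind = "cross" if name.endswith("attn2") else "self"
            print(f"{kind:5s}  {name}")

# list_attention_layers(pipe.unet)  # every transformer block should show both attn1 and attn2
```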
Here is an example using the original prompt: "a panda at a picnic" with target prompt: "a dog at a picnic"
(the replace method requires that only one word is altered)

Meanwhile, this is the original output I get on that seed pre-injection, as well as post-injection when I set the attention replace steps to 0

The only thing I can think of is that the example was done with SD1.4, but that doesn't seem like it would affect it. Additionally, since the effects entirely take place in the UNet, I haven't looked into what happens at any other part of the process, but I could definitely be missing something.
Are there any variables you'd recommend printing out? I am pretty new to the lower-level parts of attention, so I don't have a great idea of where to start. Thank you for your reply!
Never mind, got it working! I didn't realize that the prompt that goes into the text encoder has to be the new one. I'll be trying your repo as well afterwards.
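In case it helps anyone else, the fix boils down to something like this (a rough sketch with illustrative names; the attention-map recording/injection loop itself is omitted):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def encode(prompt):
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]

source_embeddings = encode("a panda at a picnic")  # pass 1: generate and record attention maps
target_embeddings = encode("a dog at a picnic")    # pass 2: the NEW prompt conditions the UNet
# The second denoising run uses target_embeddings while the cross-attention maps
# saved from the source run are injected into the attn layers.
```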