CrossAttentionControl

Question about the original Google implementation with Stable Diffusion

Open ethansmith2000 opened this issue 2 years ago • 3 comments

Hi bloc, firstly thank you for your great work! I've been spending a lot of time trying to implement Google's original release in a custom pipeline with diffusers. I figured it wouldn't be too difficult, since they have an example there running with SD that looks pretty good, but I'm getting very strange results even though everything seems to be in working order. I considered that it might be because I've been using SD 1.5 whereas they used 1.4, but I don't think there were any architecture changes between the two that would cause that?

Could you elaborate a bit more on the changes you made to get it to work with Stable Diffusion?

ethansmith2000 avatar Dec 11 '22 07:12 ethansmith2000

Hi, I'm not too sure about the code differences between my implementation and the original, as this repo's code is not a modification of the authors' code but an independent implementation from scratch (there was no official implementation when this repo was made). However, I might be able to help spot common problems. What exactly are the "strange results" you are getting?

Could you elaborate a bit more on the changes you made to get it to work with Stable Diffusion?

The main difference between Imagen and Stable Diffusion is that Stable Diffusion has an additional attn1 self-attention layer that is very important for image generation, while the paper only modifies the attn2 cross-attention layer. The modification in this case is simply to also edit and/or copy the attn1 layer at the same time as the attn2 layer.
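To make that concrete, here is a minimal, framework-agnostic sketch of the idea (the class and function names below are purely illustrative, not this repo's actual code): the same store/replace logic is applied to the attention maps of both the attn1 (self-attention) and attn2 (cross-attention) blocks.

```python
import torch

# Illustrative sketch only -- names are hypothetical, not this repo's API.
class AttentionStore:
    """Stores attention maps during the source-prompt pass and replays them
    during the edited-prompt pass, for BOTH self- (attn1) and cross- (attn2) attention."""
    def __init__(self):
        self.maps = {}        # {layer_name: attention probabilities from the source pass}
        self.mode = "store"   # "store" on the source pass, "replace" on the edited pass

    def __call__(self, layer_name, attention_probs):
        if self.mode == "store":
            self.maps[layer_name] = attention_probs.detach()
            return attention_probs
        # Replace: inject the maps recorded from the source prompt
        return self.maps[layer_name]

def controlled_attention(q, k, v, scale, controller, layer_name):
    """Plain scaled dot-product attention with a prompt-to-prompt style hook.
    Call this from both attn1 (self-attention) and attn2 (cross-attention),
    e.g. layer_name = "...transformer_blocks.0.attn1" or "...attn2"."""
    probs = torch.softmax((q @ k.transpose(-1, -2)) * scale, dim=-1)
    probs = controller(layer_name, probs)  # store or replace, depending on the pass
    return probs @ v
```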

bloc97 avatar Dec 11 '22 20:12 bloc97

The functions they use search through all named attention layers of the model and make the modifications as needed for both self- and cross-attention, so I would think that shouldn't be too much of a problem? Here is the link to their demo with SD: https://github.com/google/prompt-to-prompt/blob/main/prompt-to-prompt_stable.ipynb (a rough sketch of that search is below).
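For reference, a rough sketch of walking the UNet and finding those layers (module names follow the diffusers UNet2DConditionModel convention of attn1 for self-attention and attn2 for cross-attention; the attributes set below are just illustrative bookkeeping, not prompt-to-prompt's actual code):

```python
def register_attention_control(unet, controller):
    count = 0
    for name, module in unet.named_modules():
        if name.endswith("attn1") or name.endswith("attn2"):
            module.layer_name = name
            module.attn_kind = "self" if name.endswith("attn1") else "cross"
            # In a real implementation this is where module.forward would be wrapped
            # (or a forward hook installed) so its attention probabilities are routed
            # through `controller`.
            count += 1
    return count  # printing this is a quick sanity check that layers are actually hooked
```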

Here is an example using the original prompt "a panda at a picnic" with the target prompt "a dog at a picnic" (the replace method requires that only one word is altered): [screenshot of the edited output]

Meanwhile, this is the original output I get on that seed pre-injection, as well as post-injection when I set the attention replace steps to 0: [screenshot of the unedited output]

The only thing I can think of is that the example was done with SD 1.4, but that doesn't seem like it would affect it. Additionally, since the effects take place entirely in the UNet, I haven't looked into what happens at any other point in the process, but I could definitely be missing something.

Are there any variables you'd recommend printing out? I'm pretty new to the lower-level parts of attention, so I don't have a great idea of where to start. Thank you for your reply!

ethansmith2000 avatar Dec 12 '22 01:12 ethansmith2000

Never mind, got it working! I didn't realize that the prompt that goes into the text encoder has to be the new one. I'll be trying your repo as well afterwards.
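For anyone hitting the same thing, a minimal sketch of the fix (assuming a diffusers StableDiffusionPipeline already loaded as `pipe`; `run_denoising` and `controller` are hypothetical stand-ins for your own loop and the controller idea above): the edited prompt's embeddings drive the second pass, while the source pass only supplies the attention maps.

```python
source_prompt = "a panda at a picnic"
edited_prompt = "a dog at a picnic"

def encode(prompt):
    # Standard CLIP text encoding, the same as what the diffusers pipeline does internally
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(pipe.device)
    return pipe.text_encoder(tokens)[0]

source_emb = encode(source_prompt)   # drives the first (reference) pass
edited_emb = encode(edited_prompt)   # drives the second pass -- this was the missing piece

# Pseudocode for the two denoising passes:
# controller.mode = "store";   run_denoising(latents, source_emb, controller)
# controller.mode = "replace"; run_denoising(latents, edited_emb, controller)
```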

ethansmith2000 avatar Dec 12 '22 02:12 ethansmith2000