stable-diffusion-webui
Implementation of Stable Diffusion with Aesthetic Gradients
Here is the original repo: https://github.com/vicgalle/stable-diffusion-aesthetic-gradients
Someone is already working on this in https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/2498; we should probably review it and see what's different.
File "D:\stable-diffusion-webui-aesthetic\modules\sd_hijack.py", line 411, in forward
z = z * (1 - self.aesthetic_weight) + zn * self.aesthetic_weight
RuntimeError: The size of tensor a (154) must match the size of tensor b (77) at non-singleton dimension 1
It seems that the token length is limited by the CLIP model: the prompt conditioning here is 154 tokens (two 77-token chunks), while the aesthetic conditioning is a single 77-token chunk, so the weighted sum fails.
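For reference, here is one way the shape mismatch could be handled (a sketch only, not necessarily what the PR ends up doing): since webui splits long prompts into 77-token chunks, the aesthetic conditioning can be tiled along the token dimension before the weighted sum.

```python
import torch

def blend_aesthetic(z: torch.Tensor, zn: torch.Tensor, weight: float) -> torch.Tensor:
    """Blend the prompt conditioning `z` with the aesthetic conditioning `zn`.

    webui chunks long prompts, so `z` can be (batch, 154, 768) while `zn`
    stays (batch, 77, 768). Repeating `zn` along the token dimension makes
    the weighted sum broadcastable again.
    """
    if zn.shape[1] != z.shape[1]:
        # assumes z's token length is a multiple of zn's (77, 154, 231, ...)
        zn = zn.repeat(1, z.shape[1] // zn.shape[1], 1)
    return z * (1 - weight) + zn * weight
```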
This seems to work well, but the default values are a bit odd.
The repo recommends an aesthetic learning rate of 0.0001, but you default to 0.005, which is 50× higher. Is there a specific reason for this?
Similarly, for aesthetic steps the repo recommends starting with relatively small step counts, but the default in this PR is the highest value the UI allows.
To be quick, I put in "random" default values 😅 I fixed the token-length problem and added the UI for generating the embedding. I need a few hours of sleep; I'll commit the code tomorrow.
This feature is actually way more interesting than I thought. The variations you can obtain using image embeddings are pretty amazing. I am still trying to figure out how to use all the different sliders and what they do... I really hope this gets merged someday.
I noticed that creating a new image embedding does not automatically add it to the pull-down in txt2img. Just a nitpick.
Quick example for those wondering. I created an image embedding from a bunch of big eyes paintings and tried to apply it to the simple "a beautiful woman" seed 0 prompt. Here are the results:
Original prompt image:
Applying the image embedding style with aesthetic: learning rate 0.001, weight 0.85 and steps 40:
Increasing the weight to 1 increases the style application, resulting in something closer to the original paintings:
Bringing it down to 0.5 will obviously reduce the effect:
And the beauty is that it requires almost no computing time. This is next level stuff... Magic!!!
Another example using the same prompt as above. I created an image embedding from a bunch of images at: https://lexica.art/?q=aadb4a24-2469-47d8-9497-cafc1f513071
After some fine tuning of the weights and learning rate I was able to get:
And from those https://lexica.art/?q=1f5ef1e0-9f3a-48b8-9062-d9120ba09274 I got:
And all this with literally no training whatsoever. AMAZING!
> This feature is actually way more interesting than I thought. The variations you can obtain using image embeddings are pretty amazing. I am still trying to figure out how to use all the different sliders and what they do... I really hope this gets merged someday.
> I noticed that creating a new image embedding does not automatically add it to the pull-down in txt2img. Just a nitpick.

Little bug. I'll fix it.
I even tried feeding it 19 pictures of me in a non-1:1 aspect ratio (512x640) and, gosh darn, it produced passable results!
Sample input image:
Prompt with no Aesthetic applied:
Aesthetic applied:
Not as good as if I trained Dreambooth or TI, but for one minute of fiddling it is amazing. It appears to apply the overall pose of some of the pictures I fed it. I wonder what would happen if I fed the thing 100+ photos of me in varying sizes... It is as if the size and aspect ratio of the images you feed it do not matter.
And what is amazing is that it does all this with a 4KB file!
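For anyone wondering why the file is so small: the original aesthetic-gradients repo builds the embedding by resizing every image to CLIP's 224x224 input (which is also why aspect ratio barely matters), encoding each one with the CLIP vision tower, and averaging, so what gets saved is a single 768-float vector of a few KB. A rough sketch with Hugging Face transformers (normalization details and file layout are my assumptions, not the PR's exact code):

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-large-patch14"  # the CLIP used by SD v1 models

def build_aesthetic_embedding(image_dir: str, out_path: str) -> None:
    model = CLIPModel.from_pretrained(MODEL_NAME)
    processor = CLIPProcessor.from_pretrained(MODEL_NAME)

    embeddings = []
    for path in sorted(Path(image_dir).glob("*")):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
            continue
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")  # resizes/crops to 224x224
        with torch.no_grad():
            feats = model.get_image_features(**inputs)          # (1, 768)
        embeddings.append(feats / feats.norm(dim=-1, keepdim=True))

    # Average the per-image embeddings into a single style vector.
    aesthetic = torch.cat(embeddings).mean(dim=0, keepdim=True)
    torch.save(aesthetic, out_path)                             # 768 floats, ~3-4 KB on disk

build_aesthetic_embedding("big_eyes_paintings/", "big_eyes.pt")
```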
I'd suggest hiding the interface behind the Extra checkbox or at least moving it lower. It's quite large and pushes more commonly used options like CFG and Batch size/count off-screen.
> I'd suggest hiding the interface behind the Extra checkbox or at least moving it lower. It's quite large and pushes more commonly used options like CFG and Batch size/count off-screen.

Indeed. I doubt Automatic will like it where it is now... The best solution would be some sort of tabs inside the parameter section: the current options in a default tab, with the aesthetic options in an aesthetic tab beside it.
On a separate note... do you think the same thing could be added to img2img to offer better conformity to the original image? I sometimes feel the aesthetic model is difficult to control. At some point it totally changes the original image instead of just changing its overall style. If it were possible to control the weight of the aesthetic on top of the resulting prompt image without losing the whole look, it would be even better.
Another quick test. Old bearded man:
Prompt no aesthetic:
Aesthetic applied:
> On a separate note... do you think the same thing could be added to img2img to offer better conformity to the original image? I sometimes feel the aesthetic model is difficult to control. At some point it totally changes the original image instead of just changing its overall style. If it were possible to control the weight of the aesthetic on top of the resulting prompt image without losing the whole look, it would be even better.

Yes, I think it could work right now, but I have not added the UI.
> I'd suggest hiding the interface behind the Extra checkbox or at least moving it lower. It's quite large and pushes more commonly used options like CFG and Batch size/count off-screen.

Today I will move the panel and compact the interface. I think it makes more sense near the prompts, but it can go back there again in the future.
While this really is nice work, I definitely will not accept code that clutters the UI for users who don't want to use this, and I won't accept changes in code where you just take an existing line and change formatting of it without changing what it does.
Changing PIL.Image.BICUBIC to PIL.Image.Resampling.BICUBIC will break some old versions of PIL on Colab, so do not do that.
Why is there another CLIP being created when we already have one? If it is really needed for this, why is it always created regardless of whether the user wants the gradients?
An additional thing I'm going to ask of you is to isolate as much of your code into separate files as possible. The big chunk of code in sd_hijack should be in its own file. All the parameters of aesthetic gradients should be in members of your own class defined in your own file, not in sd_hijack.
One possible solution for a non-cluttered UI is to let the user specify an aesthetic embedding as text in the prompt; something like this:
a tree <aesthetic:weight=0.8, steps=30, slerp>
This would also have the benefit of putting all the parameters into the infotext, so that other users you share the prompt with will be able to reproduce it if they have the embedding.
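If that syntax were adopted, parsing it out of the prompt would be straightforward; a hypothetical sketch (the tag format is only the proposal above, and all names and defaults here are illustrative):

```python
import re

# Hypothetical parser for the proposed "<aesthetic:...>" prompt syntax.
AESTHETIC_TAG = re.compile(r"<aesthetic:([^>]+)>")

def extract_aesthetic_params(prompt: str):
    """Strip the tag from the prompt and return (clean_prompt, params)."""
    params = {"weight": 0.9, "steps": 20, "slerp": False}  # illustrative defaults
    match = AESTHETIC_TAG.search(prompt)
    if match:
        for part in match.group(1).split(","):
            part = part.strip()
            if part == "slerp":
                params["slerp"] = True
            elif "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = float(value)
        prompt = AESTHETIC_TAG.sub("", prompt).strip()
    return prompt, params

print(extract_aesthetic_params("a tree <aesthetic:weight=0.8, steps=30, slerp>"))
# -> ('a tree', {'weight': 0.8, 'steps': 30.0, 'slerp': True})
```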
> Why is there another CLIP being created when we already have one? If it is really needed for this, why is it always created regardless of whether the user wants the gradients?

The only CLIP I found is the CLIPTextModel, but we also need the text_projection, which lives in the CLIPModel class, and the CLIPVisionModel to generate the embeddings.
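To make the distinction concrete, here is a minimal illustration with Hugging Face transformers: the text-only class webui already uses has no projection head or vision tower, while the full CLIPModel has both. (Loading the full model lazily, only when the feature is enabled, would presumably address the "always created" concern.)

```python
from transformers import CLIPModel, CLIPTextModel

full_clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# What webui already holds for conditioning: just the text transformer.
text_only = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

print(hasattr(text_only, "text_projection"))   # False: no projection head here
print(full_clip.text_projection)               # Linear(768 -> 768, bias=False)
print(type(full_clip.vision_model).__name__)   # the vision tower used to embed images
```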
> An additional thing I'm going to ask of you is to isolate as much of your code into separate files as possible. The big chunk of code in sd_hijack should be in its own file. All the parameters of aesthetic gradients should be in members of your own class defined in your own file, not in sd_hijack.

WIP!!
> On a separate note... do you think the same thing could be added to img2img to offer better conformity to the original image? I sometimes feel the aesthetic model is difficult to control. At some point it totally changes the original image instead of just changing its overall style. If it were possible to control the weight of the aesthetic on top of the resulting prompt image without losing the whole look, it would be even better.

Added
I like the now-expandable aesthetic section. This is a step in the right direction, and I hope Automatic will approve of it.
I tested the img2img implementation and it works very well. I was able to keep the general composition of the original and transform it toward the aesthetic without losing too much... NICE. Here is an example of applying the Big Eyes style to a photo of a man:
Original:
Styled with big eyes:
and the overall config:
Trying to apply the same aesthetic on the source txt2img with the same seed results in this... which is not what I want:
I think the better workflow is:
- Use text2img to get a good starting image (or just use an external image as a source)
- send it to img2img
- apply the aesthetic changes there and tweak to taste
Something else I noticed. Is there a reason the Aesthetic optimization is always computed? If no parameters for it have changed from generation to generation, could it not just be used from memory cache instead of always being recomputed?
> Something else I noticed. Is there a reason the Aesthetic optimization is always computed? If no parameters for it have changed from generation to generation, could it not just be used from memory cache instead of always being recomputed?

When the seed changes, so does the training result!!!
@bmaltais Looking at the original aesthetic gradients repo, the personalization step involves performing gradient descent to make the prompt embedding more similar to the aesthetic embedding. In other words, it has to be recomputed for each prompt. ~~But it shouldn't be affected by the seed as far as I can tell.~~ Actually, isn't the process nondeterministic regardless of seed unless you enable determinism in pytorch itself? Can someone test if running the same settings twice produces the same image?
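For anyone curious, the personalization step can be sketched as a few steps of gradient ascent on the cosine similarity between the projected prompt embedding and the aesthetic embedding, followed by re-encoding the prompt with the nudged weights. This is only my reading of the original repo, with assumed names and defaults; the explicit seeding is what would make the result repeatable, which is relevant to the determinism question above.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

def personalize(prompt: str, aesthetic_emb: torch.Tensor,
                lr: float = 1e-4, steps: int = 20, seed: int = 0) -> torch.Tensor:
    """Nudge the CLIP text encoder so the prompt embedding moves toward `aesthetic_emb`.

    `aesthetic_emb` is assumed to be a (1, 768) tensor such as the one saved above.
    Seeding torch makes the run repeatable on CPU; on GPU you may also need
    torch.use_deterministic_algorithms(True) for bit-exact results.
    """
    torch.manual_seed(seed)
    clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       max_length=77, return_tensors="pt")
    target = aesthetic_emb / aesthetic_emb.norm(dim=-1, keepdim=True)

    optimizer = torch.optim.Adam(clip.text_model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        emb = clip.get_text_features(**tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        loss = -torch.cosine_similarity(emb, target).mean()  # pull toward the aesthetic
        loss.backward()
        optimizer.step()

    # Conditioning produced with the personalized text encoder (last hidden state).
    with torch.no_grad():
        return clip.text_model(**tokens).last_hidden_state
```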
I think there should be an option to do the Aesthetic optimization on the CPU before sending the result back to the GPU for the image-generation process. This might be useful for people with limited VRAM, so they won't run out of VRAM when computing the Aesthetic optimization.
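A sketch of that suggestion, reusing the hypothetical personalize() helper from the snippet above: keep the extra CLIP copy and the optimization entirely on the CPU, and only move the small resulting conditioning tensor to the GPU for sampling.

```python
import torch

# Sketch only: `personalize` and "big_eyes.pt" are from the snippets above.
aesthetic_emb = torch.load("big_eyes.pt", map_location="cpu")

# Run the few optimization steps on the CPU copy of CLIP...
conditioning = personalize("a beautiful woman", aesthetic_emb, lr=1e-4, steps=20)

# ...then ship only the resulting tensor to the GPU for image generation.
conditioning = conditioning.to("cuda", dtype=torch.float16)
```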
Is there a tutorial on how to set this up/train it?
Have a look over here: Using Aesthetic Images Embeddings to improve Dreambooth or TI results · Discussion #3350 · AUTOMATIC1111/stable-diffusion-webui — https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/3350
So is there any hope of doing this on 4GB of VRAM? My poor card has been able to handle everything (besides training) up to 576x576 so far with --medvram, VAEs, hypernetworks, upscalers, etc., but this puts me OOM after the first pass. :sweat_smile:
It seems like "Aesthetic text for imgs" and slerp angle are somehow off... Values between 0.001 and 0.02 seem to cause the aesthetic text to influence the embedding in a meaningful way. But 0.2 to 1.0 seem random and not to have that much effect relative to each other. If I use "colorful painting", for instance (0.0 = ignore text, 0.001 = it adds color and flowers, 0.2 to 1.0 = the image seems to lose style altogther, and is neither colorful nor painterly.
The DALL·E 2 paper specifies that the max angle to use is in [0.25, 0.5]. (TextDiff)
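For anyone reading along, here is the standard spherical interpolation that "slerp" refers to, with the DALL·E 2 range mentioned above shown as a clamp on the slider value. This is a sketch under the assumption that the PR uses plain slerp between the image-based aesthetic embedding and the aesthetic-text embedding; I have not verified its exact code.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two embedding vectors, t in [0, 1]."""
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    omega = torch.acos((a_n * b_n).sum(-1).clamp(-1.0, 1.0))  # angle between the directions
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / so).unsqueeze(-1) * a + \
           (torch.sin(t * omega) / so).unsqueeze(-1) * b

# Hypothetical usage: blend the image-based aesthetic embedding toward the
# "Aesthetic text for imgs" embedding, clamping the slider to the cited range.
# t = min(max(slerp_angle, 0.0), 0.5)
# blended = slerp(aesthetic_image_emb, aesthetic_text_emb, t)
```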