
Implementation of Stable Diffusion with Aesthetic Gradients

Open MalumaDev opened this issue 1 year ago • 25 comments

Here is the original repo: https://github.com/vicgalle/stable-diffusion-aesthetic-gradients

[two screenshots attached]

MalumaDev avatar Oct 14 '22 09:10 MalumaDev

Someone is working on this: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/2498. Should probably review it and see what's different.

TingTingin avatar Oct 14 '22 10:10 TingTingin

File "D:\stable-diffusion-webui-aesthetic\modules\sd_hijack.py", line 411, in forward
    z = z * (1 - self.aesthetic_weight) + zn * self.aesthetic_weight
RuntimeError: The size of tensor a (154) must match the size of tensor b (77) at non-singleton dimension 1

It seems that the token length is limited by the CLIP model.

ShadowPower avatar Oct 14 '22 17:10 ShadowPower
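
For context, the error above occurs because the webui stacks conditioning in 77-token chunks when a prompt exceeds 75 tokens, while the aesthetic-personalized conditioning is a single 77-token chunk. Below is a minimal sketch, using the variable names from the traceback, of one way to make the shapes compatible before the weighted blend; the actual fix committed to the PR may differ.

```python
import torch

def blend_aesthetic(z: torch.Tensor, zn: torch.Tensor, aesthetic_weight: float) -> torch.Tensor:
    # `z` is the prompt conditioning, e.g. [batch, 154, 768] for a long prompt
    # (two stacked 77-token chunks); `zn` is the aesthetic conditioning, [batch, 77, 768].
    if zn.shape[1] != z.shape[1]:
        # Repeat the aesthetic conditioning along the token dimension so the
        # elementwise blend below is defined for multi-chunk prompts.
        zn = zn.repeat(1, z.shape[1] // zn.shape[1], 1)
    return z * (1 - aesthetic_weight) + zn * aesthetic_weight
```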

This seems to work well, but the default values are a bit odd.

The repo recommends an aesthetic learning rate of 0.0001, but you default to 0.005, which is 50 times higher. Is there a specific reason for this?

Similarly, for aesthetic steps the repo recommends starting with relatively small step counts, but the default in this PR is the highest value the UI allows.

EliEron avatar Oct 14 '22 20:10 EliEron

To be quick, I put in "random" default values 😅. I fixed the token-length problem and added a UI for generating the embedding. I need some hours of sleep; I'll commit the code tomorrow.

MalumaDev avatar Oct 14 '22 21:10 MalumaDev

This feature is actually way more interesting than I thought. The variations you can obtain using the image embeddings are pretty amazing. I am still trying to figure out how to use all the different sliders and what they do... I really hope this will get merged someday.

I noticed that a newly created image embedding does not automatically get added to the pull-down in text2img. Just a nitpick.

bmaltais avatar Oct 15 '22 16:10 bmaltais

Quick example for those wondering. I created an image embedding from a bunch of big eyes paintings and tried to apply it to the simple "a beautiful woman" seed 0 prompt. Here are the results:

Original prompt image: image

Applying the image embedding style with aesthetic learning rate 0.001, weight 0.85, and steps 40: image

Increasing the weight to 1 increases the style application, resulting in something closer to the original paintings: image

Bringing it down to 0.5 will obviously reduce the effect: image

And the beauty is that it requires almost no computing time. This is next level stuff... Magic!!!

bmaltais avatar Oct 15 '22 17:10 bmaltais

Another example using the same prompt as above. I created an image embedding from a bunch of images at: https://lexica.art/?q=aadb4a24-2469-47d8-9497-cafc1f513071

After some fine tuning of the weights and learning rate I was able to get: image

And from those https://lexica.art/?q=1f5ef1e0-9f3a-48b8-9062-d9120ba09274 I got:

image

And all this with literally no training whatsoever. AMAZING!

bmaltais avatar Oct 15 '22 17:10 bmaltais

This feature is actually way more interesting than I thought. The variations you can obtain using the image embeddings are pretty amazing. I am still trying to figure out how to use all the different sliders and what they do... I really hope this will get merged someday.

I noticed that a newly created image embedding does not automatically get added to the pull-down in text2img. Just a nitpick.

Little bug. I'll fix it.

MalumaDev avatar Oct 15 '22 17:10 MalumaDev

I even tried feeding it 19 pictures of me in a non-1:1 aspect ratio (512x640) and gosh darn... it produced passable results!

Sample input image:

00000-0-a man with a beard and a white shirt is smiling at the camera with a waterfall in the background

Prompt with no Aesthetic applied:

image

Aesthetic applied:

image

Not as good as if I trained Dreambooth or TI, but for one minute of fiddling it is amazing. It appears to apply the overall pose of some of the pictures I fed it. I wonder what would happen if I fed the thing 100+ photos of me in varying sizes... It is as if the size and ratio of the images you feed it do not matter.

And what is amazing is that it does all this with a 4KB file!

bmaltais avatar Oct 15 '22 18:10 bmaltais

I'd suggest hiding the interface behind the Extra checkbox or at least moving it lower. It's quite large and pushes more commonly used options like CFG and Batch size/count off-screen.

feffy380 avatar Oct 15 '22 23:10 feffy380

I'd suggest hiding the interface behind the Extra checkbox or at least moving it lower. It's quite large and pushes more commonly used options like CFG and Batch size/count off-screen.

Indeed. I doubt Automatic will like it where it is now... The best approach would be some sort of tabs inside the parameter section: the current options in a default tab, with the aesthetic options in an aesthetic tab beside it.

bmaltais avatar Oct 16 '22 00:10 bmaltais

On a separate note... do you think the same thing could be added to img2img to offer better conformity to the original image? I sometimes feel the aesthetic model is difficult to control. At some point it totally changes the original image instead of just changing its overall style. If it were possible to control the weight of the aesthetic on top of the resulting prompt image without losing the whole look, it would be even better.

bmaltais avatar Oct 16 '22 00:10 bmaltais

Another quick test. Old bearded man:

Prompt no aesthetic:

00878-2002862293-morgan freeman starring as gandalf in lord of the rings, epic dark fantasy horror stylized oil painting by ivan shiskin  trendin

Aesthetic applied:

00876-3351746598-morgan freeman starring as gandalf in lord of the rings, epic dark fantasy horror stylized oil painting by ivan shiskin  trendin

bmaltais avatar Oct 16 '22 02:10 bmaltais

On a separate note... do you think the same thing could be added to img2img to offer better conformity to the original image? I sometimes feel the aesthetic model is difficult to control. At some point it totally changes the original image instead of just changing its overall style. If it were possible to control the weight of the aesthetic on top of the resulting prompt image without losing the whole look, it would be even better.

Yes, I think it could work right now, but I have not added the UI.

MalumaDev avatar Oct 16 '22 06:10 MalumaDev

I'd suggest hiding the interface behind the Extra checkbox or at least moving it lower. It's quite large and pushes more commonly used options like CFG and Batch size/count off-screen.

Today I will move the panel and compact the interface. I think it makes more sense near the prompts, but it can go back there again in the future.

MalumaDev avatar Oct 16 '22 06:10 MalumaDev

While this really is nice work, I definitely will not accept code that clutters the UI for users who don't want to use this, and I won't accept changes where you take an existing line and change its formatting without changing what it does.

Changing PIL.Image.BICUBIC to PIL.Image.Resampling.BICUBIC will break some older versions of PIL on Colab, so do not do that.

Why is there another CLIP being created when we already have one? If it is really needed for this, why is it always created regardless of whether the user wants the gradients?

AUTOMATIC1111 avatar Oct 16 '22 07:10 AUTOMATIC1111

An additional thing I'm going to ask of you is to isolate as much of your code into separate files as possible. The big chunk of code in sd_hijack should be in its own file. All the parameters of aesthetic gradients should be members of your own class defined in your own file, not in sd_hijack.

One possible solution for a non-cluttered UI is to let the user specify an aesthetic embedding as text in the prompt; something like this:

a tree <aesthetic:weight=0.8, steps=30, slerp>

This will also have the benefit of putting all parameters into the infotext, so that other users you share your prompt with will be able to reproduce it if they have the embedding.

AUTOMATIC1111 avatar Oct 16 '22 07:10 AUTOMATIC1111
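
For illustration, here is a minimal sketch of how such a prompt tag could be parsed. The `<aesthetic:...>` syntax is only a suggestion in this thread, and the function below is hypothetical, not code from the PR.

```python
import re

AESTHETIC_TAG = re.compile(r"<aesthetic:([^>]*)>")

def extract_aesthetic_params(prompt: str):
    """Return (prompt without the tag, dict of parameters). Bare words like
    'slerp' become boolean flags; key=value pairs are parsed as floats when possible."""
    match = AESTHETIC_TAG.search(prompt)
    if match is None:
        return prompt, {}
    params = {}
    for part in match.group(1).split(","):
        part = part.strip()
        if "=" in part:
            key, value = part.split("=", 1)
            try:
                params[key.strip()] = float(value)
            except ValueError:
                params[key.strip()] = value.strip()
        elif part:
            params[part] = True
    return AESTHETIC_TAG.sub("", prompt).strip(), params

# extract_aesthetic_params("a tree <aesthetic:weight=0.8, steps=30, slerp>")
# -> ("a tree", {"weight": 0.8, "steps": 30.0, "slerp": True})
```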

Why is there another CLIP being created when we already have one? If it is really needed for this, why is it always created regardless of whether the user wants the gradients?

The only CLIP that I found is the CLIPTextModel, but we also need the text_projection that is in the CLIPModel class, and the CLIPVisionModel to generate the embeddings.

MalumaDev avatar Oct 16 '22 08:10 MalumaDev
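
For readers unfamiliar with the distinction: the webui only loads CLIPTextModel, which has neither the text_projection layer nor the vision tower needed to turn a folder of images into an aesthetic embedding. A rough sketch of how the full CLIPModel from the transformers library provides both is shown below; the checkpoint name and preprocessing here are assumptions and may differ from the PR.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def make_aesthetic_embedding(image_paths):
    """Average the normalized CLIP image embeddings of a set of images."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)      # [n, 768] projected image features
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize each image
    return feats.mean(dim=0, keepdim=True)             # single 1x768 aesthetic embedding
```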

An additional thing I'm going to ask of you is to isolate as much of your code into separate files as possible. The big chunk of code in sd_hijack should be in its own file. All the parameters of aesthetic gradients should be members of your own class defined in your own file, not in sd_hijack.

WIP!!

MalumaDev avatar Oct 16 '22 11:10 MalumaDev

On a separate note... do you think the same thing could be added to img2img to offer better conformity to the original image? I sometimes feel the aesthetic model is difficult to control. At some point it totally changes the original image instead of just changing its overall style. If it were possible to control the weight of the aesthetic on top of the resulting prompt image without losing the whole look, it would be even better.

Added

MalumaDev avatar Oct 16 '22 15:10 MalumaDev

I like the now-expandable aesthetic section. This is a step in the right direction and I hope Automatic will approve of it.

I tested the img2img implementation and it works very well. I was able to keep the general composition of the original and transform it toward the aesthetic without losing too much... NICE. Here is an example of applying the Big Eyes style to a photo of a man:

Original:

image

Styled with big eyes:

image

and the overall config:

image

Trying to apply the same aesthetic on the source text2img with the same seed would result in this... which is not what I want:

image

I think the better workflow is:

  • Use text2img to get a good starting image (or just use an external image as a source)
  • send it to img2img
  • apply the aesthetic changes there and tweak to taste

bmaltais avatar Oct 16 '22 20:10 bmaltais

Something else I noticed: is there a reason the aesthetic optimization is always computed? If none of its parameters have changed from generation to generation, could it not just be reused from a memory cache instead of being recomputed every time?

bmaltais avatar Oct 16 '22 20:10 bmaltais

Something else I noticed: is there a reason the aesthetic optimization is always computed? If none of its parameters have changed from generation to generation, could it not just be reused from a memory cache instead of being recomputed every time?

When the seed changes, so does the training result!!!

MalumaDev avatar Oct 17 '22 07:10 MalumaDev

@bmaltais Looking at the original aesthetic gradients repo, the personalization step involves performing gradient descent to make the prompt embedding more similar to the aesthetic embedding. In other words, it has to be recomputed for each prompt. ~~But it shouldn't be affected by the seed as far as I can tell.~~ Actually, isn't the process nondeterministic regardless of seed unless you enable determinism in pytorch itself? Can someone test if running the same settings twice produces the same image?

feffy380 avatar Oct 17 '22 10:10 feffy380
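
To make the point above concrete, here is a simplified sketch of the personalization idea from the original repo: a few optimizer steps pull the prompt's projected text features toward the aesthetic embedding by maximizing cosine similarity. In the actual implementation it is the text encoder's weights that are fine-tuned rather than the features directly, so treat this only as an illustration of why the result depends on the prompt and the optimization settings.

```python
import torch
import torch.nn.functional as F

def personalize_text_features(text_features: torch.Tensor,
                              aesthetic_embedding: torch.Tensor,
                              lr: float = 0.0001,
                              steps: int = 20) -> torch.Tensor:
    # Optimize a copy of the text features toward the aesthetic embedding.
    z = text_features.clone().detach().requires_grad_(True)
    target = F.normalize(aesthetic_embedding, dim=-1)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # Negative cosine similarity: minimizing it pulls z toward the target direction.
        loss = -F.cosine_similarity(F.normalize(z, dim=-1), target, dim=-1).mean()
        loss.backward()
        optimizer.step()
    return z.detach()
```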

I think there should be an option to do the aesthetic optimization on the CPU before sending it back to the GPU for the image generation process. This might be useful for people with limited VRAM, so that they won't run out of memory when computing the aesthetic optimization.

miaw24 avatar Oct 20 '22 04:10 miaw24
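
A possible shape for that option, sketched below: run the optimization with the model on the CPU, then move only the result back to the GPU. The function and argument names are illustrative, not taken from the PR.

```python
import torch

def optimize_on_cpu(clip_model, optimize_fn, *tensors):
    """Run `optimize_fn` (the aesthetic optimization) on the CPU to save VRAM,
    then move the resulting conditioning back to the GPU for generation."""
    clip_model.to("cpu")
    result = optimize_fn(clip_model, *(t.cpu() for t in tensors))
    clip_model.to("cuda")  # restore the model for the rest of the pipeline
    return result.to("cuda")
```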

Is there a tutorial on how to set this up/train it?

bbecausereasonss avatar Oct 21 '22 15:10 bbecausereasonss

Have a look over here: Using Aesthetic Images Embeddings to improve Dreambooth or TI results · Discussion #3350 · AUTOMATIC1111/stable-diffusion-webui (github.com) https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/3350


bmaltais avatar Oct 21 '22 16:10 bmaltais

So is there any hope of doing this on 4GB of VRAM? My poor card has been able to handle everything (besides training) up to 576x576 so far with --medvram, VAEs, hypernetworks, upscalers, etc., but this puts me OOM after the first pass. :sweat_smile:

rabidcopy avatar Oct 21 '22 22:10 rabidcopy

It seems like "Aesthetic text for imgs" and the slerp angle are somehow off... Values between 0.001 and 0.02 seem to cause the aesthetic text to influence the embedding in a meaningful way, but 0.2 to 1.0 seem random and don't have much effect relative to each other. If I use "colorful painting", for instance: 0.0 = ignore text, 0.001 = it adds color and flowers, 0.2 to 1.0 = the image seems to lose style altogether and is neither colorful nor painterly.

TinyBeeman avatar Oct 22 '22 00:10 TinyBeeman

The DALL-E 2 paper specifies that the max angle to use is between [0.25, 0.5] (TextDiff).

MalumaDev avatar Oct 22 '22 05:10 MalumaDev
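
For reference, a minimal sketch of spherical interpolation (slerp) between a prompt embedding and an aesthetic-text embedding. With a small interpolation fraction the result stays close to the prompt, which may be why values above roughly 0.2 appear to wash the style out; this is illustrative only and may differ from the PR's exact formula.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    # t=0 returns `a` (the prompt embedding), t=1 returns `b` (the aesthetic text embedding).
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    omega = torch.acos((a_n * b_n).sum(-1).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / so).unsqueeze(-1) * a + \
           (torch.sin(t * omega) / so).unsqueeze(-1) * b
```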