krita-ai-diffusion
Regional prompting / Attention couple proof of concept
Hello,
Using the custom node cgem156-ComfyUI, I was able to build a functional regional prompting workflow for SDXL.
Here is a proof of concept in krita-ai-diffusion: using a new control layer type "Attention", splitting the prompt into lines and parsing for lines starting with ZONE (or BREAK, PROMPT, ATT, and an optional number... TBD), and adding new Paint layers in Krita, we are able to greatly influence the rendering.
Here is a step-by-step:
- Create a new Paint layer in Krita
- Draw/fill the region you want to control (any color, any opacity, we use the transparency mask directly)
- Add a new "Attention" control layer in krita-ai-diffusion interface
- Assign the new Paint layer to the control layer
- Repeat for any number of zones
- Alter the prompt:
  - Add as many lines as zones, each starting with `ZONE`, then describe the content
  - Optionally add an extra `ZONE` line that will be applied only to the image outside the defined zones
  - The first line of the prompt is copied at the beginning of all prompts
  - The last line of the prompt is copied at the end of all prompts
An example with a single zone:
```
photography of
ZONE a dog
ZONE a cat
two animals, a city street in the background
```
The second ZONE is automatically assigned the cat prompt. The prompts used for the attention couple are: "photography of a dog, two animals, a city street in the background" and "photography of a cat, two animals, a city street in the background".
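A minimal sketch of the parsing idea, assuming a helper like `split_zone_prompt` (the name and exact keyword handling are illustrative, not the actual plugin code):

```python
def split_zone_prompt(prompt: str) -> list[str]:
    """Split a multi-line prompt into one full prompt per ZONE line.

    The first non-ZONE line is prepended and the last non-ZONE line is
    appended to every zone prompt.
    """
    lines = [line.strip() for line in prompt.splitlines() if line.strip()]
    zones = [line[len("ZONE"):].strip() for line in lines if line.startswith("ZONE")]
    shared = [line for line in lines if not line.startswith("ZONE")]
    prefix = shared[0] if shared else ""
    suffix = shared[-1] if len(shared) > 1 else ""
    return [" ".join(filter(None, [prefix, zone])) + (", " + suffix if suffix else "")
            for zone in zones]

# split_zone_prompt("photography of\nZONE a dog\nZONE a cat\ntwo animals, a city street in the background")
# -> ["photography of a dog, two animals, a city street in the background",
#     "photography of a cat, two animals, a city street in the background"]
```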
Another example:
To do :
- [x] Add a control layer mode "Attention"
- [x] Get the control layer image opacity as mask
- [x] Split main prompt by attention zones
- [x] Apply cond and masks to `AttentionCouple`
- [ ] Describe resource and auto install `cgem156-ComfyUI`
- [x] Handle render methods
  - [x] generate
  - [x] refine
  - [x] inpaint
  - [x] refine_region
- [x] User switch for attention couple / region prompt finder
- [x] Load LoRA only once
- [ ] Typechecks
- [ ] Lint
If you want to check my full ComfyUI Regional Prompting workflow that inspired this PR: https://www.reddit.com/r/StableDiffusion/comments/1c7eaza/comfyui_easy_regional_prompting_workflow_3/
Note: The AttentionCouple node uses a JS trick where you have to right click then use "add input", so a dumped workflow from krita-ai-diffusion is unusable until you connect the correct mask & cond.
This development may be of interest to: #386 #387 #567
[edit]: Mentioned the PR in those issues to get some potential testers. 😅
An example for #387 :
Sorry if it's too nooby, but how do I implement this? Thanks in advance.
Improved the coherency by subtracting the upper masks from each control layer mask. This allows the order of the control layers to be better respected.
This only affects images where two control layers overlap. If you put a control layer below another one that covers it entirely, it will have an empty mask and be ignored, as intended.
It could be an option, but from my experience it works better that way; otherwise smaller zones are frequently ignored in favor of bigger covering ones.
I tried manually computing mask intersections to combine prompts, but the difference in the results is marginal, and doing the computation with nodes would be a nightmare.
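For illustration, a rough numpy sketch of the subtraction, assuming each mask is a float array in [0, 1] and the list is ordered from the topmost control layer down (names are hypothetical):

```python
import numpy as np

def subtract_upper_masks(masks: list[np.ndarray]) -> list[np.ndarray]:
    """Each lower mask loses the area already claimed by the masks above it."""
    result = []
    covered = np.zeros_like(masks[0]) if masks else None
    for mask in masks:  # topmost layer first
        visible = np.clip(mask - covered, 0.0, 1.0)
        result.append(visible)
        covered = np.clip(covered + mask, 0.0, 1.0)
    return result
```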
Hi, good to see you're still around :)
I hadn't used this node before, it looks like a good trade-off between following prompts of individual regions and still achieving some full image coherence. It felt to me like the prompt space was still limited by the combination of all region prompts: when I make them too detailed, part of the prompts are dropped quickly (more so than if I was inpainting that region exclusively).
What kind of workflow do you have in mind? My main question is, can we combine this with inpaint workflows in a synergetic way rather than creating an alternative?
By dealing with areas mainly in the prompt, we can't easily assign other control layers to specific regions. It would be great to have a UI which allows that (makes most sense for IP-Adapter, ControlNet is "area-aware" by definition but it could still be useful to compose/mask it).
Some more practical questions:
- What is the plan for 2-pass generation? Currently it doesn't work, not sure if the attention couple has to be applied to the high-res pass too, but it might...
- If there's a selection (inpaint & refine region) will that filter out areas which aren't covered by the selection automatically? That could be very useful for working on documents with fine-granular regions, since there seems to be a limit on how many of them make sense to include in one generation.
- And related to the above, if there are too many regions, can they be automatically merged? (Okay maybe this is a not-so-practical question)
Hi there!
I admit that in my own workflows this method of rendering comes somewhat before the inpaint stage: when I'm still exploring prompts and seeds. It allows quick changes at minimal cost (redraw the masks slightly, alter some of the zone prompts, try a new seed), and the resulting image is generally quite consistent with what I want, with great overall coherency because the whole image is rendered at once.
Once I'm happy with the image, I switch to the inpainting / region refining stage. Then I don't use the regional prompting anymore.
As you see, the use cases are quite separate, in my mind at least. It is difficult to use once you are doing refining passes, because it adds the complexity of prompting and defining masks when what you are doing is quick alterations to different zones.
If we changed the UI to have a separate prompt per layer, and automatically calculated the attention from the affected repaint region, it could maybe also be useful in the refining stage; but the complexity of the code would be an order of magnitude higher, for a somewhat unclear gain compared to multiple inpaintings.
What kind of workflow do you have in mind? My main question is, can we combine this with inpaint workflows in a synergetic way rather than creating an alternative?
Weeeeell now that you say it like that, I was kinda creating an alternative to inpainting for quick iterations. 😅
when I make them too detailed, part of the prompts are dropped quickly (more so than if I was inpainting that region exclusively)
I did not feel this effect, but it may be subjective: by splitting the prompt into many smaller zone prompts, the effect of each feels stronger of course. But I'm always a very cautious prompter, trying to get an effect with the minimum number of tokens, so I may not be the best judge of prompt complexity.
What is the plan for 2-pass generation? Currently it doesn't work, not sure if the attention couple has to be applied to the high-res pass too, but it might
I successfully added an upscale pass to my ComfyUI workflow (Danamir Regional Prompting v12.json), but I had to resize each mask individually to keep the zone attention on the correct upscaled zone. It should be the same here; scaling the masks should suffice in theory. When I tried without attention on the second pass, the results were definitely lacking in detail, so there is a real gain in applying it to both the first and second pass.
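In other words, only the masks need to follow the upscale. A small Pillow sketch of the idea, assuming the masks are plain PIL images rather than the plugin's own image type:

```python
from PIL import Image

def scale_masks_for_hires(masks: list[Image.Image], scale: float) -> list[Image.Image]:
    """Resize every region mask to the hi-res pass resolution so the attention
    couple keeps targeting the same (now upscaled) zones."""
    scaled = []
    for mask in masks:
        w, h = mask.size
        scaled.append(mask.resize((round(w * scale), round(h * scale)), Image.NEAREST))
    return scaled
```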
If there's a selection (inpaint & refine region) will that filter out areas which aren't covered by the selection automatically? That could be very useful for working on documents with fine-granular regions, since there seems to be a limit on how many of them make sense to include in one generation.
For now I was planning to leave the selection rendering alone, because of the single prompt UI. As I was saying above, the interface and workflows would have to be quite complex to keep those same regions when inpainting.
And related to the above, if there are too many regions, can they be automatically merged?
I guess we could imagine a way to merge regions, but what would be the metric to decide? I would rather have users merge/remove some of their regions if they feel there are too many.
Once I'm happy with the image, I switch to the inpainting / region refining stage. Then I don't use the regional prompting anymore.
That makes sense. What I'm interested in is not so much keeping the attention coupling up also in the later stages (although it might be nice here and there), but how they require a similar setup: different prompts are attached to different regions.
The current setup for refinement is adequate if you start with one region, tweak it until it's perfect, and then move to the next. But I think that's not commonly how people work, rather you tend to have a back and forth between subjects in the image, revisiting previous prompt setups. And it's quite tedious to redo the setup every time.
Live painting has a similar issue, you end up with short prompts that relate to one part of the image. You do a very basic sketch, then move on to another part that requires a different prompt, and so on. Once you're somewhat happy with the composition you go back and add some detail to all the parts, then do another pass for simple shading hints, etc.
Once you're convinced SD has more or less understood what you're trying to do, you might want to run it through high quality generation, find a good seed. It would be nice if those prompts you setup for all those areas were available now to use with the attention coupling!
After you have a coherent image, you might want to refine details at low strength, focusing on specific regions. Attention coupling may no longer be needed, but those prompts you set up are still useful.
Finding good compositions is usually best at not-too-high resolution, to iterate fast. But at some point you want to increase rendering resolution. The current tiled upscaling is... very rudimentary. It doesn't use a prompt, but it's also almost impossible to put a good one, since it has to describe each tile. If you had prompts, IP-Adapter, CN, etc. attached to regions, you could automatically select the ones appropriate for a tile, and probably get much better results.
So what I'm looking for with this is mainly an infrastructure that allows us to reuse the regional prompts (and other setup) throughout the whole lifecycle. You may not always use all of those steps and features, ... but if you didn't have to waste time redoing the whole setup you might be tempted ;)
After this big motivational speech, some thoughts on what it might look like (a rough sketch follows the list):
- Attach prompts (and CN, IPA) to layer groups (if using coupling: the group defines the mask region)
- Nested groups inherit the prompt (etc.) from their parent (if using coupling: used as background prompt)
- There is one "root" group
- If you select a child group and generate, only its region (+ some context) will be generated
- If you select the root group, the entire image is generated (using coupling for child regions)
- If you make a manual selection, that region is generated (maybe using coupling for affected regions only)
- It's hierarchical, you can nest as much as you like! But always the area for generation is the current selection, and the target for coupling only its direct children. This gives you a merging strategy.
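To make the idea concrete, a tiny illustrative data-structure sketch (type and field names are invented for this example, not the plugin's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class RegionGroup:
    """A layer group carrying its own prompt; children inherit the parent prompt."""
    name: str
    prompt: str
    children: list["RegionGroup"] = field(default_factory=list)

    def effective_prompt(self, inherited: str = "") -> str:
        # nested groups combine their prompt with everything inherited from parents
        return ", ".join(p for p in (inherited, self.prompt) if p)

    def coupling_targets(self) -> list["RegionGroup"]:
        # when generating this group, only its direct children are coupled;
        # deeper descendants are handled when their own parent is generated
        return self.children
```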
I believe this approach fits how layers are used for painting. The big challenge is that diffusion is inherently layer-unaware. Any new generated image by default sits globally on top of the stack and overrides everything. Layer diffusion exists, but so far I find it somewhat clunky and incomplete. It may fit in there somehow, but I don't know how yet. So my initial approach to solve that would not rely on it, but rather use each group's mask to split new results. If the groups have a transparency mask, that would be as easy as adding results to all groups.
You'd have to manually edit them if the new generated content doesn't fit the original bounds, but it would not be hard. Or in a lot of stages, probably be okay with the mask not fitting the subject, just the general area.
Now this all sounds complex, but it's opt-in, you don't have to make your document more complicated than you need to. If it's fine with two groups, use that. If it doesn't need any, everything works as before.
Well, from a user perspective anyway. Implementation is a different matter.
Oooh, it's an interesting way of doing it for the future, certainly. Although it's an entirely different beast to tackle than my proposition!
My current quick-and-dirty implementation has the benefit of running in a single sampling pass, hence its lightness. In my ComfyUI workflow I'm even limited to a single LoRA, because adding one per region would alter the model multiple times, and that's not how the AttentionCouple custom node works. Your idea would require developing all the logic of this node, and much more, to be able to apply each CN, mask, LoRA, IPAdapter and such to each region. I suppose it would also necessitate many sampling passes, depending on which control layers and regions are active.
As much as I would love to see this implemented, I'm not sure how much I could contribute to such a development. I was trying a few hours ago to alter the control layer's generate button to change its behavior to create a new paint layer in Krita with a nice name, filling the current selection with a flat random color at 50% opacity... and failed miserably. 😬 I can't wrap my head around how the jobs are called when you don't really want a job to be launched on the server.
I'm not suggesting a replacement for what you're doing, I think single-pass with multiple prompts is useful (even with the lora limitation). But I would like it to use a shared UI/setup with other potential methods. Not all of this has to be implemented immediately.
I will try out some ideas for UI. Either way the code you write for backend workflow generation will be very useful. Maybe don't focus on UI too much.
I can't wrap my head around how the jobs are called when you don't really want a job to be launched on the server.
Hm, I don't think you want anything to do with jobs in that case, they're for asynchronous background tasks. What you're trying to do can probably be done directly on button press?
Maybe don't focus on UI too much.
Not a problem! I tried to get Krita to add a text input inside each control layer and quickly left it aside. The mix of Qt and the Krita package is not my cup of tea.
Hm, I don't think you want anything to do with jobs in that case
Ok, I was wondering about this. Everything should be able to be done in Krita, but I didn't know if the Job finishing callback part was mandatory for the UI update.
If you find a way to add LoRA, embeddings, CN and IPA into that, it would be absolutely perfect.
This PR is really cool and exciting. Great work :)
Have y'all seen https://www.youtube.com/watch?v=4jq6VQHyXjg ? (Latent Vision/Mateo new IPAdapter nodes to simplify masked attention).
Have y'all seen https://www.youtube.com/watch?v=4jq6VQHyXjg ? (Latent Vision/Mateo new IPAdapter nodes to simplify masked attention).
Hadn't seen it. It looks like it's not a new method though, just conditioning masks and IPA attention masks wrapped into one node for convenience.
@Danamir I guess you must have tried conditioning masks too. I was never very happy with them and the attention couple seems to work a bit better, but I really haven't done systematic tests. How big is the difference? Also, did you try attention couple + (masked) IPA? It would be good to know that they don't somehow interfere with each other.
tried conditioning masks too, I was never very happy with them and the attention couple seems to work a bit better
I totally agree, I started this PR development because I was happy with how Attention Couple was working compared to other methods.
Also, did you try attention couple + (masked) IPA
I tried the workflow available in the posted video on its own (with varying success), but I didn't think of combining the two. I'll check if this is working.
Got some random RuntimeErrors from time to time, which could be an issue on my system. But when it's working, it's working pretty well. I got the normal IPAdapter and FaceID working. Just be sure to send the model out from Attention Couple and into IPAdapter Unified Loader, and not the other way around.
I created a branch: https://github.com/Acly/krita-ai-diffusion/tree/regions
It has some UI (draft) for regions which are attached to Krita group layers. Basic setup looks like this:
It can be streamlined more and is missing a lot of things, but it should fill out the new `Conditioning` structure with regions. Maybe you can rebase your PR on top of that and give it a try. The `Region.mask` should load directly with `load_mask`.
Very neat UI. I tried it and have a few notes:
Would it be possible to have an option to add two more prompts:
- A "remaining mask" (or "leftover mask") prompt affecting only the region not masked.
- Instead of the background prompt, separate "starting" and "ending" prompts.
From my experience those actions can be done manually but are a great enhancement in day-to-day usage. Leaving it as an option would prevent UI clutter. Otherwise I could still use prompt keywords to detect those sub-prompts, but that kind of defeats the new feature.
Finally, my setup of Krita crashes when I create an empty "Group Layer" manually; it only works when creating a group on an existing layer.
```
-------------------
Error occurred on Monday, April 29, 2024 at 01:57:13.
krita.exe caused a Stack Overflow at location 00007FF910D5B8C7 in module python310.dll.
```
Would it be possible to have an option to add two more prompts:
Maybe the way the "background" is implemented right now doesn't make too much sense; it works in a hierarchical way, whereas layers are first and foremost a stack.
I think if you want a prompt that covers everything that isn't covered by a group/layer on top, you can just make another group at the bottom and fill. Maybe it needs to be more streamlined.
Technically the style prompt can already be used as the before/after prompts with the scheme "before {prompt} after". The before probably often happens to be style related, but I can see why you'd want to add some shared context after each region prompt.
Finally, my setup of Krita crashes when I create an empty "Group Layer" manually
Right, I fixed some stuff, probably there is some more, syncing regions and layers is a bit more fragile than I'd like...
I think if you want a prompt that covers everything that isn't covered by a group/layer on top, you can just make another group at the bottom and fill. Maybe it needs to be more streamlined.
That is what I meant by "those actions can be done manually". In practice, I found that I would have to use a "background" layer almost every time, so having it handled directly by the UI simply removes a redundant step.
The before probably often happens to be style related
True, it is mainly a style related feature. But it can also be useful for LoRA activation keywords: those sometimes benefit from being placed at the front of the prompt. And it can also be relevant for some composition keywords that could be ignored if placed at the end of a long prompt (e.g. full body, portrait, mid shot, cowboy shot...).
Then again, I could simply detect the {prompt} keyword and do a simple replace. It would have to be done before applying the style to keep everything intact. 😅
I tweaked the UI and how the "root" region works a bit. The intended use for the bottom prompt is now to be appended to all regions. I think optionally supporting {prompt} there would be okay too. So if you have
- region = `region prompt`
- common (root) = `common begin {prompt} common end`
- style = `style begin {prompt} style end`
the complete prompt for the region would then be `style begin common begin region prompt common end style end`
Maybe also use the "common" prompt as the background for now; I'm still undecided on a special background text field. It's a lot of text fields. It depends on how easy it is to add regions in the end; I will try out a button for that next.
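A minimal sketch of that composition, assuming `{prompt}` is a plain placeholder substitution (the helper name is made up for illustration):

```python
def compose_prompt(region: str, common: str, style: str) -> str:
    """Nest the region prompt into the common (root) prompt, then into the style prompt."""
    merged = common.replace("{prompt}", region) if "{prompt}" in common else f"{common} {region}"
    return style.replace("{prompt}", merged) if "{prompt}" in style else f"{style} {merged}"

# compose_prompt("region prompt",
#                "common begin {prompt} common end",
#                "style begin {prompt} style end")
# -> "style begin common begin region prompt common end style end"
```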
I like how the regions are handled. We can clearly see which control layer affects which region. Does this mean that we won't expect a control layer to span multiple regions? For the region prompting part it's perfect in any case.
I still have a weird bug if I use "Add Group Layer" directly: it creates a group and a layer at the same time, but the group cannot be unfolded and the layer cannot be edited. "Add Paint Layer" + "Group Layer" works as intended.
I also tried a nested group; the dedicated prompt of the nested group disappears when I click on another group, but the prompt is still generated in the workflow. I don't know if this is the expected behavior.
Questions:
- Is there a way to get the merged / subtracted / remaining background masks directly in Krita, using Pillow or otherwise, or should I compute them in the workflow?
- Do we still add an "Attention" control layer for each region we want to be affected, or do we get rid of this control layer type and assign an attention couple prompt to each region by default?
  - The first method offers finer control over how each region is prompted, but will require merging any sub-group masks, I suppose
  - The second method seems easier to compute (one attention couple = one mask) but could potentially leave empty prompts (which may not be a problem if the common prompt is filled)
- For the remaining mask, would you rather have to create a group layer after all the others and then manually fill the paint layer entirely, or automatically assume that the last empty group layer is de facto the background one?
  - The first choice is the more logical one, but it can be weird having to fill a full layer only for part of it to be effective.
  - The second choice is faster, but maybe not intuitive.
  - In any case the remaining mask will have to be computed, either in Krita or in the workflow.
I'll try the rebase and see what I can get working.
I still have a weird bug if I use "Add Group Layer" directly: it creates a group and a layer at the same time, but the group cannot be unfolded and the layer cannot be edited.
Strange, when I use "Add Group Layer" it just adds an empty group, no paint layer.
Some answers and updates:
- Let's keep it simple: each group is a region, each region has 1 mask and should use attention couple. No special mask control layer.
- Control layers added to a region should be applied to the region area only. Control layers added to the common/root region apply to the whole image.
- If a region has no prompt or control layers, I filter it out now. This allows groups to still be used for other purposes.
- Changed it so only top-level groups are considered. The order is bottom to top.
- Will add nesting later, my idea would be that it can be used to control complexity: nested groups are considered if you inpaint the parent.
- I've implemented that for each mask, the overlapping parts of layers above it are subtracted. Not entirely sure it's correct yet.
- I also automatically add a background region if all the region masks added up don't cover the full image. But I may still change this to something more explicit. A manual background region needs to be filled.
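A numpy sketch of that automatic background region, assuming masks are float arrays in [0, 1] (illustrative only, not the actual implementation):

```python
import numpy as np

def background_mask(region_masks, shape):
    """Return the complement of the union of all region masks,
    or None if the regions already cover the whole image."""
    covered = np.zeros(shape, dtype=np.float32)
    for mask in region_masks:
        covered = np.clip(covered + mask, 0.0, 1.0)
    remaining = 1.0 - covered
    return remaining if remaining.max() > 0.01 else None
```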
Let's keep it simple: each group is a region, each region has 1 mask and should use attention couple. No special mask control layer.
Nice, this will be easier to code. 😅
I'm wondering if there is a use case where the "group layer + mask + prompt" would be useful for a control layer, but incompatible with attention couple. I suppose there could be a checkbox in the "custom" interface to disable the attention couple entirely if needed, or maybe in the menu handling the seed and such.
Control layers added to a region should be applied to the region area only. Control layers added to the common/root region apply to the whole image.
Perfect. I did not think to try adding a control layer to the common prompt, as I thought it was only a text input.
I've implemented that for each mask, the overlapping parts of layers above it are subtracted. Not entirely sure it's correct yet.
I'll inspect the generated workflows to see if the computed mask is okay. The attention couple node can be peculiar about which masks it accepts: there can be no gap, and no overlap with total strength > 1.0. So there may be a need for a few mask manipulations in the workflow to ensure the combined masks are complete.
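A quick numpy sanity check of those constraints could look like this (a sketch; the node's actual tolerance may differ):

```python
import numpy as np

def check_attention_masks(masks, eps=1e-3):
    """AttentionCouple wants every pixel covered, with summed strength <= 1.0."""
    total = np.sum(masks, axis=0)
    if total.min() < eps:
        raise ValueError("gap: some pixels are not covered by any region mask")
    if total.max() > 1.0 + eps:
        raise ValueError("overlap: summed mask strength exceeds 1.0")
```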
There is a nasty bug with the way `accumulated_mask` is initialized in `RegionTree.to_api()`. The way it is currently done:
```python
for region in reversed(api_regions):
    if accumulated_mask is None:
        accumulated_mask = region.mask
```
creates a reference to the mask, so if there is only one region, not fully covering the canvas, the next portion of the code:
```python
if not fully_covered:
    accumulated_mask.invert()
```
actually also inverts the first region's mask!
I tried `accumulated_mask = copy(region.mask)`, but it creates a shallow copy, which is not sufficient. Finally I simply added the mask to itself; it does nothing, but it really creates a new image, preventing the later aliasing:
```python
if accumulated_mask is None:
    accumulated_mask = Image.mask_add(region.mask, region.mask)
```
There is certainly a nicer way to achieve this, but it's working. 😅
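A possibly nicer variant, assuming the mask image supports deep copying (untested, just a hedged suggestion):

```python
import copy

for region in reversed(api_regions):
    if accumulated_mask is None:
        # take an independent copy so that inverting accumulated_mask later
        # does not also invert the first region's own mask
        accumulated_mask = copy.deepcopy(region.mask)
```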
Here, pushed a working version with the new Regions !
Still got a strange issue where I need to convert the mask to an image and then back to a mask, otherwise ComfyUI complains that it's in the `[<width>, <height>]` format instead of the expected `[1, <width>, <height>]`.
It's working pretty well. It would be even better if the thumbnail showed the full image, to see each region's position relative to the others. Krita has the same problem in the layers display. For example, the blue region is on the left side and the green one on the right, but it's impossible to see at a glance in the layers:
Updated the code to handle the calls to `scale_refine_and_decode` in `generate`. This allows hires fix to work, correctly scaling the masks and applying the attention couple a second time.
This also fixes the previous mask load issue, as the resize uses an image conversion.
Sadly, the attention couple node has an issue with aspect ratios other than 1:1; most of the time it fails the upscale rendering if the ratio is not square.
[edit]: Deprecated comment, see below. It is more useful to only keep the most prevalent region prompt.
Using your method of comparing masks by subtraction and then taking an average pixel value, I added a way to order the region prompts for the two generation methods that don't support attention couple: refine region and inpaint.
This allows using only the region prompts under the current selection, ordering the prompts by mask coverage (a small sketch follows the examples below):
The drawn regions
The groups and regions
Resulting prompt : "blue red green"
Resulting prompt : "green red"
Resulting prompt : "blue"
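A numpy sketch of that coverage ordering (illustrative; function and parameter names are made up):

```python
import numpy as np

def order_regions_by_coverage(regions, selection):
    """Order region prompts by how much of the selection their mask covers.

    regions: list of (prompt, mask) pairs; masks and selection are floats in [0, 1].
    """
    def coverage(mask):
        return float(np.mean(mask * selection))

    ranked = sorted(regions, key=lambda r: coverage(r[1]), reverse=True)
    return [prompt for prompt, mask in ranked if coverage(mask) > 0.0]
```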
Note concerning the LoRA loading from prompt:
- Bug: if the LoRA is defined in the common prompt, it is loaded multiple times
- ~~Feature: when using the region prompt finder in refine/inpaint, only the affected regions' LoRA(s) are loaded, this could be useful~~ Not sure if this is really needed. In any case, at the moment the LoRAs are extracted they are removed from the prompts, so we no longer have the information to skip the unwanted regions' LoRAs.
Finally, keeping only the regional prompt with the most selection overlap (still combined with the common prompt) seems better.
When trying to select multiple regions, too many tokens are included, which increases the risk of polluting the inpaint. All the more so because the selection feathering can extend the selected region further than necessary.