ControlNet
How to use ControlNet with semantic (non-pixel-aligned) concepts?
I have a different ControlNet case, and I'd love feedback on how to get it working
I am trying to train a ControlNet from SignWriting (lexical writing of sign language) and illustrations. (https://github.com/sign-language-processing/signwriting-illustration) Unlike other ControlNet examples where the control is pixel-aligned, here, my image represents semantically what needs to be generated, similar to the text.
Every video in my corpus is annotated with SignWriting and an illustration. All of the information about the sign is indeed represented in the SignWriting.
I create the prompts using gpt-4-vision to include additional information about the illustration, but no information about the sign itself (for example, the signer's gender, but not the direction of the hands).
What my dataset looks like:
All images are then created at 512x512, for example:
An illustration of a woman with short hair, with orange arrows. The background is white and there is a watermark text '@signecriture.org'.
| control | illustration |
|---|---|
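In case it helps, this is roughly how I lay out each training example, following the format of the ControlNet tutorial dataset (the file names here are just illustrative):

```python
# one training record in the ControlNet-tutorial prompt.json style (illustrative file names):
# "source" is the rendered SignWriting control image, "target" is the matching illustration
import json

record = {
    "source": "source/woman_hello.png",   # 512x512 SignWriting rendering
    "target": "target/woman_hello.png",   # 512x512 illustration
    "prompt": "An illustration of a woman with short hair, with orange arrows. "
              "The background is white and there is a watermark text '@signecriture.org'.",
}
with open("prompt.json", "a") as f:
    f.write(json.dumps(record) + "\n")
```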
Results
Using the above example control with the prompt "An illustration of a man with short hair, with orange arrows. The background is white.", I tried generating 5 different illustrations. My expectation is that the appearance of the person might change somewhat, but that the hand positions and arrow directions should be consistent with the SignWriting representation.
Training with sd_locked=True, only_mid_control=False yields a system that cannot adequately illustrate:
I played with ddim_steps, scale, and strength, but the results are consistently bad.
Training with sd_locked=False, only_mid_control=False, the system can now illustrate, but the results are not consistent:
Playing with ddim_steps, scale and strength does not change the results much
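For reference, this is roughly how I generate the samples, adapted from the repo's gradio demos (the checkpoint paths are my local files, and the exact scale/strength values vary between runs):

```python
import cv2
import einops
import numpy as np
import torch
from pytorch_lightning import seed_everything
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler

seed_everything(42)

# load my trained model (paths are my local files)
model = create_model('./models/cldm_v21.yaml').cuda()
model.load_state_dict(load_state_dict('./checkpoints/signwriting.ckpt', location='cuda'))
ddim_sampler = DDIMSampler(model)

num_samples, H, W = 5, 512, 512
prompt = "An illustration of a man with short hair, with orange arrows. The background is white."

# the SignWriting control image, HWC uint8, scaled to [0, 1]
control = cv2.cvtColor(cv2.imread('control.png'), cv2.COLOR_BGR2RGB)
control = torch.from_numpy(control).float().cuda() / 255.0
control = einops.repeat(control, 'h w c -> b c h w', b=num_samples).clone()

cond = {"c_concat": [control],
        "c_crossattn": [model.get_learned_conditioning([prompt] * num_samples)]}
un_cond = {"c_concat": [control],
           "c_crossattn": [model.get_learned_conditioning([""] * num_samples)]}

model.control_scales = [1.0] * 13  # ControlNet "strength"
samples, _ = ddim_sampler.sample(20, num_samples, (4, H // 8, W // 8), cond,  # ddim_steps=20
                                 verbose=False, eta=0.0,
                                 unconditional_guidance_scale=9.0,            # "scale"
                                 unconditional_conditioning=un_cond)

images = model.decode_first_stage(samples)
images = (einops.rearrange(images, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)
```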
Is there anything I can do to get it working?
Wow, that's a very interesting use case and dataset! Here are a few thoughts:
- Start with a simpler ControlNet like Canny edge detection and try to make it work. It's much simpler to analyze the results because it's much clearer what the outcome should be, and you can compare against the official Canny model. This helps to get a feeling for batch and epoch sizes, other parameters, and common ControlNet problems. Key question: did you reach convergence?
- The general concept is related to OpenPose: "generate an image of a person guided by a few keypoints". You may find some answers by analyzing the OpenPose training. One thing I wonder, for example: how much information does the prompt provide during training if it's basically the same all the time, and which prompts did they use for OpenPose? Another thing I noticed: the "OpenPose keypoints projected onto an image" have a spatial relationship with the generated image, whereas the SignWriting is more like a text prompt. Maybe you should look into AltDiffusion (= using a different text-prompt language for SD) or fine-tuning SD?
- Try to generate images without ControlNet to get a baseline for SD's ability to generate illustrations (see your strength=0 example; the style won't get better than that). I would assume it's bad at it. There are some custom LoRAs just for generating illustrations and pictographic humans. This helps to differentiate what is SD's fault and what is the ControlNet's fault.
- SD is notoriously bad at generating text even though it had plenty of training data. I would assume it's even worse at generating SignWriting "text". An interesting sub-problem, for example, could be: can we train a ControlNet on Latin letters?
- If you just want to get it to work, you may want to try a totally different approach and split the problem into sub-steps. If SD is bad at generating character and arrow illustrations, then: step 1: use SignWriting to generate a pose only, step 2: use a LoRA to get the character illustration style, step 3: compose the arrow illustrations onto the images with classic image processing, using a set of SVG assets (see the compositing sketch after this list). Your arrows always look the same, so why generate them?
- please provide your training parameters and dataset info (batch size, epoch size, number of images etc.)
- btw: I don't think there is a point in varying the strength or using high step counts right now. weight=1 and steps=20 should be good enough with the UniPC sampler.
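For step 3, a minimal sketch of what I mean by compositing, assuming the SVG arrows are rasterized to transparent PNGs and you know where each arrow belongs from the SignWriting layout (file names and coordinates are made up):

```python
# paste a pre-rendered arrow onto a generated character illustration
from PIL import Image

illustration = Image.open("generated_character.png").convert("RGBA")
arrow = Image.open("arrows/curved_right.png").convert("RGBA")  # hypothetical arrow asset

arrow = arrow.rotate(30, expand=True)                # orient the arrow as the sign requires
illustration.alpha_composite(arrow, dest=(180, 96))  # place it at the relevant hand position
illustration.convert("RGB").save("composited.png")
```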
Thank you for the very detailed response.
General Information
I used tutorial_train_sd21.py. My network often reproduces the exact training images at test time (depending on the initialization), so I would say yes, it has converged.
As for AltDiffusion, while SignWriting is text, it is represented in 2D. I never managed to make a text encoder that works well with it.
Further experiments
Generating illustrations with native SD as a starting point sounds like a promising direction. I would love help on this -
Following https://www.reddit.com/r/StableDiffusion/comments/x8vxui/i_discovered_an_easy_way_to_force_a_line_drawing/, I played with the x_T for the model with sd_locked=True:
| init | code | image |
|---|---|---|
| default noise | `torch.randn(shape)` | |
| completely white | `torch.full(shape, fill_value=1.0)` | |
| noise around white | `torch.randn(shape) + 1` | |
Did I misunderstand the reddit post? I did not change the strength they are referring to, because I'm not sure what it is (it is not the ControlNet strength; I thought it might be the temperature, but it does not seem to be).
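Concretely, I inject these initializations through the x_T keyword of DDIMSampler.sample (continuing the sampling sketch from my first post; cond, un_cond, and ddim_sampler are the same objects as there):

```python
# "noise around white" row from the table above, passed as the starting latent x_T
shape = (num_samples, 4, H // 8, W // 8)       # batch included, latent resolution
x_T = torch.randn(shape, device="cuda") + 1.0  # noise shifted towards "white-ish" latents
samples, _ = ddim_sampler.sample(20, num_samples, (4, H // 8, W // 8), cond,
                                 verbose=False, eta=0.0, x_T=x_T,
                                 unconditional_guidance_scale=9.0,
                                 unconditional_conditioning=un_cond)
```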
Comparison to OpenPose
It is a good idea to compare my network to the OpenPose training run. Here is the comparison, along with a proposal for a new training run (a sketch of how the proposal would map onto the tutorial training script follows the table):
| Feature | ControlNet OpenPose | ControlNet SignWriting | ControlNet SignWriting (proposal) |
|---|---|---|---|
| Data size | 200K | 800 | 2.7K (waiting for GPT-4 quota) |
| Base model | Stable Diffusion 1.5 | Stable Diffusion 2.1 | Stable Diffusion 1.5 + LORA |
| GPU | NVIDIA A100 80GB | NVIDIA A100 80GB | NVIDIA A100 80GB |
| GPU-hours | 300 | 72 | 300 |
| Batch size | 32 | 4 | 32 |
| Learning rate | 1e-5 | 1e-5 | 1e-5 |
| Prompts | image alts | auto-generated | auto-generated |
| EMA Weights | Yes | No (use ema as tutorial) | No |
| sd_locked | No | Yes | No |
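For concreteness, here is roughly how the proposed run would map onto the tutorial training script (the SD 1.5 + LoRA part is not shown, and SignWritingDataset stands in for my own dataset class):

```python
# proposed training run, tutorial_train.py style (paths and the dataset class are mine)
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from cldm.model import create_model, load_state_dict
from cldm.logger import ImageLogger

batch_size = 32
learning_rate = 1e-5
sd_locked = False          # per the proposal column above
only_mid_control = False

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_ini.ckpt', location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control

dataset = SignWritingDataset()   # my dataset class, tutorial_dataset.py style
dataloader = DataLoader(dataset, num_workers=4, batch_size=batch_size, shuffle=True)
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[ImageLogger(batch_frequency=300)])
trainer.fit(model, dataloader)
```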
Any suggestions on things to change in the proposal? (Except for dataset size, this is all I have at the moment)
I think the reddit post uses img2img, and the strength is referring to the denoising strength. While it is interesting that it works, I would recommend a LoRA to get a consistent style. Look here: https://civitai.com/search/models?baseModel=SD%201.5&modelType=LORA&sortBy=models_v5%3Ametrics.weightedRating%3Adesc&query=illustration (just an example: https://civitai.com/models/124933/japanesestyleminimalistlineillustrations, but you may find even better ones). You should then be able to hook it up with a pose ControlNet to get the character in the pose you want.
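Roughly, in diffusers, that hookup would look like this (the model IDs are the usual public ones; the LoRA file is whichever illustration LoRA you end up picking):

```python
# SD 1.5 + illustration LoRA + OpenPose ControlNet
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# hypothetical local LoRA file; use whatever illustration LoRA you choose
pipe.load_lora_weights("./loras", weight_name="illustration_style.safetensors")

pose = load_image("pose_from_signwriting.png")  # OpenPose skeleton derived from the SignWriting
image = pipe(
    "An illustration of a man with short hair. The background is white.",
    image=pose,
    num_inference_steps=20,
).images[0]
image.save("illustration.png")
```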
Thanks @geroldmeisinger :) I updated my plan above to try using 1.5 + this illustration LoRA. I am still interested in trying that reddit trick, if nothing else then just to learn more.
Could you possibly point me anywhere in the code where there is this img2img conditioning in the ddim sampler, where I could hack in a default white image? (minimally for inference, but ideally also for training)
You don't have to use 1.5 specifically, it just has more LoRAs. img2img is a pipeline; I don't know where you'd find it in the code. It also depends on which framework you are using (in diffusers there is a specific pipeline for img2img). In A1111 there is a tab for img2img, which might be the easiest way if you "just want to try that reddit trick".
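If you go the diffusers route, the img2img pipeline with a white init image would look roughly like this (the strength argument here is the denoising strength the reddit post talks about):

```python
# img2img from a plain white init image; "strength" is the denoising strength
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.new("RGB", (512, 512), "white")
image = pipe(
    prompt="An illustration of a man with short hair, with orange arrows. The background is white.",
    image=init_image,
    strength=0.9,             # lower values keep more of the white init
    num_inference_steps=20,
    guidance_scale=7.5,
).images[0]
image.save("white_init.png")
```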
@AmitMY A couple of things I notice that could cause the issue:
- Your dataset size is small. Based on the ControlNet paper, the smallest dataset they trained on was 50k.
- I am not sure ControlNet is really meant for non-spatial conditioning; most of the controls from the community or the paper are based on spatial conditioning. I wonder whether something in the architecture itself prevents non-spatial conditioning, e.g. whether adding the control UNet decoder outputs to the SD UNet decoder only allows spatial conditioning to be injected.
You could probably try IP-Adapter with CLIP embeddings of the SignWriting to inject that information into the UNet.
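A rough sketch of that idea with diffusers, assuming a recent version with IP-Adapter support (the SD 1.5 base and the public IP-Adapter weights here are just examples):

```python
# inject the SignWriting image via IP-Adapter (CLIP image embeddings) instead of a ControlNet
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.8)

signwriting = load_image("signwriting_control.png")   # the SignWriting image
image = pipe(
    prompt="An illustration of a man with short hair, with orange arrows. The background is white.",
    ip_adapter_image=signwriting,
    num_inference_steps=20,
).images[0]
image.save("ip_adapter_sample.png")
```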