Got weird results, not sure if I missed a step?
Hey @rinongal thank you so much for this amazing repo.
I trained for over 10K steps, I believe, with around 7 images (trained on my face), using this colab.
I then used those .pt files with the SD version right in the colab, and a weird thing happens: when I mention * in my prompts, I get results that look identical to the photos in style, but it does still try to... draw the objects.
For example :
Prompt was portrait of joe biden with long hair and glasses eating a burger, detailed painting by da vinci
and
portrait of * with long hair and glasses eating a burger, detailed painting by da vinci
So SD added the glasses and the eating pose, but completely disregarded the "detailed painting", the "da vinci", and the style.
What could be causing this? Any idea? 🙏
Hey!
The most likely candidate is just that our SD version isn't officially released yet because it's not behaving well under new prompts :) It's placing too much weight on the new embedding, and too little on other words. We're still trying to work that out, but it wasn't as simple a port from LDM as we hoped. If this is the issue, you can try to work around it by repeating the parts of the prompt that it ignores, for example by adding "In the style of da vinci" again at the end of the prompt.
With that said, if you want to send me your images, I'll try training a model and seeing if I can get it to behave better.
Thank you! I'll try putting more weight on the later keywords! I don't think my images are anything special or important to test with; I just took a few snapshots of myself and cropped them to 512x512.
One thing I found that helps and/or fixes this scenario is adding periods to your prompts, not commas like in the original SD repo. This may or may not be a bug.
So this: portrait of * with long hair and glasses eating a burger, detailed painting by da vinci
Should become this: portrait of * with long hair and glasses eating a burger. detailed painting by da vinci.
If you trained on one token, you could possibly add weight by doing something like portrait of * * ...rest
as well, but you'll get further away from the rest of your prompt
If you're using the web UI (i.e. this repo: https://github.com/hlky/stable-diffusion-webui ), you can specify weight to certain tokens as such:
A photo of *:100 smiling.
I frequently have to do this with the finetuned object, sometimes using astronomical values like 1000+. This can greatly improve likeness. You may also need to adjust classifier guidance and denoise strength. All of these parameters do impact each other, and changing one often means needing to re-calibrate the rest.
Anyhow, you can try applying strength to the part of the prompt that SD is ignoring. Something like this:
portrait of * with long hair and glasses eating a burger, detailed painting:10 by da vinci:10
If you're using the web UI
I'm one of the maintainers in charge of the frontend part but TBH I haven't yet added my own checkpoints to the webui! Will do that tomorrow
I will def try this! Thank you
I've found limited success in "diluting" the new token by making the prompt more vague - for example, "a painting of *" results in pretty much the same image as just "*" on its own, but "a painting of a man who looks exactly like *" does (sometimes) work in successfully applying a different style. Adding weights to the tokens as others have described also works, although it requires constant tweaking.
I don't know if it would be technically possible to test for style transfer during the training/validation phase; for example, on top of the 'preset' prompts that are used on the photos in the dataset, you would have a separate list of prompts like "A painting of *" that would be used to verify that an image generated with that prompt also scores high on the 'painting' concept. In the DreamBooth paper, they describe that they combated overfitting (which I guess is causing these issues) by also training 'negatively' - something I've tried to crudely replicate by including prompts without the "*" in the list of predefined ones, but I don't think this would actually do anything, since the mechanisms behind DreamBooth and Textual Inversion are very different.
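As a rough illustration of that validation idea, here is a minimal, hedged sketch that scores an already-generated image against a plain style caption using the standard CLIP model from Hugging Face; the file path and captions are made up for illustration, and this is not part of the repo's training loop:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    # Image generated with the validation prompt "A painting of *" (hypothetical path).
    image = Image.open("samples/a_painting_of_star.png")

    # Score the image against captions with and without the "painting" concept.
    texts = ["a painting", "a photograph"]
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)

    # If the "painting" probability stays low, the embedding is likely overriding the style.
    print(dict(zip(texts, probs[0].tolist())))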
If you guys have a single instance of successfully finetuning a photo likeness of a human being into SD with this code, please share; I've yet to see that, and I'm almost sure that this code is not meant to "inject" your own face into the SD model as people might think.
If you guys have a single instance of successfully finetuning a photo likeness of a human being into SD with this code, please share
I won't be sharing my model at this time, but I can tell you that this method is indeed capable of pulling off a convincing headswap under the right conditions:
- With photorealistic subjects (people), I have had better results when providing the model ~10 images and training longer than suggested (25-40k iterations). This could be a fluke, I'm sure the authors of the research paper know what they're talking about when they say 5 is the optimal number of images - but I'm not convinced it's always 5.
- txt2img often produces mediocre and "samey" results with finetuned checkpoints. Try img2img instead. You'll get more variety in terms of facial expression and surprisingly higher fidelity in the face itself. Using photos as your img2img input works better than simple drawings or other kinds of illustrations. Denoise strength should be between 0.4 and 0.75, depending on how large the face is in your input image (larger face = go for higher denoise strength).
- Play around a lot with CFG and prompt weights. You can crank CFG to the 10-20 range to improve likeness at the cost of potentially introducing visual artifacts (can be counteracted to some extent by increasing inference steps). Likewise, you can apply more weight to your finetuned object by writing *:10 or *:100 etc, in your prompt.
- k_euler_a sampling method seems to be the best for photorealistic people.
- For best results, take an image from SD and throw it into a traditional faceswap solution like SimSwap or sber-swap.
Hope that helps.
Well... I already heard that; it doesn't say much without a comparison of the actual photo and the SD output. Even the paper doesn't show results with human subjects. Some people claimed to have done it, but when I looked at the pics, the SD output was not the person in the training data images. Vaguely, yes, it had the same skin colour and a similar haircut, but the proportions of the face against the nose and lips were all mixed up from result to result. So I stand by what I wrote: this method, so far, is not capable of finetuning a human likeness and synthesizing it in SD, until proven otherwise. I don't mind training for a long time; I just want to know if I'll be wasting my time and blocking a GPU for nothing if I'll never be able to get at least 90% likeness. Almost all, if not all, results I've seen look like derivatives/mutations of the subjects and not like the actual subject. Identity loss is one of the biggest issues in face synthesis and restoration, and few have managed to solve it. I trained 3-4 subjects with about 30k iterations each; the results were not successful (well, it did "learn" them, but they looked like mutations of the subjects), apart from one where I trained a style, which was a bigger success. So for now I'd wait until I see someone pushing finetuning and proving it can be done, and that you can synthesize a finetuned face that looks like the one in the original images.
Here's what you can try to verify that textual inversion can create a convincing likeness. First of all, train at 256x256 pixels with larger batch sizes; depending on your GPU you can easily train 4x as fast, so you'll see results sooner. The downside is that only the ddim sampler really works with the final result, but I feel that's an acceptable tradeoff if your main goal is just to check whether it's even possible. Also bump up num_vectors_per_token a bit; if you're not worried about overfitting you can even push it to ridiculous levels like 256 (edit: I've now learned that putting this higher than 77 is useless because SD has a limit of 77 tokens per input). The result is that you'll get a convincing likeness way quicker, but it'll never deviate much from the original photos, and style transfer may be impossible.
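For reference, a hedged sketch of roughly where those knobs live in the training config (e.g. configs/stable-diffusion/v1-finetune.yaml in this repo); key names may differ slightly in your copy, so match them against your own file:

    model:
      params:
        personalization_config:
          target: ldm.modules.embedding_manager.EmbeddingManager
          params:
            num_vectors_per_token: 8   # more vectors = faster likeness, but more overfitting
    data:
      params:
        batch_size: 4                  # larger batches fit once the resolution is dropped
        train:
          params:
            size: 256                  # train at 256x256 instead of 512x512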
I've fiddled a lot with all kinds of parameters and have gotten results that are all over the place; with the 256x256 method I can iterate pretty quickly, but the end result is always overfitting. For example, most of the photos I used were in an outdoor setting, and textual inversion thus inferred that being outdoors was such a key feature that it would try to replicate the same outdoor setting in every generation. I thought that maybe adding A * man outdoors (and variations) would help in separating the location from the token, but I feel that it only reinforces it, because now generated images in an outdoor setting score even higher on matching the prompt.
I think that's largely where the problem lies; apart from the initial embedding from the 'initializer word', there's no way to 'steer' training towards a particular subject. When using a conditioning prompt like A * man outdoors with a red shirt, the conditioning algorithm doesn't know that it can disregard the "red shirt" part and that it should focus on the magic * that makes the difference between the encoding of a regular man and myself.
I don't know if it would be possible to basically train on two captions for each image; for example, we apply a * man outdoors in a red shirt and a man outdoors in a red shirt (without the *) and then take only the difference between the encodings instead of the entire thing.
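To make that idea concrete, here is a minimal, hedged sketch using the same family of CLIP text encoder that SD conditions on; in practice the learned * embedding would have to be injected by the embedding manager, and the token positions shift after the placeholder, so treat this purely as an illustration of taking the difference between the two encodings:

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    def encode(prompt: str) -> torch.Tensor:
        # Returns the per-token conditioning, shape [1, 77, 768].
        tokens = tokenizer(prompt, padding="max_length", max_length=77,
                           truncation=True, return_tensors="pt")
        return text_encoder(input_ids=tokens.input_ids).last_hidden_state

    with torch.no_grad():
        c_with = encode("a * man outdoors in a red shirt")     # caption with the placeholder
        c_without = encode("a man outdoors in a red shirt")    # same caption without it

    # The delta is what the placeholder adds on top of the generic caption;
    # one could imagine conditioning or regularising on this difference only.
    delta = c_with - c_without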
The two things that had the most success for me are:
- Replace the template string with a single {}
- Make sure you're using the sd-v1-4-full-ema.ckpt checkpoint
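For context, a minimal sketch of what the first point could look like, assuming the caption templates live in ldm/data/personalized.py as in this repo (the exact variable name may differ in your copy):

    # ldm/data/personalized.py - replace the long list of caption templates with a
    # single bare placeholder, so training conditions on the learned token alone.
    imagenet_templates_small = [
        "{}",
    ]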
I'm almost positive that the reason for overfitting in SD is that the conditioning scheme is far too aggressive. Simply letting the model condition itself on the single init word alone is sufficient in my opinion, and has always led to better results for me.
What's funny is that you're staying close to Stable Diffusion's ethos of heavy prompting, because conditioning this way means you have to come up with the correct prompt during inference, rather than letting the conditioned templates do the work.
Even if you have low confidence in this method, I'd say it's most certainly worth looking into. I'm also certain that PTI integration will mitigate a lot of these issues (it's a very cool inversion method if you haven't looked into it).
Well, I just fed it 2 pics of Stallone, and I'm closer than I ever was with any face after 1500 iters, but it's 256 size and 50 vectors, with two init words: face, photo.
So I have a plan: once it reaches the likeness of the reconstruction images, I will feed it 512 images. Can I swap the size like that when continuing finetuning, from 256 to 512?
But I must say that the reconstruction at 256 res is not looking too good though; it lost the likeness a bit. This one looks better at 512 res. The image at the bottom is a reconstruction, not an actual sample; it's how the model interpreted the original image, and it trains from this:
I'd say 2 photos is actually not enough for training a likeness; I use around 10-20 pictures for my experiments. For the 256x256 method it works best to mix in a few extreme closeups of the face so that the AI can learn the finer details. I don't actually know if starting at 256x256 and then resuming at 512x512 is possible - I think it should be, though, because that's how SD was trained in the first place. For init words, I don't think "photo" is very good - I'm using "man" and "face" for that purpose - because those are the things that I want the AI to learn. Nevertheless, 1500 iterations isn't very much. I usually get the best results at around 3000.
Yes, I'll try that. It's also strange that I can't have a batch size of 2 with 11GB of VRAM at 256 res.
Does batch size affect the training? I think if it sees more images at once it learns better? If that's the case I'd try it on Colab Pro.
I also tried man, face, but I wanted it to know that it's a photo version, a photo style, so maybe with that it could be edited more easily with styles.
I have a tight close-up of the face (jaw to chin) so I can show it the likeness better at that res now. I noticed in SD you lose likeness in a medium shot, but in a macro close-up you get the best likeness of a person.
Well... I'm quite impressed now; I've barely started and that's the result at epoch 4 and 1500 iters. How many epochs do you recommend?
Sorry to hijack like this, but I'm sure more people will come, so I think this could be useful for them to read.
OK, so far from what I see... you should have mostly macro face close-ups to get the best identity: no ears visible besides one image like that Stallone pic above; the rest should be very tight close-ups of the face, probably even tighter than this one below.
I'll try to resume and give it an even tighter one, or start over with only tight macro shots of the face, since I'm training mostly the face and 256 is a bit low.
Wow, this is pretty good, way above my expectations.
Oh crap, this side shot looks too good; I wonder how editability will work.
OK... I think that proves it: you can actually train a human face and retain identity... this result is beyond what I expected, and it has barely started finetuning.
OK, so if anyone wants to get good results - drop the resolution to 256 at the bottom of the yaml file:

    train:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 256
Also use the init word "face", then actually give it face shots, not head shots. I got most images like this one and maybe two of the whole head, but the majority is framed from eyebrows to lower lip.
OK, final result, 11k iterations. I almost fell off my chair when I saw this result. Most of my images were framed from hairline to jawline, plus 2 images of the full head; 10 images overall.
~~Well, this certainly is an interesting discovery.~~
~~So this could theoretically prove that you need to fine tune on the base resolution Stable Diffusion was trained on, and not the upscaled res (512). Either way, this shouldn't have caused the issues people have been having at the higher resolution, so I wonder why this is? I'll have to read through the paper again to figure it out.~~
Edit: Tested this and figured I'm wrong here. It simply allows for better inversion, which the model is fully capable of. The real issue is adding prompts to the embeddings, which is still WIP.
"Drop resolution to 256 at the bottom of the yaml file" - should the provided training images also be resized to 256?
altryne, is it because of the 50 vectors that I used, or because of the 256 res drop? Which one is more responsible for this? I restarted tuning; I had it at 1 vector, and now, compared to 50 vectors, I'd say the vector count makes the most difference. But what's the downside of using so many vectors? What's the most sane amount I can use and still get reasonable editability? You can pretty much tell from the first 3 samples that you will get likeness; now I'm trying 20 vectors.
So does anyone here know how to properly work with this? This is a [50, 768] tensor. All embeddings I've seen before are [1, 768]. Are you supposed to insert all 50 into the prompt, taking up 50 of the available 75 tokens? All the code that I've seen fails to actually use this embedding, including this repository, failing with this error:
Traceback (most recent call last):
File "stable_txt2img.py", line 287, in <module>
main()
File "stable_txt2img.py", line 241, in main
uc = model.get_learned_conditioning(batch_size * [""])
File "B:\src\stable_diffusion\textual_inversion\ldm\models\diffusion\ddpm.py", line 594, in get_learned_conditioning
c = self.cond_stage_model.encode(c, embedding_manager=self.embedding_manager)
File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 324, in encode
return self(text, **kwargs)
File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 319, in forward
z = self.transformer(input_ids=tokens, **kwargs)
File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 297, in transformer_forward
return self.text_model(
File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 258, in text_encoder_forward
hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids, embedding_manager=embedding_manager)
File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 183, in embedding_forward
inputs_embeds = embedding_manager(input_ids, inputs_embeds)
File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "B:\src\stable_diffusion\textual_inversion\ldm\modules\embedding_manager.py", line 101, in forward
embedded_text[placeholder_idx] = placeholder_embedding
RuntimeError: shape mismatch: value tensor of shape [50, 768] cannot be broadcast to indexing result of shape [0, 768]
I manually inserted those 50 embeddings into the prompt in order, and I am getting pictures of Stallone, but they all seem very same-y, which looks like overfitting to me, though I don't know if it's that or me working with those embeddings incorrectly.
Here are 9 pics, all with different seeds:
You also have to update num_vectors_per_token in v1-inference.yaml to the same value you trained with.
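For reference, a hedged sketch of roughly where that setting lives in v1-inference.yaml; the surrounding keys may differ slightly between copies of the repo, so match it against your training config:

    model:
      params:
        personalization_config:
          target: ldm.modules.embedding_manager.EmbeddingManager
          params:
            placeholder_strings: ["*"]
            num_vectors_per_token: 50   # must match the value used during training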
With 50 vectors per token, extreme overfitting is to be expected; I'm currently trying to find the right balance between the very accurate likeness with many vectors and the more varied results of fewer vectors. The codebase also contains an idea of 'progressive words' where new vectors get added as training progresses which might be interesting to explore.
Oh, also: a .pt file trained with 256x256 images only really works well with the ddim sampler; given enough vectors it'll look "acceptable" with k_lms, but if you want the same quality you got with the training samples, use ddim.
Another thing I've been experimenting with is different initializer words per vector - for example, I set num_vectors_per_token to 4 and then pass "face", "eyes", "nose", "mouth" as the initializer words in the hope that each of the vectors will focus on one specific part of the likeness. So far I'm not sure I'd call it a success, but at this point I'm just throwing every random idea I get at it.
Ah. That did the trick, thank you. If anyone cares, here are 9 images produced by this repo's code on the Stallone embedding:
DDIM. Previous pic I posted was using euler ancestral from k-diffusion.
I used just * as prompt in both cases.
I'm currently quick-testing whether I can still edit a style when using 5 vectors. Are the cloned heads the result of 256 training? Can I resume training and change it to 512, or will it start over from 0 after I change to 512? Also, did spreading 4 vectors across 4 init words help? Maybe I made a mistake by using "face, photo" as init words and it pushed him deep into the photo realm; I will try a vague "male".
OK, with 5 vectors I managed to pull a person out of photo style into an anime style, but it's very hard; it needs more repetitions of "anime style" than usual, so I'd say 5 is already too much, but with 5 the likeness is crap... so that's that. I think with all 77 vectors you will get great likeness right away, but there won't be any room left for editability. I'll try training for a short time with the highest vector count, then I'll try to spread the init words while using high vectors. I will also try another method: using more precise init words like lips, cheeks, nose, nostrils, eyes, eyelids, chin, jawline, whatever I can find, together with high vectors; maybe it will spread into the details more and leave the style up to editing.
Overwhelming overfitting with the prompt: from what I see, if you use 50 vectors, you've just spent 50 words of the prompt on your subject being a photograph of a man, so you have something like 27 left to skew it into a painting or a drawing? So you have to overwhelm it hard to change the style. Or it might be that you have to use over 50 words to overwhelm it; there's definitely a ratio, because I can overwhelm low-vector results faster. This is 50 vectors:
Try playing with prompt weights in the webui?
Started over: 2 vectors, 256 res. It's at epoch 36 and 48k iters; will it be more editable than 50 vectors? We will see. I don't like the mirroring thing; how do I turn it off? His face is not identical when flipped.
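The mirroring most likely comes from the dataset's random horizontal flip. A hedged sketch of how to disable it, assuming PersonalizedBase in ldm/data/personalized.py exposes a flip probability parameter (commonly flip_p; check the class signature in your copy):

    train:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 256
        flip_p: 0.0   # disable random horizontal flips so an asymmetric face is not mirrored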
OK, after testing for editability, the 50-vector one is the better way: it takes about the same amount of overwhelming to edit the style of the 50-vector embedding as it does for the 2-vector one, but it takes about an hour to train 50 vectors and about 8 hours to train 2 vectors to a satisfying identity of the subject on a 1080 Ti.
Training at 512 on an 11GB 1080 Ti is a waste of time; go with 256 res. Maybe it's a VRAM thing and a batch-size thing, but you won't get likeness at 512, not in one day anyway.
I guess overfitting is just a thing we have to live with for now; identity preservation is way more important IMO.
This is really interesting... I would like to ask, how do you resume training? I have been looking around for how to do that and can't find the answer. An example would be appreciated.
EDIT: Found my answer here: https://github.com/rinongal/textual_inversion/issues/38
Got around overfitting; that's not an issue anymore. Go with as many vectors as you like to speed up training. Got a new subject to train on. Style change is not an issue at all; it adapts even to cartoon styles. Res 448, will do 512 later on. You can also control the emotions of the face to make it smile.
@1blackbar how did you resolve the overfitting?
@1blackbar - looks great! Can you share what method you used to achieve this?
Looks like they're doing some sort of face swapping/inpainting rather than generating the whole image from scratch
When generating an embedding with more than 1 vector, is it possible to delete vectors and see what the difference is? Maybe training with a high vector count would be good if we could then remove the ones which seem to be associated with features we don't want.
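I'm not aware of a built-in way, but here is a rough, hedged sketch of how one might experiment with this, assuming the .pt file is a dict with a "string_to_param" mapping from the placeholder to an [N, 768] tensor, as this repo's EmbeddingManager appears to save it; print the loaded object first to confirm the keys, and remember that num_vectors_per_token at inference would have to match the pruned count:

    import torch

    ckpt = torch.load("embeddings.pt", map_location="cpu")
    params = ckpt["string_to_param"]              # e.g. {"*": tensor of shape [50, 768]}
    vectors = params["*"]
    print(vectors.shape)                          # torch.Size([50, 768])

    keep = [0, 1, 2, 3]                           # hypothetical choice of vectors to keep
    params["*"] = torch.nn.Parameter(vectors[keep].clone())

    torch.save(ckpt, "embeddings_pruned.pt")      # compare generations against the original file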