
Per-image tokens with a larger dataset.

LexCybermac opened this issue · 7 comments

I'm curious about trying per-image token training with my dataset; however, the per_img_token_list in personalized.py is only 22 entries long, which poses a problem: there aren't enough tokens to assign to the samples in my set.

I have considered supplementing the list by hand, which would be incredibly tedious but perhaps doable given that the set I want to train with has only 912 samples. Other datasets I'm interested in experimenting with, though, run past half a million samples, for which a manual approach simply wouldn't suffice. Is there any possibility of generating unique multi-character tokens on the fly when starting training, with the count depending on dataset size?
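For illustration, here is a minimal sketch of one way such tokens could be generated. The function name and the "tok-" prefix are hypothetical, not from the repo, and whether multi-character placeholder strings survive the tokenizer as usable placeholders depends on how the embedding manager resolves them, so treat this as a sketch rather than a drop-in replacement for per_img_token_list:

```python
import itertools
import string

def generate_per_img_tokens(n, alphabet=string.ascii_lowercase):
    """Return n unique placeholder tokens: 'tok-a', ..., 'tok-z', 'tok-aa', ...

    Token length grows automatically once shorter combinations run out,
    so the same scheme covers 912 samples or half a million.
    """
    tokens = []
    for length in itertools.count(1):
        for combo in itertools.product(alphabet, repeat=length):
            tokens.append("tok-" + "".join(combo))
            if len(tokens) == n:
                return tokens

# e.g. in personalized.py: per_img_token_list = generate_per_img_tokens(912)
```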

LexCybermac · Aug 26 '22 13:08

The paper says the best results come from up to 5 images, so having a large dataset won't make it better. I'm now trying to train the head, armor, and other parts of my subject's body separately and then merge them together into one .pt file. See page 21 of https://arxiv.org/pdf/2208.01618.pdf
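For anyone attempting the same merge, here is a minimal sketch of combining several embedding checkpoints into one file. It assumes each .pt stores its learned embeddings under a "string_to_param" dict, as the repo's EmbeddingManager does (verify the key name against your checkout; the function and file names are illustrative):

```python
import torch

def merge_embeddings(paths, out_path):
    """Combine several textual-inversion .pt checkpoints into one file."""
    merged = {}
    for p in paths:
        ckpt = torch.load(p, map_location="cpu")
        # assumed layout: {"string_to_param": {placeholder: embedding tensor}};
        # the repo's checkpoints also carry "string_to_token", omitted here
        for token, emb in ckpt["string_to_param"].items():
            if token in merged:
                raise ValueError(f"duplicate placeholder {token!r} in {p}")
            merged[token] = emb
    torch.save({"string_to_param": merged}, out_path)

merge_embeddings(["head.pt", "armor.pt", "body.pt"], "merged.pt")
```

Note that if each part was trained with the repo's default placeholder "*", the placeholder strings will collide, so you would need to rename them (and use distinct placeholders in your prompts) before merging.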

1blackbar · Aug 26 '22 15:08

You're quite right to highlight that the paper concluded ~5 image samples yield the best results, striking a middle ground between an embedding so over-optimised that it can't easily be edited or influenced by prompts and one that, well, looks good.

At the same time, I don't believe they experimented with datasets at the scale of those I have on hand (which is fair, as that would be outside the scope of their project), and I feel a strong inclination to experiment and find out how working with larger sets plays out.

I've trained overnight or so in the standard fashion on my dataset and found the results lacklustre, which is no surprise given that the training time required would be proportional to set size. However, I've not yet had a chance to experiment with per-image tokens, and I think the concept has great potential for incredibly large, varied datasets: the whole point of having a set this size is that I want it to learn the concept common across deliberately varied samples rather than the specifics of any one.

In this particular case I've built up a dataset of a specific line of Star Wars action figure boxes, acquired mainly from eBay listings, with tremendous variety in background, camera focal length, lighting, angle, and so on. My hope is that I can train an embedding to the point that it accurately represents the general concept of one of these boxes by taking on the elements common to all samples rather than getting hung up on details that vary from sample to sample. Per-image tokens are described as an idea that could go hand in hand with this goal, so even if it's unlikely to work I'm very eager to try it.
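To make the mechanism concrete, here is a sketch of how per-image tokens enter the training captions, modelled on the repo's personalized.py (the template strings are paraphrased from memory of imagenet_dual_templates_small; check your checkout for the exact wording, and the token list below is illustrative):

```python
import random

placeholder = "*"                        # shared concept token, learned across all samples
per_img_token_list = ["a1", "a2", "a3"]  # illustrative; one unique token per sample

def caption_for(sample_idx):
    """Build a training caption pairing the shared token with a per-image one."""
    per_img = per_img_token_list[sample_idx]
    # the shared token soaks up what all images have in common, while the
    # per-image token gives sample-specific details somewhere else to go
    return random.choice([
        f"a photo of a {placeholder} with {per_img}",
        f"a rendition of a {placeholder} with {per_img}",
    ])

print(caption_for(0))  # e.g. "a photo of a * with a1"
```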

LexCybermac · Aug 26 '22 17:08

I've tried training on a dataset of 18,000 images for 16 hours and it came out fine, but that was for a style; I'm not sure whether it would work for subjects.

If you want to finetune a text-conditional/image model itself, I recommend waiting a while until someone makes one.

nicolai256 · Aug 26 '22 17:08

Please post an update on your experiments with training on your SW figures; this is all pretty new, so it would be good to know how it went for you. I trained just the head for 23k iterations on 5 images, with a yaml whose init words were heman, head, haircut, as those are the three things I most want to address, but I don't think it did much better than 36k iterations on whole-body images with the pretty vague init word "he-man" and 25 pics of the body plus some close-ups.

In the end, Stable Diffusion results are so fluid that it's hard to retain the identity of a subject. I think inversion is quite different from actual training: it injects maybe 30-40% of new data, and the remaining 60% is mixed with what the model was already trained on. New objects that are unique might look more impressive after training because of their uniqueness, but they will still be derivative versions of the original images. I'm not even sure you can get 80-90% identity with inversion of this type.

There was a girl training a pixel art style on latent diffusion who went for 70k iterations, I think; maybe I'm not training long enough, I don't know. I kind of lost patience after 36k iterations and mediocre results. I wish there were a proper guide on how to train a specific person and their likeness.

1blackbar · Aug 26 '22 23:08

Training the identity of a subject is better with just 3-5 full-body images, I think, to be honest.

Art styles need a lot more images and iterations so the AI starts focusing more on the art style; if you want a specific character, I think it's best to use only a couple of images and not too many iterations. Try 5,000-7,000.

This is all pretty new; I'm sure someone will post a clear guide in the not-too-distant future.

I trained for 110k iterations, but again, that was only to randomize the art style.

nicolai256 · Aug 26 '22 23:08

OK, I'll maybe try with RoboCop, since he comes out pretty badly in SD despite a lot of his pics being in the dataset.

1blackbar · Aug 27 '22 10:08

10k iterations, and the results are pretty bad. In my opinion the code is not really meant to train identity, but rather a pretty vague representation of a subject. If it has one colour and a fairly random pattern, that's fine, but if it's a complex subject like a human being, with precise patterns or uniforms, it just doesn't work and won't learn your face. You would have to train the way they trained the original weights. I'd love to be proven wrong, but after roughly 40k iterations, multiple times, I more or less know what to expect. It works fine on stuff that can already be made with the vanilla weights and no embeddings just by using precise prompting, but if you want to inject a new face into the dataset - unlikely.

1blackbar · Aug 27 '22 21:08