
Model training results are not good

Lufffya opened this issue 3 years ago • 9 comments

My training data (attached photos): IMG_0855.HEIC, IMG_0856.HEIC, IMG_0857.HEIC, IMG_0858.HEIC, IMG_0859.HEIC

Log output images:

--- Epoch 2 --- samples_scaled_gs-000500_e-000002_b-000000
--- Epoch 10 --- samples_scaled_gs-002500_e-000010_b-000000
--- Epoch 20 --- samples_scaled_gs-005000_e-000020_b-000000
--- Epoch 30 --- samples_scaled_gs-007500_e-000030_b-000000

Is there any way to make it better?

Lufffya • Sep 13 '22 09:09

I believe I read that input images from multiple angles are actually detrimental to the training process. I think you'll have more success if you take pictures from one angle but against different backdrops/surfaces.

In any case I don't think we should expect miracles from Textual Inversion (for Stable Diffusion) right now; there's a lot of experimentation going on to find out what the optimal settings are and how to get more accurate results. For some objects we may even never get good results because what Textual Inversion can produce is limited by what was in the original SD training data.

oppie85 • Sep 13 '22 12:09

Well, the photos are quite bad; I wouldn't be able to make the subject out as an artist. Can you place it along straight isometric lines? Rotate the pictures so the cat's face is at the top, you know, like it's standing. I had a really hard time figuring this out as a human, so...

1blackbar • Sep 13 '22 14:09

The photos are quite bad; I wouldn't be able to make the subject out as an artist. Can you place it along straight isometric lines?

I think there's also some lens warping going on, which I've noticed can have a very detrimental effect on the likeness of human subjects. (i.e. one warped photo of your subject and SD will try to make a completely different-looking person.)

That said, Luffffffy, your results appear to be getting better after 30 epochs. The bottom left picture is starting to look like a screen with a cat-like frame. Some of my finetuning experiments required 50 or 60 epochs before achieving a reasonable degree of fidelity.

Also, what's your init word and number of vectors?
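
For context, as far as I understand this repo's config layout, the init word can be passed with --init_word at launch and the number of vectors is set in the finetune YAML. The excerpt below is only a rough sketch of where I believe those settings live; the exact file path and default values may differ between versions:

```yaml
# e.g. configs/stable-diffusion/v1-finetune.yaml (path and defaults may vary by version)
model:
  params:
    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManager
      params:
        placeholder_strings: ["*"]
        initializer_words: ["sculpture"]  # usually overridden by --init_word at launch
        per_image_tokens: false
        num_vectors_per_token: 1          # raise to give the new concept more capacity
```

Raising num_vectors_per_token to 2 or 3 is a common tweak when a single embedding vector can't capture the object, at the cost of using up more of the prompt's token budget.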

ThereforeGames • Sep 13 '22 14:09

I believe I read that input images from multiple angles are actually detrimental to the training process. I think you'll have more success if you take pictures from one angle but against different backdrops/surfaces.

In any case I don't think we should expect miracles from Textual Inversion (for Stable Diffusion) right now; there's a lot of experimentation going on to find out what the optimal settings are and how to get more accurate results. For some objects we may even never get good results because what Textual Inversion can produce is limited by what was in the original SD training data.

I see. I also wondered whether it was limited by SD's training data. Later I trained on a different, more common item and changed the background of the photos, and the results looked much better. But is it necessary to keep all the photos at the same angle? Maybe only three photos are needed. Thank you.

Lufffya • Sep 14 '22 03:09

Well, the photos are quite bad; I wouldn't be able to make the subject out as an artist. Can you place it along straight isometric lines? Rotate the pictures so the cat's face is at the top, you know, like it's standing. I had a really hard time figuring this out as a human, so...

Well, put that way, I also see that my photos are quite bad. I'll take some new pictures for the training set and try to keep them at the same angle. Thank you.

Lufffya • Sep 14 '22 03:09

The photos are quite bad; I wouldn't be able to make the subject out as an artist. Can you place it along straight isometric lines?

I think there's also some lens warping going on, which I've noticed can have a very detrimental effect on the likeness of human subjects. (i.e. one warped photo of your subject and SD will try to make a completely different-looking person.)

That said, Luffffffy, your results appear to be getting better after 30 epochs. The bottom left picture is starting to look like a screen with a cat-like frame. Some of my finetuning experiments required 50 or 60 epochs before achieving a reasonable degree of fidelity.

Also, what's your init word and number of vectors?

In fact, I trained for more than 200 epochs, but nothing was better than the results at 30 epochs. It seems to be a problem with this training set; I guess the SD model has never seen pictures like these. All parameters were left at their defaults, because I don't know how to adjust them. The init word is "tablet" (the full name is "LCD writing tablet", but it seems that multiple words cannot be set).

Lufffya • Sep 14 '22 03:09

@Luffffffy As others have stated, I'd try to make sure the images are roughly the same angle. Specifically, try to make sure the cat head is facing up (like in your first image). Feeding the model images rotated by 90 degrees tends to cause a mess.
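
A minimal preprocessing sketch along these lines, assuming the inputs are the iPhone HEIC photos from the first post and that pillow-heif is installed; the folder names and the per-image rotation values below are made-up placeholders you would adjust by eye:

```python
from pathlib import Path

from PIL import Image, ImageOps
from pillow_heif import register_heif_opener  # assumption: the pillow-heif package is installed

register_heif_opener()  # lets PIL open .HEIC files directly

SRC = Path("raw_photos")        # hypothetical folder with the IMG_*.HEIC files
DST = Path("training_images")   # hypothetical output folder for training
DST.mkdir(exist_ok=True)

# Extra per-image rotation (degrees counter-clockwise) so the cat's face ends up
# at the top in every picture -- these values are placeholders, not measurements.
EXTRA_ROTATION = {"IMG_0857": 90, "IMG_0858": 180}

for src in sorted(SRC.glob("*.HEIC")):
    img = Image.open(src)
    img = ImageOps.exif_transpose(img)                    # honour the EXIF orientation flag
    img = img.rotate(EXTRA_ROTATION.get(src.stem, 0), expand=True)
    img = ImageOps.fit(img.convert("RGB"), (512, 512))    # centre-crop to SD's 512x512
    img.save(DST / f"{src.stem}.jpg", quality=95)
```

The resulting folder of upright 512x512 JPEGs can then be used as the training image directory (--data_root, if I remember the README correctly).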

rinongal • Sep 14 '22 06:09

@Luffffffy As others have stated, I'd try to make sure the images are roughly the same angle. Specifically, try to make sure the cat head is facing up (like in your first image). Feeding the model images rotated by 90 degrees tends to cause a mess.

Thanks for your reply, I'll try.

Lufffya • Sep 14 '22 08:09

To add on from my experience, it's a balancing act between supplying variation and getting coherent reconstructions. That applies to the aforementioned factors of camera angle and background, as well as to how many images you supply. In cases where the object/style images are similar enough yet still distinct, adding more than 5 images can help.

GucciFlipFlops1917 • Sep 20 '22 17:09

Closing due to lack of activity. Feel free to reopen if you still need help.

rinongal • Oct 25 '22 19:10