textual_inversion
Artifacts arising from 256x256 data
Due to GPU constraints (RTX 3070, 8 GB VRAM), I lowered my training image dimensions to 256x256, half the 512x512 standard. Working with face photos, a problem showed up once training started: the results come back with doubled/tripled faces, i.e. two to three people per image.
I know the AUTOMATIC1111 repo has a process for scaling the seed and a "high-res fix". Is something similar possible during the embedding training process?
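For reference, this is roughly how I lowered the resolution. I'm assuming the usual finetune config layout here (the config path and the `data.params.*.params.size` keys), so adjust if your copy of the repo differs:

```python
# Rough sketch: override the training image size before launching main.py.
# Assumes the LDM-style config layout (data.params.train/validation.params.size);
# the config path and keys may differ in your checkout.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/latent-diffusion/txt2img-1p4B-finetune.yaml")

# Halve the 512x512 default to fit in 8 GB of VRAM.
cfg.data.params.train.params.size = 256
cfg.data.params.validation.params.size = 256

OmegaConf.save(cfg, "configs/latent-diffusion/txt2img-1p4B-finetune-256.yaml")
```

I then pass the new config via `--base` when starting training.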
I'm not sure I understood the problem and where you are seeing it. Is the encoder-decoder part (i.e. the reconstruction images in the log dir) creating additional faces in images with more than 1 person? Are you getting a random number of people in images produced with the learned embedding? Can you post some examples?
I unfortunately can't share pictures as I don't want to post my face, but the reconstructed images are fine. The issue is with the samples and samples_scaled outputs: both indeed show 2-3 people, even though my input photos are only of myself.
A minor fix is using 384x384 with close-up pictures of the face. Shots taken from farther away seem more prone to these issues. But any ideas why the artifacts appear in the same way they do when Stable Diffusion's output dimensions are pushed too far past 512?
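In case it helps anyone hitting the same thing, this is roughly the preprocessing I mean. It's just a plain PIL center-crop and resize; the folder names are placeholders, and a proper face detector would give tighter crops:

```python
# Rough sketch: center-crop source photos to a square and resize to 384x384
# so the training set is mostly tight face crops.
# Folder names are placeholders for illustration only.
from pathlib import Path
from PIL import Image

SRC = Path("raw_photos")       # placeholder input folder
DST = Path("training_384")     # placeholder output folder
DST.mkdir(exist_ok=True)

for img_path in SRC.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img = img.resize((384, 384), Image.LANCZOS)
    img.save(DST / img_path.name)
```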
Sorry, seems like I completely missed your follow-up here. Do you still need help with this issue?
Thanks for checking. I will be fine for now :] Truly it's a matter of waiting for optimizations to roll out at this stage.
You could take a look at https://github.com/AUTOMATIC1111/stable-diffusion-webui. They have an alternative implementation and plenty of optimizations. I think I saw someone say they managed to get it working on a 6 GB card.