Can the inpainting model be used for txt2img?
I am busy porting the inpainting functionality into the InvokeAI distribution. One question that I have is whether the inpainting model can also be used for pure txt2img or img2img. Since both the inpainting model and standard 1.5 share a common cross-attention model, it would be nice not to have to switch back and forth between them when the user wants to do txt2img versus inpainting.
Thanks in advance.
The inpainting model can be used for pure txt2img and img2img, but in my experience the standard 1.5 model produces better results. If switching between models is really that unpleasant, can't you hide the model swap from the user?
There's really no escaping the fact that these models have been trained to perform drastically different tasks. The common cross-attention model has been altered:
sd-v1-5.ckpt: Resumed from sd-v1-2.ckpt. 595k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
sd-v1-5-inpaint.ckpt: Resumed from sd-v1-2.ckpt. 595k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling. Then 440k steps of inpainting training at resolution 512x512 on “laion-aesthetics v2 5+” and 10% dropping of the text-conditioning. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and in 25% mask everything.
https://huggingface.co/runwayml/stable-diffusion-inpainting
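For what it's worth, those extra input channels also give you a cheap way to hide the model swap: the two checkpoints can be told apart programmatically by inspecting the first UNet convolution in the state dict, since the inpainting variant expects 9 input channels (4 latent + 4 masked-image latent + 1 mask) instead of 4. A rough sketch, assuming the usual ldm .ckpt layout (the helper name is just for illustration):

```python
import torch

def is_inpainting_checkpoint(ckpt_path: str) -> bool:
    """Return True if the checkpoint's UNet expects the 9-channel
    inpainting input (4 latent + 4 masked-image latent + 1 mask)."""
    sd = torch.load(ckpt_path, map_location="cpu")
    sd = sd.get("state_dict", sd)
    first_conv = sd["model.diffusion_model.input_blocks.0.0.weight"]
    return first_conv.shape[1] == 9
```

With something like this, a frontend could route plain txt2img requests to sd-v1-5.ckpt and masked requests to sd-v1-5-inpaint.ckpt without the user ever having to pick a model.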
I'm not convinced that is not purely subjective
It's mostly subjective. How do we objectively prove that one model outperforms the other? My opinion is supported by the fact that one of these models, sd-v1-5-inpaint.ckpt, has been trained with partially masked images for hundreds of thousands of additional steps, and the other one hasn't.
A comparison of the different models (sd-v1-4, sd-v1-5, and sd-v1-5-inpaint) being used for txt2img can be seen here in a recent video from Aitrepreneur.
A comparison of the same models being used for inpainting can be seen here in the same video.
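If anyone wants to reproduce that kind of comparison themselves, the inpainting checkpoint can be driven as plain txt2img by fully masking a blank canvas, so the whole image is generated from the prompt. A minimal sketch, assuming the diffusers inpainting pipeline rather than InvokeAI's own code:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# A black init image plus an all-white mask means every pixel is
# "to be repainted", so the inpainting UNet behaves like txt2img.
blank = Image.new("RGB", (512, 512), (0, 0, 0))
full_mask = Image.new("L", (512, 512), 255)

result = pipe(
    prompt="a photograph of an astronaut riding a horse",
    image=blank,
    mask_image=full_mask,
).images[0]
result.save("txt2img_via_inpaint_model.png")
```

Running the same prompt and seed through sd-v1-5.ckpt's regular txt2img path then gives a side-by-side of the two models.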
Thanks, will take a look.