Stable-Diffusion-Inpainting: Training Pipeline V1.5, V2
What does this PR do?
This functionality allows training/fine-tuning of the 9 channel inpainting models provided by
- https://huggingface.co/stabilityai/stable-diffusion-2-inpainting
- https://huggingface.co/runwayml/stable-diffusion-inpainting
This was motivated by noticing that many inpainting models provided to the community, e.g. on https://civitai.com/, have UNets with only 4 input channels. 4-channel models may lack capacity and, ultimately, quality on inpainting tasks. To support the community in developing fully fledged inpainting models, I have modified the text_to_image training pipeline to perform inpainting (a sketch of the resulting training step follows the additions below).
Additions:
- Added a random masking strategy (squares) during training and a center crop during validation
- Took the first 3 images of the pokemon dataset as the validation set
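For reference, here is a minimal sketch of how the training step can assemble the 9-channel UNet input from the noisy latents, the resized mask, and the masked-image latents. The helper names and the square-mask logic are illustrative, not the exact code in this PR:

```python
import torch
import torch.nn.functional as F


def make_square_mask(batch_size, height, width, device):
    """Illustrative random square mask per image: 1 = region to inpaint, 0 = keep."""
    mask = torch.zeros(batch_size, 1, height, width, device=device)
    for i in range(batch_size):
        size = torch.randint(height // 4, height // 2, (1,)).item()
        top = torch.randint(0, height - size, (1,)).item()
        left = torch.randint(0, width - size, (1,)).item()
        mask[i, :, top : top + size, left : left + size] = 1.0
    return mask


def prepare_inpainting_unet_input(pixel_values, vae, noise_scheduler):
    """Build the 9-channel input expected by the inpainting UNet from a batch of images."""
    mask = make_square_mask(pixel_values.shape[0], pixel_values.shape[2], pixel_values.shape[3], pixel_values.device)
    masked_images = pixel_values * (1.0 - mask)

    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    masked_latents = vae.encode(masked_images).latent_dist.sample() * vae.config.scaling_factor

    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Downsample the mask to the latent resolution so it can be concatenated channel-wise.
    mask_latent = F.interpolate(mask, size=latents.shape[-2:])

    # 4 (noisy latents) + 1 (mask) + 4 (masked-image latents) = 9 input channels,
    # in the same channel order StableDiffusionInpaintPipeline uses at inference time.
    unet_input = torch.cat([noisy_latents, mask_latent, masked_latents], dim=1)
    return unet_input, noise, timesteps
```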
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline?
- [x] Did you read our philosophy doc (important for complex PRs)?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
@sayakpaul and @patrickvonplaten
Examples: Out-of-Training-Distribution Scenery
Prompt: a drawing of a green pokemon with red eyes
Pre-trained
Fine-tuned
Prompt: a green and yellow toy with a red nose
Pre-trained
Fine-tuned
Prompt: a red and white ball with an angry look on its face
Pre-trained
Fine-tuned
Hi @patil-suraj @sayakpaul, I was wondering if this is something you'd be interested in looking into? Feedback is appreciated.
Cool! Gentle ping @patil-suraj
I've experimented with finetuning proper inpainting models before. I strongly urge you to read the LAMA paper (https://arxiv.org/pdf/2109.07161.pdf) and implement their masking strategy (which is what is used by the stable-diffusion-inpainting checkpoint). I used a very simple masking strategy like what you had for a long time and never got satisfactory results with my model until switching to the LAMA masking strategy. Training on simple white square masks will severely degrade the performance of the pretrained SD inpainting model.
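To illustrate what a LaMa-style irregular mask looks like in practice, here is a rough sketch of a brush-stroke mask generator. It only loosely follows the paper's recipe (LaMa mixes irregular strokes with box masks according to configurable probabilities), and all parameter values are illustrative defaults rather than the paper's configuration:

```python
import math
import numpy as np


def random_irregular_mask(height, width, max_strokes=4, max_vertices=8, max_width=40, rng=None):
    """Brush-stroke style mask loosely inspired by the LaMa irregular masks (1 = hole, 0 = keep)."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((height, width), dtype=np.float32)
    for _ in range(rng.integers(1, max_strokes + 1)):
        x, y = int(rng.integers(0, width)), int(rng.integers(0, height))
        brush_width = int(rng.integers(10, max_width))
        for _ in range(rng.integers(1, max_vertices + 1)):
            angle = rng.uniform(0, 2 * math.pi)
            length = int(rng.integers(10, max(height, width) // 4))
            nx = int(np.clip(x + length * math.cos(angle), 0, width - 1))
            ny = int(np.clip(y + length * math.sin(angle), 0, height - 1))
            # Draw a thick segment by stamping square patches along the line.
            steps = max(abs(nx - x), abs(ny - y), 1)
            for t in np.linspace(0.0, 1.0, steps):
                cx, cy = int(x + t * (nx - x)), int(y + t * (ny - y))
                y0, y1 = max(cy - brush_width // 2, 0), min(cy + brush_width // 2, height)
                x0, x1 = max(cx - brush_width // 2, 0), min(cx + brush_width // 2, width)
                mask[y0:y1, x0:x1] = 1.0
            x, y = nx, ny
    return mask
```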
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@sayakpaul I thought the simplest implementation would do, and the user could then decide which masking strategy to use. Sure, I will add that if it's a deal breaker.
@sayakpaul I have adapted the masking strategy from the LAMA paper on my local branch. I have a question: is it within the guidelines to keep the masking properties in a separate config file, like here: https://github.com/advimman/lama/blob/main/configs/training/data/abl-04-256-mh-dist-celeba.yaml#L10 ?
I feel it is a bit excessive and confusing to expose all of those property values as CLI arguments; it might clutter things and blur which arguments are model-specific and which are data-specific.
You are absolutely correct. What we can do is include a note about the masking strategy in the README and link to your implementation. Does that sound good?
I think we also need to add a test case here.
https://github.com/cryptexis/diffusers/blob/sd_15_inpainting/examples/inpainting/train_inpainting.py#L771 - in my repo I do not have anything similar under those lines. The piece of code you're referring to is here.
I see that https://huggingface.co/hf-internal-testing is used a lot in the tests. Are mere mortals able to add unit tests?
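For context, the example tests in diffusers typically launch a training script through accelerate against a tiny checkpoint and then assert on the saved outputs. A rough, hypothetical sketch of such a smoke test for this script; the tiny inpainting checkpoint id is a placeholder (a suitable 9-channel tiny pipeline would need to exist), and the real harness in examples/test_examples.py may differ:

```python
import os
import subprocess
import tempfile
import unittest


class TrainInpaintingSmokeTest(unittest.TestCase):
    def test_train_inpainting(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            # NOTE: the checkpoint id below is a placeholder, not a real repo;
            # a tiny 9-channel inpainting pipeline would have to be published first.
            # A much smaller dataset than the full pokemon set would be preferable in CI.
            cmd = [
                "accelerate", "launch",
                "examples/inpainting/train_inpainting.py",
                "--pretrained_model_name_or_path", "hf-internal-testing/tiny-stable-diffusion-inpaint-pipe",
                "--dataset_name", "lambdalabs/pokemon-blip-captions",
                "--resolution", "64",
                "--train_batch_size", "1",
                "--max_train_steps", "2",
                "--output_dir", tmpdir,
            ]
            subprocess.run(cmd, check=True)
            # Smoke-check that the fine-tuned UNet was written out.
            self.assertTrue(os.path.isdir(os.path.join(tmpdir, "unet")))
```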
Examples: Training with Random Masking
Inference with Square Mask (as before)
Prompt: a drawing of a green pokemon with red eyes
pre-trained stable-diffusion-inpainting
fine-tuned stable-diffusion-inpainting
pre-trained stable-diffusion-v1-5
fine-tuned stable-diffusion-v1-5 (no inpainting)
fine-tuned stable-diffusion-v1-5 (inpainting)
Inference with Random Mask
pre-trained stable-diffusion-inpainting
fine-tuned stable-diffusion-inpainting
pre-trained stable-diffusion-v1-5
fine-tuned stable-diffusion-v1-5 (no inpainting)
fine-tuned stable-diffusion-v1-5 (inpainting)
@cryptexis Thank you for providing the scripts and test cases. I want to train an inpainting model specifically for object removal, based on the sd1.5-inpainting model. The goal is to remove objects without using a prompt, just like the ldm-inpainting model. Although the sd1.5-inpainting model can achieve decent results with appropriate prompts (https://github.com/huggingface/diffusers/issues/973), it is often not easy to find those prompts, and the model tends to add extra objects.
Here's my plan right now:
- I will not modify the StableDiffusionInpaintPipeline code; all prompts used during training will be blank strings.
- The mask generation strategy will use methods from CM-GAN-Inpainting, which is better than LaMA for inpainting. First, a segmentation model processes the images to obtain object masks. Then, randomly generated masks will never completely cover an object (for example, using 50% IoU as a threshold; a rough sketch of this filtering follows below).
The generated mask looks like this:
I have not trained diffusion models before, any suggestions would be very helpful to me, thank you.
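A rough sketch of the object-aware mask filtering described in the plan above, assuming per-object masks from a segmentation model. It uses the fraction of each object that is covered rather than strict IoU, and the 0.5 threshold and helper names are illustrative:

```python
import numpy as np


def mask_is_valid(random_mask, object_masks, max_overlap=0.5):
    """Reject a random mask if it covers too much of any single object.

    random_mask:  (H, W) float array, 1 = region to inpaint.
    object_masks: list of (H, W) float arrays from a segmentation model, 1 = object.
    """
    for obj in object_masks:
        obj_area = obj.sum()
        if obj_area == 0:
            continue
        covered = (random_mask * obj).sum() / obj_area  # fraction of the object that is masked
        if covered > max_overlap:
            return False
    return True


def sample_training_mask(mask_generator, object_masks, max_tries=10):
    """Keep sampling random masks until one does not (almost) fully cover an object."""
    for _ in range(max_tries):
        mask = mask_generator()
        if mask_is_valid(mask, object_masks):
            return mask
    return mask  # fall back to the last sample if none passed the check
```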
Looking good. I think the only thing that is pending now is the testing suite.
@sayakpaul I worked on the tests yesterday and hit a wall. Then I tried to run the tests for text_to_image and hit the same wall.
Attaching the screenshot:
I was wondering if it is a systematic issue across all tests...
Had it been the case, it would have been caught in the CI. The CI doesn't indicate so. Feel free to push the tests and then we can work towards fixing them. WDYT?
BTW, for fixing the code quality issues, we need to run make style && make quality from the root of diffusers.
Done @sayakpaul, I think everything is addressed and the tests are pushed. Thanks a lot for the patience, support, and all the help!
How should the dataset be prepared?
With columns: image, mask, prompt?
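One possible way to lay out such a dataset with the datasets library, assuming a script that reads image, mask, and prompt columns. The column names are an assumption; the training script in this PR generates masks on the fly, so a separate mask column is only needed if you want to supply your own masks:

```python
from datasets import Dataset, Features, Image, Value

# Hypothetical layout: one image, one binary mask, one prompt per example.
# Column names are assumptions; check the column-name arguments of the script you use.
data = {
    "image": ["data/images/0001.png", "data/images/0002.png"],
    "mask": ["data/masks/0001.png", "data/masks/0002.png"],
    "prompt": [
        "a drawing of a green pokemon with red eyes",
        "a green and yellow toy with a red nose",
    ],
}
features = Features({"image": Image(), "mask": Image(), "prompt": Value("string")})
dataset = Dataset.from_dict(data, features=features)
dataset.save_to_disk("my_inpainting_dataset")  # or dataset.push_to_hub("<user>/my-inpainting-dataset")
```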
@cryptexis let's fix the example tests that are failing now.
Can anyone share a script for SDXL inpainting fine-tuning?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
When is this getting merged?
@cryptexis can you
- address the final comments here https://github.com/huggingface/diffusers/pull/6922#discussion_r1519390094 - if peft is not used we can remove it; otherwise we are all good
- make sure the tests pass
will merge once the tests pass!
@Sanster Thanks for your plan. I also want to fine-tune a Stable Diffusion inpainting model for object removal. Have you tried this? How is the performance?