stable-diffusion-webui
Support for runwayml In-painting SD model.
A simple addition to support the new in-painting model released here: https://github.com/runwayml/stable-diffusion
We update the stable-diffusion dependency to point to the new repo and pass the required additional inputs to the model. It requires extra masked-image and mask inputs, which act as visual conditioning for the model. Setting the mask to all 1s can also be used for txt2img generation.
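For illustration only, a hedged sketch of how those two extra inputs could be assembled; the function name and the encode_to_latent callable are placeholders, and the mask convention (1 = region to repaint, all ones for txt2img) follows the description above:

import torch

def make_inpaint_conditioning(image, mask, encode_to_latent):
    # image: (B, 3, H, W) source image in [-1, 1]; mask: (B, 1, H, W), 1 = region to repaint.
    # encode_to_latent: callable mapping an image to its (B, 4, h, w) latent (e.g. the VAE encoder).
    masked_image = image * (1.0 - mask)                   # hide the region that will be repainted
    masked_image_latent = encode_to_latent(masked_image)  # 4 channels of image context
    latent_mask = torch.nn.functional.interpolate(mask, size=masked_image_latent.shape[-2:])  # 1 channel
    # Concatenated channel-wise with the 4-channel noise latent -> 9 UNet input channels.
    return torch.cat([latent_mask, masked_image_latent], dim=1)

Passing an all-ones mask reduces this to the txt2img case described above.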
Implemented
- K-Diffusion txt2img
- K-Diffusion img2img
- K-Diffusion inpaint
TODO
- VanillaStableDiffusionSampler updates
- Add a flag to detect if we need to create the masked tensors to save some memory.
- Fix the use_ema: False config option. Currently need to add use_ema: False in sd-v1-5-inpainting.yaml, otherwise the checkpoint will not load.
Have you tested the vanilla 1.4 model with this PR?
If the config .yaml needs to be changed, you can ship a config and use shared.cmd_opts.config to use that new config when loading the Runway model.
what is that extra masked-image?
Have you tested the vanilla 1.4 model with this PR?
Yes, I observe matching seed parity with the CompVis stable-diffusion repo. The only code path where the visual conditioning is used is the new hybrid conditioning, so it shouldn't affect any crossattn models. Although it might be worth it to only create the masks when they are actually needed.
https://github.com/runwayml/stable-diffusion/blob/main/ldm/models/diffusion/ddpm.py#L1431
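A hedged sketch of that lazy-creation idea (illustrative only; the attribute check relies on the conditioning_key stored on the CompVis DiffusionWrapper, and the dict layout mirrors the hybrid conditioning used by this model):

import torch

def maybe_add_image_conditioning(model, cond, latent):
    # Only allocate the dummy mask / masked-image tensors for hybrid-conditioned models;
    # plain cross-attention models (1.4, vanilla 1.5, WD) are left untouched.
    if getattr(model.model, "conditioning_key", None) != "hybrid":
        return cond
    b, _, h, w = latent.shape
    dummy_mask = latent.new_ones(b, 1, h, w)           # all ones = treat it as txt2img
    dummy_masked_image = latent.new_zeros(b, 4, h, w)  # no image context
    return {
        "c_crossattn": [cond],
        "c_concat": [torch.cat([dummy_mask, dummy_masked_image], dim=1)],
    }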
If the config .yaml needs to be changed, you can ship a config and use shared.cmd_opts.config to use that new config when loading the Runway model.
Ideally the config should not need to be changed. I originally misattributed the bug. LatentInpaintDiffusion in the yaml is fine, but the original sd-v1-5-inpainting.yaml is missing use_ema: False. This causes the checkpoint to be loaded incorrectly, effectively not loading the checkpoint at all.
what is that extra masked-image?
It provides the network with contextual information about the original image. Presumably this allows it to better fine-tune the in-painting, creating a more coherent image.
@random-thoughtss You can do sd_config.model.params.use_ema = False in sd_models.py after OmegaConf.load
I'm on the random-thoughtss branch, monkey-patched sd_config.model.params.use_ema = False into sd_models.py, and 1.4 loads now; the size mismatch persists for the "1.5" inpaint model.
Caveat: torch 1.12.1+rocm5.1, but it usually doesn't matter.
File "/home/cornpop/conda/envs/shit/lib/python3.9/site-packages/gradio/routes.py", line 275, in run_predict
output = await app.blocks.process_api(
File "/home/cornpop/conda/envs/shit/lib/python3.9/site-packages/gradio/blocks.py", line 787, in process_api
result = await self.call_function(fn_index, inputs, iterator)
File "/home/cornpop/conda/envs/shit/lib/python3.9/site-packages/gradio/blocks.py", line 694, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/cornpop/conda/envs/shit/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/cornpop/conda/envs/shit/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/cornpop/conda/envs/shit/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/cornpop/ml/stable-diffusion-webui/modules/ui.py", line 1633, in <lambda>
fn=lambda value, k=k: run_settings_single(value, key=k),
File "/home/cornpop/ml/stable-diffusion-webui/modules/ui.py", line 1488, in run_settings_single
opts.data_labels[key].onchange()
File "/home/cornpop/ml/stable-diffusion-webui/webui.py", line 40, in f
res = func(*args, **kwargs)
File "/home/cornpop/ml/stable-diffusion-webui/webui.py", line 85, in <lambda>
shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: modules.sd_models.reload_model_weights(shared.sd_model)))
File "/home/cornpop/ml/stable-diffusion-webui/modules/sd_models.py", line 252, in reload_model_weights
load_model_weights(sd_model, checkpoint_info)
File "/home/cornpop/ml/stable-diffusion-webui/modules/sd_models.py", line 169, in load_model_weights
missing, extra = model.load_state_dict(sd, strict=False)
File "/home/cornpop/conda/envs/shit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
size mismatch for model.diffusion_model.input_blocks.0.0.weight: copying a param with shape torch.Size([320, 9, 3, 3]) from checkpoint, the shape in current model is torch.Size([320, 4, 3, 3]).
That’s most likely due to our repo using the CompVis config. Try also adding:
sd_config.model.params.conditioning_key = hybrid
I think this model could also be used for outpainting with great effect.
sd_config = OmegaConf.load(checkpoint_info.config)
###monkey
sd_config.model.params.use_ema = False
sd_config.model.params.conditioning_key = hybrid
###
sd_model = instantiate_from_config(sd_config.model)
Vanilla python webui.py:
Traceback (most recent call last):
File "/home/cornpop/ml/stable-diffusion-webui/webui.py", line 161, in <module>
webui(cmd_opts.api)
File "/home/cornpop/ml/stable-diffusion-webui/webui.py", line 122, in webui
initialize()
File "/home/cornpop/ml/stable-diffusion-webui/webui.py", line 84, in initialize
shared.sd_model = modules.sd_models.load_model()
File "/home/cornpop/ml/stable-diffusion-webui/modules/sd_models.py", line 215, in load_model
sd_config.model.params.conditioning_key = hybrid
NameError: name 'hybrid' is not defined
Change hybrid to "hybrid"
size mismatch for model.diffusion_model.input_blocks.0.0.weight: copying a param with shape torch.Size([320, 9, 3, 3]) from checkpoint, the shape in current model is torch.Size([320, 4, 3, 3]).
I give up for now. Non-programmer trashing up the collabo isn't going to do any good.
Actually, that shouldn't happen. @random-thoughtss When you tested 1.4, did you change the model dimensions to match 1.4 inside the config?
We shouldn't break compatibility with 1.4, as 1.5 (which will release very soon now) uses the same dimensions.
@AUTOMATIC1111 Curious to hear your thoughts on this model.
My thinking is like this:
- Load the normal model at all times (whether that's vanilla 1.4, 1.5, WD or whatever)
- Add a checkbox to outpainting & inpainting
- If the user checks this checkbox, load the RunwayML model, run inference, unload (maybe dependent on a user setting).
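Roughly, the proposed flow could look like this (every name below is a placeholder for illustration, not an existing webui API):

class ModelSlot:
    # Placeholder for a loaded checkpoint; stands in for the currently loaded model.
    def __init__(self, name):
        self.name = name
    def infer(self, job):
        return f"{self.name}: {job}"

def load_checkpoint(name):
    return ModelSlot(name)

def run_job(job, use_inpaint_model, normal_name="sd-v1-4.ckpt", keep_loaded=False):
    if not use_inpaint_model:
        # Checkbox unticked: keep using the normal model (1.4, 1.5, WD, ...).
        return load_checkpoint(normal_name).infer(job)
    runway = load_checkpoint("sd-v1-5-inpainting.ckpt")  # swap in on demand
    result = runway.infer(job)
    if not keep_loaded:
        # "maybe dependent on a user setting": swap the normal model back in afterwards.
        load_checkpoint(normal_name)
    return result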
sd_config.model.target = "ldm.models.diffusion.ddpm.LatentInpaintDiffusion"
sd_config.model.params.use_ema = False
sd_config.model.params.conditioning_key = "hybrid"
sd_config.model.params.unet_config.params.in_channels = 9
This is all that's needed to load it as-is. I've had better results outpainting with this model than inpainting, but that's probably a skill issue. (Hilariously, poor man's outpainting seems to work better than outpainting mk2 with this model.)
We also don't need to switch to the RunwayML repo for this. We can continue our proud tradition of hijacking the CompVis repo. I wrote some working code performing just that.
oxy: switching to a different repo is a big step. I need to grab his branch and check if it really is a lot better; then there can be some considerations.
also is the sd 1.5 the finetuned 1.5 model that emad keeps from being released?
We don’t need to switch repos. I wrote working hijacking code for this.
1.5 is (much like 1.4) just 1.2 but further along training.
1.4 is resumed from 1.2 and trained for ~270k steps I think, and 1.5 ~600k
that emad keeps from being released?
yes @AUTOMATIC1111
+modules/sd_hijack_loading.py
import math
import os
import sys
import traceback
import torch
import numpy as np
from einops import rearrange, repeat
from omegaconf import ListConfig
from modules import shared
import ldm.models.diffusion.ddpm
from ldm.models.diffusion.ddpm import LatentDiffusion
from ldm.util import exists

@torch.no_grad()
def get_unconditional_conditioning(self, batch_size, null_label=None):
    if null_label is not None:
        xc = null_label
        if isinstance(xc, ListConfig):
            xc = list(xc)
        if isinstance(xc, dict) or isinstance(xc, list):
            c = self.get_learned_conditioning(xc)
        else:
            if hasattr(xc, "to"):
                xc = xc.to(self.device)
            c = self.get_learned_conditioning(xc)
    else:
        # todo: get null label from cond_stage_model
        raise NotImplementedError()
    c = repeat(c, "1 ... -> b ...", b=batch_size).to(self.device)
    return c

class LatentInpaintDiffusion(LatentDiffusion):
    def __init__(
        self,
        concat_keys=("mask", "masked_image"),
        masked_image_key="masked_image",
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        self.masked_image_key = masked_image_key
        assert self.masked_image_key in concat_keys
        self.concat_keys = concat_keys

    @torch.no_grad()
    def get_input(
        self, batch, k, cond_key=None, bs=None, return_first_stage_outputs=False
    ):
        # note: restricted to non-trainable encoders currently
        assert (
            not self.cond_stage_trainable
        ), "trainable cond stages not yet supported for inpainting"
        z, c, x, xrec, xc = super().get_input(
            batch,
            self.first_stage_key,
            return_first_stage_outputs=True,
            force_c_encode=True,
            return_original_cond=True,
            bs=bs,
        )

        assert exists(self.concat_keys)
        c_cat = list()
        for ck in self.concat_keys:
            cc = (
                rearrange(batch[ck], "b h w c -> b c h w")
                .to(memory_format=torch.contiguous_format)
                .float()
            )
            if bs is not None:
                cc = cc[:bs]
                cc = cc.to(self.device)
            bchw = z.shape
            if ck != self.masked_image_key:
                # resize the binary mask to the latent resolution
                cc = torch.nn.functional.interpolate(cc, size=bchw[-2:])
            else:
                # encode the masked image into latent space
                cc = self.get_first_stage_encoding(self.encode_first_stage(cc))
            c_cat.append(cc)
        c_cat = torch.cat(c_cat, dim=1)
        all_conds = {"c_concat": [c_cat], "c_crossattn": [c]}
        if return_first_stage_outputs:
            return z, all_conds, x, xrec, xc
        return z, all_conds

def do_hijack():
    ldm.models.diffusion.ddpm.get_unconditional_conditioning = get_unconditional_conditioning
    ldm.models.diffusion.ddpm.LatentInpaintDiffusion = LatentInpaintDiffusion
sd_models.py

from modules.sd_hijack_loading import do_hijack

in load_model:

if str(checkpoint_info.filename).endswith("inpainting.ckpt"):
    do_hijack()
    sd_config.model.target = "ldm.models.diffusion.ddpm.LatentInpaintDiffusion"
    sd_config.model.params.use_ema = False
    sd_config.model.params.conditioning_key = "hybrid"
    sd_config.model.params.unet_config.params.in_channels = 9
Since you researched it, do you mind writing a paragraph or so about what it does differently, apart from using a new model?
I haven't researched this model very long. As far as I can see, it adds 5 (1 + 4) new input channels for inpainting and fine-tunes for that.
Personally, I think it's a big improvement for outpainting, at least.
Oh, do you mean the code? Not much, the star of the show is the model. The code is almost entirely enablement code.
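To make that channel arithmetic concrete (and to show where the size-mismatch error quoted above comes from), here is a hedged toy example: the inpainting UNet's first convolution expects 4 + 1 + 4 = 9 input channels, while a vanilla SD UNet expects only 4.

import torch

b, h, w = 1, 64, 64
noise_latent = torch.randn(b, 4, h, w)         # the usual 4 latent channels
mask = torch.ones(b, 1, h, w)                  # 1 extra mask channel
masked_image_latent = torch.zeros(b, 4, h, w)  # 4 extra masked-image latent channels

unet_input = torch.cat([noise_latent, mask, masked_image_latent], dim=1)
first_conv = torch.nn.Conv2d(9, 320, kernel_size=3, padding=1)  # weight shape [320, 9, 3, 3]
print(first_conv(unet_input).shape)  # torch.Size([1, 320, 64, 64])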
Here's an outpainting result (poor man's outpainting, 100 steps)
[image: outpainting result]
It can even outpaint twice without breaking down, something I've never been able to do with raw SD.
[image: second outpainting pass]
I should have probably mentioned that the original config for the in-painting model was not released alongside the checkpoint, but it can be found here: https://raw.githubusercontent.com/runwayml/stable-diffusion/main/configs/stable-diffusion/v1-inpainting-inference.yaml
This config works with the current repo, with the additional use_ema: False.
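For reference, a minimal sketch of applying that extra flag in code instead of editing the .yaml (the config path is illustrative, and it assumes an ldm package that actually provides LatentInpaintDiffusion, i.e. the RunwayML repo or the hijack above):

from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

sd_config = OmegaConf.load("configs/stable-diffusion/v1-inpainting-inference.yaml")  # illustrative path
sd_config.model.params.use_ema = False  # the one addition the published config is missing
sd_model = instantiate_from_config(sd_config.model)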
sd_config.model.target = "ldm.models.diffusion.ddpm.LatentInpaintDiffusion"
sd_config.model.params.use_ema = False
sd_config.model.params.conditioning_key = "hybrid"
sd_config.model.params.unet_config.params.in_channels = 9
These manual changes by @C43H66N12O12S2 replicate all of the changes RunwayML made to their config. Would it be better to
- hard-code these changes in the monkey patch?
- Provide instructions on how to change the RunwayML config?
- Force just use_ema and let the user figure out the config?
Just a side note: reload_model_weights needs to be modified as well, or switching won't work if the initial model is a "normal" model. The easiest, if not elegant, way to achieve that would be:
if sd_model.sd_checkpoint_info.config != checkpoint_info.config or checkpoint_info.filename.endswith("inpainting.ckpt"):
Actually, the reverse will fail as well (switching from runway to any other model with 4 channels)
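A hedged sketch of a check that covers both directions (should_hijack_inpainting is an illustrative helper, not existing webui code):

def should_hijack_inpainting(checkpoint_info):
    return str(checkpoint_info.filename).endswith("inpainting.ckpt")

def needs_full_model_reload(sd_model, checkpoint_info):
    # Reload when the config changes, or when we cross the boundary between the
    # inpainting checkpoint and a normal 4-channel checkpoint in either direction.
    return (
        sd_model.sd_checkpoint_info.config != checkpoint_info.config
        or should_hijack_inpainting(checkpoint_info) != should_hijack_inpainting(sd_model.sd_checkpoint_info)
    )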
Also, we should add credit to the RunwayML repo in sd_hijack_loading.py
Aside from those minor adjustments, this PR is close to ready. Just need to support vanilla samplers.
Seems to not work with the txt2img hires fix, but that's not the use case for this model anyway.
Hmm... if I check out
c6f4a873d7c8a916814e3201044b84b72e09769a
and save https://raw.githubusercontent.com/runwayml/stable-diffusion/main/configs/stable-diffusion/v1-inpainting-inference.yaml (with the additional use_ema: False parameter)
as {models}/sd-v1-5-inpainting.yaml,
I get the error:
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [320, 9, 3, 3], expected input[2, 4, 64, 64] to have 9 channels, but got 4 channels instead
Were there other changes needed to get this working?
Why was this closed? Is there another version in the works?
bump, we need this outpainting quality, it's crazy good
[image: outpainting example]
Why was this closed? Is there another version in the works?
Because the merge was totally botched. This needs a deep cleanup.
Follow https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/3192 for the proper PR.
Yup, this repo got messed up. The new PR continues the work.
@AUTOMATIC1111 GitHub support says they can remove the dead commits from the PR and keep the discussion if you permit it.