diffusers Kohya Hires fix

Is it possible for diffusers to support this hires fix?

It looks like it works with 1.5 too.

https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13974

https://www.youtube.com/watch?v=SbgMwHDXthU

same seed at 1024x1024

This method will also improve the quality of inpainting without needing two passes.

Mar 10 '24 05:03 crapthings

without

with

Mar 10 '24 10:03 crapthings

I think the improvements are quite amazing TBH. Ccing @kohya-ss in case you would like to elaborate the motivation behind the fix a bit.

@yiyixuxu WDYT about shipping this?

Mar 11 '24 04:03 sayakpaul

this one is 1280 x 1280 with Kohya Hires.fix

Mar 12 '24 09:03 crapthings

Gently pinging @kohya-ss since you reacted to my comment. If you find a moment, would love some sort of a brief explanation about the feature. Would be really helpful!

Mar 13 '24 12:03 sayakpaul

Hi!

This method is extremely simple.

The distortion of composition at large resolutions can be attributed to the U-Net's limited capacity to handle feature maps of such sizes effectively.

In Stable Diffusion, it has been observed that the composition is primarily determined in the deeper parts of the U-Net, as evident from the effects of block merging explored by the community.

Furthermore, the previews of the denoising steps reveal that the composition is also heavily influenced by the timesteps closer to the noise (i.e., the larger timesteps) during the denoising process.

Based on these observations, this method proposes resizing the feature maps to a smaller size specifically at the timesteps close to the noise in the deeper parts of the U-Net. This allows the U-Net to process the feature maps more effectively, thereby mitigating the distortion of the composition.

Please refer to the following link for the specific implementation code. In the input blocks of the U-Net at certain depths, the hidden states are downsampled, and in the corresponding depths of the output blocks, they are upsampled.

https://github.com/kohya-ss/sd-scripts/blob/2d7389185c021bc527b414563c245c5489d6328a/library/sdxl_original_unet.py#L1192

In this implementation, two depths (ds_depth1 and ds_depth2) can be specified to apply the method, which remain effective up to the timesteps ds_timesteps_1 and ds_timesteps_2, respectively. ds_ratio represents the shrinking ratio.
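
As a rough illustration of how these parameters interact, the sketch below decides whether a given U-Net depth should shrink its feature map at a given timestep. It is not the linked implementation: `ShrinkSchedule`, `active_ratio`, and `shrunk_size` are hypothetical names, and it assumes each depth is active while the timestep is at or above its threshold, which may be simpler than how the real code combines the two conditions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ShrinkSchedule:
    # Hypothetical container; field names mirror the parameters named above.
    ds_depth_1: int
    ds_timesteps_1: int
    ds_ratio: float
    ds_depth_2: Optional[int] = None
    ds_timesteps_2: Optional[int] = None


def active_ratio(cfg: ShrinkSchedule, depth: int, timestep: int) -> Optional[float]:
    """Return the shrink ratio to apply at this depth and timestep, or None."""
    # Assumption: a depth shrinks only while the timestep is still close to
    # the noise, i.e. at or above that depth's threshold.
    if depth == cfg.ds_depth_1 and timestep >= cfg.ds_timesteps_1:
        return cfg.ds_ratio
    if (
        cfg.ds_depth_2 is not None
        and cfg.ds_timesteps_2 is not None
        and depth == cfg.ds_depth_2
        and timestep >= cfg.ds_timesteps_2
    ):
        return cfg.ds_ratio
    return None


def shrunk_size(height: int, width: int, ratio: float) -> Tuple[int, int]:
    """Spatial size of a feature map after shrinking (truncated, at least 1)."""
    return max(1, int(height * ratio)), max(1, int(width * ratio))
```

In an actual U-Net the matching output block would then resize the hidden states back up so that they line up with the skip connection at that depth.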

Additionally, laksjdjf has created a ComfyUI node for this method, which can be found at the following link: https://gist.github.com/laksjdjf/487a28ceda7f0853094933d2e138e3c6

Furthermore, although I have not personally used it, ComfyUI seems to officially support this method. Their node allows users to specify arbitrary depths, timesteps, and shrinking ratios, and it also appears to support cascading multiple instances of the node.

Mar 13 '24 13:03 kohya-ss

Thanks much for your detailed explanation. Really appreciate it.

Mar 13 '24 13:03 sayakpaul

Looks very interesting. Could this theoretically be used in the opposite way, to generate smaller images? When trying to generate small images with models that have been trained with high-res images, we often see the same kind of problems -- weird compositions, low quality. So, could the feature maps be resized to a larger size? So that it becomes possible to generate low-res images that have the same composition as high-res ones?

Mar 13 '24 13:03 jorgemcgomes

looking forward to this feature!

Mar 25 '24 03:03 blx0102

Can someone make a community pipeline for this first? Basically, you need to:

copy over the UNet2DConditionModel, rename it, and slightly modify the forward method to implement the Kohya hires fix
add a from_unet method, which is basically

@classmethod
def from_unet(
    cls,
    unet: UNet2DConditionModel,
    ds_depth_1 = ds_depth_1,
    ds_depth_2 = ds_depth_2,
    ...
):
    # start from the original UNet's config and add the hires fix settings
    config = dict(unet.config)
    config["ds_depth_1"] = ds_depth_1
    config["ds_depth_2"] = ds_depth_2

    # build the new UNet and reuse the original weights
    kohya_unet = cls.from_config(config)
    kohya_unet.load_state_dict(unet.state_dict())
    return kohya_unet

copy over the SD or SDXL pipeline, rename it, and slightly update the __init__ method

def __init__(self, ...):
    ...
    # swap the regular UNet for the hires fix variant before registering it
    unet = kohyaHiresFixUnet2DConditionalModel.from_unet(unet, **kwargs)
    self.register_modules(
        ...
        unet=unet,
    )

and then please play around with it, figure out how to configure it for SD and SDXL, and show us some results! Once we have the community pipeline that we can play with, depending on how good the results are and how users react to it, we can then decide if we want to integrate this fix into our Unet or not (should be pretty straightforward)
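
The from_unet pattern described above can be tried in isolation before touching diffusers. The sketch below uses toy stand-in classes (`ToyUNet` and `ToyUNetHiresFix` are hypothetical and are not diffusers classes) to show only the config-copy and weight-transfer steps, assuming the extra settings live in the config.

```python
import copy


class ToyUNet:
    """Stand-in for a configurable model: a config dict plus a weight dict."""

    def __init__(self, **config):
        self.config = dict(config)
        self.weights = {"conv_in": [0.0] * config.get("channels", 1)}

    @classmethod
    def from_config(cls, config):
        return cls(**config)

    def state_dict(self):
        return copy.deepcopy(self.weights)

    def load_state_dict(self, state):
        self.weights = copy.deepcopy(state)


class ToyUNetHiresFix(ToyUNet):
    @classmethod
    def from_unet(cls, unet, ds_depth_1, ds_depth_2=None):
        # Copy the config so the original model is left untouched.
        config = dict(unet.config)
        config["ds_depth_1"] = ds_depth_1
        config["ds_depth_2"] = ds_depth_2
        # Build the variant, then reuse the original weights.
        new_unet = cls.from_config(config)
        new_unet.load_state_dict(unet.state_dict())
        return new_unet


base = ToyUNet(channels=4)
base.weights["conv_in"] = [1.0, 2.0, 3.0, 4.0]
fixed = ToyUNetHiresFix.from_unet(base, ds_depth_1=1)
```

Here `fixed` carries the same weights as `base` plus the two extra config entries, while `base.config` is unchanged.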

Mar 25 '24 17:03 yiyixuxu

Hey, I'll work on this.

Mar 30 '24 22:03 sajadn

hey @sajadn are you working on this?

Apr 02 '24 06:04 satani99

I'm planning to open the PR this weekend. If you think you can finish it up sooner, feel free to go ahead.

Apr 02 '24 15:04 sajadn

I need 1 or 2 days more, sorry about the delay.

Apr 08 '24 14:04 sajadn

Hi, I reproduced the results on the SD model (I don't have enough GPU memory for SDXL). Here are some examples at a resolution of 1000x1600:

prompt: "a dog sitting on the couch"

without the fix: a_dog_sitting_on_the_couch_fix=False

with the fix: a_dog_sitting_on_the_couch_fix=True

prompt: "a pig sitting behind the desk"

without the fix: a_pig_sitting_behind_the_desk_fix=False

with the fix: a_pig_sitting_behind_the_desk_fix=True

prompt: "a cyclist front view near the finish line"

without the fix: a_cyclist_front_view_near_the_finish_line_fix=False

with the fix: a_cyclist_front_view_near_the_finish_line_fix=True

As mentioned earlier in this thread, this is only an issue for high-resolution images. For comparison, here are 512x512 versions of the images for the same prompts:

a_dog_sitting_on_the_couch_fix=False

a_pig_sitting_behind_the_desk_fix=False

a_cyclist_front_view_near_the_finish_line_fix=True

You can find the code in my fork

and here is the code snippet to generate images:

from diffusers.pipelines.stable_diffusion import StableDiffusionHighResFixPipeline
import torch

# toggle this to compare results with and without the fix
with_high_res_fix = False
pipe = StableDiffusionHighResFixPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
    high_res_fix=[{'timestep': 600, 'scale_factor': 0.5, 'block_num': 1}] if with_high_res_fix else None,
)
pipe.to("cuda")

# the generator is a sampling argument, so it goes to the pipeline call
# rather than to from_pretrained
generator = torch.manual_seed(42)
prompt = "a dog sitting on the couch"
image = pipe(
    prompt=prompt,
    height=1000,
    width=1600,
    num_inference_steps=50,
    generator=generator,
).images[0]

image.save(f"{prompt.replace(' ', '_')}_fix={with_high_res_fix}.png")

I have to work on something else for the rest of today. I'll clean the code and open up the PR tomorrow.

Apr 09 '24 19:04 sajadn

Does this work with the inpainting version?

Apr 11 '24 08:04 crapthings

does this work for SDXL and SDXL ControlNet pipelines?

Jun 02 '24 20:06 neuron-party

The PR for this issue only adds a community pipeline for SD, not for SDXL or for the ControlNet variants of any model.

If you're asking if this fix works with SDXL, it should.

It should also work with ControlNet, but I don't see the point: ControlNet already tells the model what to do, so there is no need to fix the usual issues with high-resolution images.

Jun 02 '24 20:06 asomoza

I tested it and yes, well, kind of:

without fix	with fix

Probably needs a different set of params to work better.

Jun 02 '24 21:06 asomoza

@sajadn am I doing something wrong? When I set with_high_res_fix = True, I get KeyError: 'block_num'. When I set with_high_res_fix = False, it works just fine.

Jun 03 '24 14:06 Depfek6

@asomoza what is this? SDXL text-to-image?

Jun 03 '24 17:06 yiyixuxu

Yeah, I did a test with it by adapting the community pipeline to SDXL. I've been reading about it, and SDXL does in fact need a different configuration. ComfyUI has a core node now, but it's more complex than this one.

I don't think it will be used a lot for SDXL: the 2048x2048 image I generated used the full 24GB of VRAM.

Jun 03 '24 17:06 asomoza

@Depfek6 I guess you're using the code snippet from earlier in this thread. In the final commit I renamed layer_num to block_num, so doing the same renaming in your code should fix it. I'll edit the code snippet above to avoid future confusion.
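
For anyone who still has configs written with the old key, a small helper along these lines could do the renaming before the list is passed as high_res_fix. This is only a sketch: `normalize_high_res_fix` is a hypothetical name and is not part of the pipeline.

```python
from typing import Any, Dict, List


def normalize_high_res_fix(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the entries with the old 'layer_num' key renamed to 'block_num'."""
    normalized = []
    for entry in entries:
        fixed = dict(entry)  # copy so the caller's dict is not modified
        if "layer_num" in fixed and "block_num" not in fixed:
            fixed["block_num"] = fixed.pop("layer_num")
        normalized.append(fixed)
    return normalized


old_style = [{"timestep": 600, "scale_factor": 0.5, "layer_num": 1}]
new_style = normalize_high_res_fix(old_style)
```

Entries that already use block_num pass through unchanged.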

Jun 03 '24 18:06 sajadn

could you share a code snippet of hires fix with sdxl? i couldn't get it working

Jun 06 '24 22:06 neuron-party

Sure but I just did a quick patch to make it work, let me know if you find a good set of params or if you find it useful with SDXL and maybe we can add it as a community pipeline.

You need to install from source or download the examples/community/kohya_hires_fix directory.

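from typing import Optional

from transformers import (
    CLIPImageProcessor,
    CLIPTextModel,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
    CLIPVisionModelWithProjection,
)

from diffusers import AutoencoderKL, StableDiffusionXLPipeline, UNet2DConditionModel
from diffusers.image_processor import VaeImageProcessor
from diffusers.schedulers import KarrasDiffusionSchedulers
from examples.community.kohya_hires_fix import UNet2DConditionModelHighResFix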


class StableDiffusionXLHighResFixPipeline(StableDiffusionXLPipeline):
    def __init__(
        self,
        vae: AutoencoderKL,
        text_encoder: CLIPTextModel,
        text_encoder_2: CLIPTextModelWithProjection,
        tokenizer: CLIPTokenizer,
        tokenizer_2: CLIPTokenizer,
        unet: UNet2DConditionModel,
        scheduler: KarrasDiffusionSchedulers,
        image_encoder: CLIPVisionModelWithProjection = None,
        feature_extractor: CLIPImageProcessor = None,
        force_zeros_for_empty_prompt: bool = True,
        add_watermarker: Optional[bool] = None,
    ):
        super().__init__(
            vae=vae,
            text_encoder=text_encoder,
            text_encoder_2=text_encoder_2,
            tokenizer=tokenizer,
            tokenizer_2=tokenizer_2,
            unet=unet,
            scheduler=scheduler,
            image_encoder=image_encoder,
            feature_extractor=feature_extractor,
            force_zeros_for_empty_prompt=force_zeros_for_empty_prompt,
            add_watermarker=add_watermarker,
        )

        unet = UNet2DConditionModelHighResFix.from_unet(
            unet=unet, high_res_fix=[{"timestep": 600, "scale_factor": 0.5, "block_num": 1}]
        )

        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            text_encoder_2=text_encoder_2,
            tokenizer=tokenizer,
            tokenizer_2=tokenizer_2,
            unet=unet,
            scheduler=scheduler,
            image_encoder=image_encoder,
            feature_extractor=feature_extractor,
        )
        self.register_to_config(force_zeros_for_empty_prompt=force_zeros_for_empty_prompt)
        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)

        self.default_sample_size = self.unet.config.sample_size

And then you can just use the StableDiffusionXLHighResFixPipeline instead of the regular one.

Jun 07 '24 02:06 asomoza

Thanks! I just tested it out with a few different sets of params but couldn't get any significant results.

Jun 08 '24 02:06 neuron-party