diffusers
Kohya Hires fix
Is it possible for diffusers to support this hires fix?
It looks like it works with SD 1.5 too.
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13974
https://www.youtube.com/watch?v=SbgMwHDXthU
same seed at 1024x1024
This method also improves the quality of inpainting without needing a second pass.
without
with
I think the improvements are quite amazing TBH. Cc'ing @kohya-ss in case you would like to elaborate a bit on the motivation behind the fix.
@yiyixuxu WDYT about shipping this?
this one is 1280 x 1280 with Kohya Hires.fix
Gently pinging @kohya-ss since you reacted to my comment. If you find a moment, would love some sort of a brief explanation about the feature. Would be really helpful!
Hi!
This method is extremely simple.
The distortion of composition at large resolutions can be attributed to the U-Net's limited capacity to handle feature maps of such sizes effectively.
In Stable Diffusion, it has been observed that the composition is primarily determined in the deeper parts of the U-Net, as evident from the effects of block merging explored by the community.
Furthermore, the previews of the denoising steps reveal that the composition is also heavily influenced by the timesteps closer to the noise (i.e., the larger timesteps) during the denoising process.
Based on these observations, this method proposes resizing the feature maps to a smaller size specifically at the timesteps close to the noise in the deeper parts of the U-Net. This allows the U-Net to process the feature maps more effectively, thereby mitigating the distortion of the composition.
Please refer to the following link for the specific implementation code. In the input blocks of the U-Net at certain depths, the hidden states are downsampled, and in the corresponding depths of the output blocks, they are upsampled.
https://github.com/kohya-ss/sd-scripts/blob/2d7389185c021bc527b414563c245c5489d6328a/library/sdxl_original_unet.py#L1192
In this implementation, two depths (ds_depth_1 and ds_depth_2) can be specified to apply the method, which remain effective up to the timesteps ds_timesteps_1 and ds_timesteps_2, respectively. ds_ratio represents the shrinking ratio.
Additionally, laksjdjf has created a ComfyUI node for this method, which can be found at the following link: https://gist.github.com/laksjdjf/487a28ceda7f0853094933d2e138e3c6
Furthermore, although I have not personally used it, ComfyUI seems to officially support this method. Their node allows users to specify arbitrary depths, timesteps, and shrinking ratios, and it also appears to support cascading multiple instances of the node.
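To make the shrink/restore idea above concrete, here is a minimal sketch that only tracks feature-map sizes through a toy U-Net. It is not the sd-scripts code: the function name trace_feature_sizes, the three-block layout, and the single depth/timestep pair (ds_depth, ds_timestep) are assumptions made for illustration, and a real implementation would interpolate tensors rather than tuples of sizes.

```python
def trace_feature_sizes(size, timestep, ds_depth=None, ds_timestep=0,
                        ds_ratio=0.5, num_blocks=3):
    """Follow the spatial size of the hidden states through a toy U-Net.

    Hypothetical helper: returns (bottleneck_size, output_size).
    """
    h, w = size
    skips = []
    # Input (down) path.
    for depth in range(num_blocks):
        skips.append((h, w))  # size kept for the skip connection at this depth
        shrink = (ds_depth is not None and depth == ds_depth
                  and timestep >= ds_timestep)
        if shrink:
            # Extra downsampling, only at the chosen depth and only while
            # the timestep is still close to pure noise.
            h, w = int(h * ds_ratio), int(w * ds_ratio)
        h, w = h // 2, w // 2  # the block's regular downsampling
    bottleneck = (h, w)
    # Output (up) path.
    for _ in range(num_blocks):
        h, w = h * 2, w * 2  # the block's regular upsampling
        skip = skips.pop()
        if (h, w) != skip:
            h, w = skip  # resize back so it matches the skip connection
    return bottleneck, (h, w)


latent = (128, 128)  # a 1024x1024 image has a 128x128 latent in Stable Diffusion
print(trace_feature_sizes(latent, timestep=900))
print(trace_feature_sizes(latent, timestep=900, ds_depth=1, ds_timestep=600))
print(trace_feature_sizes(latent, timestep=300, ds_depth=1, ds_timestep=600))
```

In this toy setup the deepest feature map is 8x8 instead of 16x16 while the timestep is at or above the threshold, and the output size is unchanged in all three cases.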
Thanks much for your detailed explanation. Really appreciate it.
Looks very interesting. Could this theoretically be used in the opposite way, to generate smaller images? When trying to generate small images with models that have been trained with high-res images, we often see the same kind of problems -- weird compositions, low quality. So, could the feature maps be resized to a larger size? So that it becomes possible to generate low-res images that have the same composition as high-res ones?
looking forward to this feature!
can someone make a community pipeline first for this? basically, you need to:
- copy over the UNet2DConditionModel, rename it, and slightly modify the forward method to implement the Kohya hires fix
- add a from_unet method, which is basically
@classmethod
def from_unet(
    cls,
    unet: UNet2DConditionModel,
    ds_depth_1,
    ds_depth_2,
    # ...
):
    # Copy the config so the original UNet's config is left untouched.
    config = dict(unet.config)
    config["ds_depth_1"] = ds_depth_1
    config["ds_depth_2"] = ds_depth_2
    kohya_unet = cls.from_config(config)
    # Reuse the weights of the original UNet.
    kohya_unet.load_state_dict(unet.state_dict())
    return kohya_unet
- copy over the SD or SDXL pipeline, rename it, and slightly update the __init__ method
def __init__():
    ...
    unet = kohyaHiresFixUnet2DConditionalModel.from_unet(unet, **kwargs)
    self.register_modules(
        ...
        unet=unet,
    )
and then please play around with it, figure out how to configure it for SD and SDXL, and show us some results! Once we have the community pipeline that we can play with, depending on how good the results are and how users react to it, we can then decide if we want to integrate this fix into our Unet or not (should be pretty straightforward)
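For anyone who wants to try the from_unet pattern without loading a real model, here is a self-contained sketch. ToyUNet and ToyHiresFixUNet are hypothetical stand-ins, not diffusers classes; they only mimic the config / from_config / state_dict / load_state_dict surface that the steps above rely on.

```python
class ToyUNet:
    """Hypothetical stand-in for a config-driven model (not the diffusers class)."""

    def __init__(self, channels=4, **extra_config):
        self.config = {"channels": channels, **extra_config}
        self.weights = {"conv_in.weight": [0.0] * channels}

    @classmethod
    def from_config(cls, config):
        return cls(**config)

    def state_dict(self):
        return dict(self.weights)

    def load_state_dict(self, state_dict):
        self.weights = dict(state_dict)


class ToyHiresFixUNet(ToyUNet):
    """Same weights as ToyUNet, plus the hires-fix settings stored in its config."""

    @classmethod
    def from_unet(cls, unet, ds_depth_1, ds_depth_2):
        config = dict(unet.config)  # copy, so the source model stays untouched
        config["ds_depth_1"] = ds_depth_1
        config["ds_depth_2"] = ds_depth_2
        new_unet = cls.from_config(config)
        new_unet.load_state_dict(unet.state_dict())  # reuse the trained weights
        return new_unet


base = ToyUNet(channels=4)
base.weights["conv_in.weight"] = [1.0, 2.0, 3.0, 4.0]
fixed = ToyHiresFixUNet.from_unet(base, ds_depth_1=3, ds_depth_2=None)
print(fixed.config)
print(fixed.state_dict() == base.state_dict())
```

The point of the pattern is that no retraining or weight conversion is involved: the new class differs only in its forward logic and a few extra config entries.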
Hey, I'll work on this.
hey @sajadn are you working on this?
I'm planning to open the PR this weekend. If you think you can finish it up sooner, feel free to go ahead.
I need 1 or 2 days more, sorry about the delay.
Hi, I reproduced the results with the SD model (I don't have enough GPU memory for SDXL). Here are some examples at a resolution of 1000x1600:
prompt: "a dog sitting on the couch"
without fix:
with the fix:
prompt: "a pig sitting behind the desk"
without fix:
with the fix:
prompt: "a cyclist front view near the finish line"
without the fix:
with the fix:
As mentioned earlier in this thread, this is only an issue for high-resolution images. For comparison, here are 512x512 versions of the images for the same prompts:
You can find the code in my fork
and here is the code snippet to generate images:
import torch
from diffusers.pipelines.stable_diffusion import StableDiffusionHighResFixPipeline

generator = torch.manual_seed(42)
with_high_res_fix = False

pipe = StableDiffusionHighResFixPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
    high_res_fix=[{'timestep': 600, 'scale_factor': 0.5, 'block_num': 1}] if with_high_res_fix else None,
)
pipe.to("cuda")

prompt = "a dog sitting on the couch"
# The seeded generator belongs in the pipeline call, not in from_pretrained.
image = pipe(
    prompt=prompt,
    height=1000,
    width=1600,
    num_inference_steps=50,
    generator=generator,
).images[0]
image.save(f"{prompt.replace(' ', '_')}_fix={with_high_res_fix}.png")
I have to work on something else for the rest of today. I'll clean the code and open up the PR tomorrow.
Does this work with the inpainting version?
Does this work for the SDXL and SDXL ControlNet pipelines?
The PR for this issue only adds a community pipeline for SD, not for SDXL or any of the ControlNet variants of any model.
If you're asking whether this fix works with SDXL, it should.
It should also work with ControlNet, but I don't see the point, since ControlNet already tells the model what to do, so the usual issues with high-resolution images don't need fixing.
I tested it and yes, well, kind of:
| without fix | with fix |
|---|---|
Probably needs a different set of params to work better.
@sajadn am I doing something wrong? When I set with_high_res_fix = True, I get KeyError: 'block_num', but when I set with_high_res_fix = False, it works just fine.
@asomoza what is this? SDXL text-to-image?
Yeah, I did a test with it; I just adapted the community pipeline to SDXL. I've been reading about it and, in fact, SDXL needs a different configuration. ComfyUI has a core node now, but it's more complex than this one.
I don't think it will be used a lot for SDXL: the 2048x2048 image I generated used the full 24GB of VRAM.
@Depfek6 I guess you're using the code snippet from the comment above. In the final commit I renamed layer_num to block_num, so I think doing the same renaming will fix it. I'll edit the code snippet above to avoid future confusion.
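If you have older configs lying around, a small helper like the one below can do the renaming for you. This is only a sketch: normalize_high_res_fix is a hypothetical name and is not part of the pipeline; it assumes high_res_fix is a list of dicts shaped like the ones in the snippet above.

```python
def normalize_high_res_fix(entries):
    """Return copies of the entries with the old 'layer_num' key renamed to 'block_num'."""
    normalized = []
    for entry in entries:
        entry = dict(entry)  # copy, so the caller's dicts are not modified
        if "layer_num" in entry and "block_num" not in entry:
            entry["block_num"] = entry.pop("layer_num")
        normalized.append(entry)
    return normalized


old_style = [{"timestep": 600, "scale_factor": 0.5, "layer_num": 1}]
print(normalize_high_res_fix(old_style))
```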
Could you share a code snippet of the hires fix with SDXL? I couldn't get it working.
Sure, but I just did a quick patch to make it work. Let me know if you find a good set of params or if you find it useful with SDXL, and maybe we can add it as a community pipeline.
You need to install from source or download the examples/community/kohya_hires_fix directory.
from typing import Optional

from diffusers import AutoencoderKL, StableDiffusionXLPipeline, UNet2DConditionModel
from diffusers.image_processor import VaeImageProcessor
from diffusers.schedulers import KarrasDiffusionSchedulers
from transformers import (
    CLIPImageProcessor,
    CLIPTextModel,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
    CLIPVisionModelWithProjection,
)

from examples.community.kohya_hires_fix import UNet2DConditionModelHighResFix


class StableDiffusionXLHighResFixPipeline(StableDiffusionXLPipeline):
    def __init__(
        self,
        vae: AutoencoderKL,
        text_encoder: CLIPTextModel,
        text_encoder_2: CLIPTextModelWithProjection,
        tokenizer: CLIPTokenizer,
        tokenizer_2: CLIPTokenizer,
        unet: UNet2DConditionModel,
        scheduler: KarrasDiffusionSchedulers,
        image_encoder: CLIPVisionModelWithProjection = None,
        feature_extractor: CLIPImageProcessor = None,
        force_zeros_for_empty_prompt: bool = True,
        add_watermarker: Optional[bool] = None,
    ):
        super().__init__(
            vae=vae,
            text_encoder=text_encoder,
            text_encoder_2=text_encoder_2,
            tokenizer=tokenizer,
            tokenizer_2=tokenizer_2,
            unet=unet,
            scheduler=scheduler,
            image_encoder=image_encoder,
            feature_extractor=feature_extractor,
            force_zeros_for_empty_prompt=force_zeros_for_empty_prompt,
            add_watermarker=add_watermarker,
        )

        # Swap the regular UNet for the hires-fix variant, reusing its weights.
        unet = UNet2DConditionModelHighResFix.from_unet(
            unet=unet, high_res_fix=[{"timestep": 600, "scale_factor": 0.5, "block_num": 1}]
        )

        # Register the modules again so the pipeline uses the patched UNet.
        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            text_encoder_2=text_encoder_2,
            tokenizer=tokenizer,
            tokenizer_2=tokenizer_2,
            unet=unet,
            scheduler=scheduler,
            image_encoder=image_encoder,
            feature_extractor=feature_extractor,
        )
        self.register_to_config(force_zeros_for_empty_prompt=force_zeros_for_empty_prompt)
        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
        self.default_sample_size = self.unet.config.sample_size
And then you can just use the StableDiffusionXLHighResFixPipeline instead of the regular one.
Thanks! I just tested it with a few different sets of params but couldn't get any significant results.








