diffusers: `StableDiffusionXLInstructPix2PixPipeline` doesn't work with cosxl_edit

### Describe the bug

CosXL Edit is an InstructPix2Pix model (https://huggingface.co/stabilityai/cosxl) released together with CosXL. However, loading it with `StableDiffusionXLInstructPix2PixPipeline.from_single_file` fails with a size mismatch error on `conv_in.weight`.

### Reproduction

import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "cosxl_edit.safetensors"
)

### Logs

tokenizer_config.json: 100% 905/905 [00:00<00:00, 13.7kB/s]
vocab.json: 100% 961k/961k [00:00<00:00, 10.2MB/s]
merges.txt: 100% 525k/525k [00:00<00:00, 17.4MB/s]
special_tokens_map.json: 100% 389/389 [00:00<00:00, 20.1kB/s]
tokenizer.json: 100% 2.22M/2.22M [00:00<00:00, 16.0MB/s]
config.json: 100% 4.52k/4.52k [00:00<00:00, 250kB/s]
tokenizer_config.json: 100% 904/904 [00:00<00:00, 50.1kB/s]
vocab.json: 100% 862k/862k [00:00<00:00, 34.1MB/s]
merges.txt: 100% 525k/525k [00:00<00:00, 22.2MB/s]
special_tokens_map.json: 100% 389/389 [00:00<00:00, 21.6kB/s]
tokenizer.json: 100% 2.22M/2.22M [00:00<00:00, 16.5MB/s]
config.json: 100% 4.88k/4.88k [00:00<00:00, 253kB/s]
Some weights of the model checkpoint were not used when initializing CLIPTextModelWithProjection: 
 ['text_model.embeddings.position_ids']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-01c040bbaf7e> in <cell line: 5>()
      3 from diffusers.utils import load_image
      4 
----> 5 pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
      6     file, torch_dtype=torch.float16
      7 )

4 frames
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
    116             kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
    117 
--> 118         return fn(*args, **kwargs)
    119 
    120     return _inner_fn  # type: ignore

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file.py in from_single_file(cls, pretrained_model_link_or_path, **kwargs)
    287                 init_kwargs[name] = passed_class_obj[name]
    288             else:
--> 289                 components = build_sub_model_components(
    290                     init_kwargs,
    291                     class_name,

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file.py in build_sub_model_components(pipeline_components, pipeline_class_name, component_name, original_config, checkpoint, local_files_only, load_safety_checker, model_type, image_size, torch_dtype, **kwargs)
     59         upcast_attention = kwargs.pop("upcast_attention", None)
     60 
---> 61         unet_components = create_diffusers_unet_model_from_ldm(
     62             pipeline_class_name,
     63             original_config,

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file_utils.py in create_diffusers_unet_model_from_ldm(pipeline_class_name, original_config, checkpoint, num_in_channels, upcast_attention, extract_ema, image_size, torch_dtype, model_type)
   1320         from ..models.modeling_utils import load_model_dict_into_meta
   1321 
-> 1322         unexpected_keys = load_model_dict_into_meta(unet, diffusers_format_unet_checkpoint, dtype=torch_dtype)
   1323         if unet._keys_to_ignore_on_load_unexpected is not None:
   1324             for pat in unet._keys_to_ignore_on_load_unexpected:

/usr/local/lib/python3.10/dist-packages/diffusers/models/modeling_utils.py in load_model_dict_into_meta(model, state_dict, device, dtype, model_name_or_path)
    150         if empty_state_dict[param_name].shape != param.shape:
    151             model_name_or_path_str = f"{model_name_or_path} " if model_name_or_path is not None else ""
--> 152             raise ValueError(
    153                 f"Cannot load {model_name_or_path_str}because {param_name} expected shape {empty_state_dict[param_name]}, but got {param.shape}. If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example."
    154             )

ValueError: Cannot load because conv_in.weight expected shape tensor(..., device='meta', size=(320, 4, 3, 3)), but got torch.Size([320, 8, 3, 3]). If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example.



### System Info

diffusers==0.27.2

### Who can help?

@sayakpaul , @yiyixuxu

Apr 09 '24 18:04 apolinario

You should be able to load the checkpoint by passing `num_in_channels=8`:

import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "https://huggingface.co/stabilityai/cosxl/blob/main/cosxl.safetensors", num_in_channels=8,
)
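
For context on where the 8 comes from: an InstructPix2Pix-style UNet receives the 4 noise latent channels concatenated with 4 image latent channels, so the checkpoint's `conv_in.weight` has shape (320, 8, 3, 3) instead of the (320, 4, 3, 3) the default config expects, which matches the error in the logs. A minimal sketch of reading the value off a weight shape follows; `infer_num_in_channels` is a hypothetical helper written for illustration, not a diffusers function.

```python
def infer_num_in_channels(conv_in_weight_shape):
    """Return the input-channel count of a conv weight shaped (out, in, kH, kW)."""
    if len(conv_in_weight_shape) != 4:
        raise ValueError(
            f"expected a 4-D conv weight shape, got {conv_in_weight_shape}"
        )
    return conv_in_weight_shape[1]


# Shapes copied from the ValueError in the logs.
expected_shape = (320, 4, 3, 3)    # what the default UNet config builds
checkpoint_shape = (320, 8, 3, 3)  # what cosxl_edit.safetensors contains

num_in_channels = infer_num_in_channels(checkpoint_shape)
print(num_in_channels)
```

With a real checkpoint, the shape would come from the `conv_in` weight tensor in the state dict rather than from a hard-coded tuple.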

Apr 09 '24 18:04 yiyixuxu

cc @DN6: let's make sure to support SDXL InstructPix2Pix out of the box in https://github.com/huggingface/diffusers/pull/7496

We should support every model listed here: https://github.com/comfyanonymous/ComfyUI/blob/4201181b35402e0a992b861f8d2f0e0b267f52fa/comfy/supported_models.py#L479

Apr 09 '24 18:04 yiyixuxu

This worked with `num_in_channels=8` (as in: it didn't error). However, perceptually the output isn't behaving as it should.

Edit image:

Edit prompt "Turn sky into a cloudy one":

import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline, EDMEulerScheduler
from diffusers.utils import load_image

inst_file = "cosxl_edit.safetensors"

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    inst_file, num_in_channels=8,
).to("cuda")

pipe.scheduler = EDMEulerScheduler(sigma_min=0.002, sigma_max=120.0, sigma_data=1.0, prediction_type="v_prediction")

resolution = 1024
image = load_image(
    "https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png"
).resize((resolution, resolution))

edit_instruction = "Turn sky into a cloudy one"
edited_image = pipe(
    prompt=edit_instruction,
    image=image,
    height=resolution,
    width=resolution,
    #guidance_scale=3.0,
    #image_guidance_scale=1.5,
    num_inference_steps=20,
).images[0]

Apr 09 '24 22:04 apolinario

Not sure if it's the exact guidance formulation that we have in the InstructPix2Pix pipeline though. That would matter a lot.
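
For reference, the guidance in the diffusers InstructPix2Pix pipelines combines three noise predictions (text-and-image conditioned, image-only, and unconditional) using two scales. The sketch below reproduces that combination on plain floats standing in for tensors. Whether CosXL Edit expects exactly this formulation is the open question here, and the function name is illustrative only.

```python
def instruct_pix2pix_guidance(
    noise_text, noise_image, noise_uncond, guidance_scale, image_guidance_scale
):
    """Combine the three predictions the way the InstructPix2Pix pipelines do."""
    return (
        noise_uncond
        + guidance_scale * (noise_text - noise_image)
        + image_guidance_scale * (noise_image - noise_uncond)
    )


# Toy values: with both scales at 1.0 the result collapses to the
# fully conditioned prediction.
print(instruct_pix2pix_guidance(3.0, 2.0, 1.0, 1.0, 1.0))
```

Raising `guidance_scale` pushes the result toward the text instruction, while raising `image_guidance_scale` pushes it toward the input image.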

If it's possible, could you try to initialize the StableDiffusionXLInstructPix2PixPipeline with each component initialized separately?

unet = ...
text_encoder = ...
text_encoder_2 = ...
vae = ...
scheduler = ...

pipeline = ...

Apr 10 '24 06:04 sayakpaul

> Not sure if it's the exact guidance formulation that we have in the InstructPix2Pix pipeline though. That would matter a lot.

ComfyUI uses the same InstructPix2PixConditioning node for it that it uses for InstructPix2Pix itself. This commit is how Comfy added support for the CosXL models: https://github.com/comfyanonymous/ComfyUI/commit/1088d1850f9b13233e3cf4460ee077b15e4f712f. Once that was in, the nodes needed to run it seem similar to vanilla InstructPix2Pix.

These are the nodes in the official ComfyUI edit workflow:

> If it's possible, could you try to initialize the StableDiffusionXLInstructPix2PixPipeline with each component initialized separately?

As I'm using from_single_file, I don't think that's possible: afaik the model classes (UNet2DConditionModel etc.) don't have that method. How do you think that would help with debugging/making it work?

Apr 10 '24 06:04 apolinario

@apolinario

You just have to scale the image_latents, by adding this to the pipeline:

        # 6. Prepare Image latents
        image_latents = self.prepare_image_latents(
            image,
            batch_size,
            num_images_per_prompt,
            prompt_embeds.dtype,
            device,
            do_classifier_free_guidance,
        )
        # Scale the encoded image latents (not the noise latents) by the VAE scaling factor
        image_latents = image_latents * self.vae.config.scaling_factor
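
To make the intended change concrete: the VAE-encoded image latents are multiplied by the VAE's `scaling_factor` before being concatenated with the noise latents. Below is a toy sketch on nested lists instead of tensors. The value 0.13025 is assumed to be the SDXL VAE's scaling factor and the helper name is hypothetical; in the pipeline the value should be read from `self.vae.config.scaling_factor`.

```python
def scale_latents(latents, scaling_factor):
    """Multiply every value of a (possibly nested) list of floats by scaling_factor."""
    if isinstance(latents, list):
        return [scale_latents(item, scaling_factor) for item in latents]
    return latents * scaling_factor


# Assumed value for illustration; read vae.config.scaling_factor in practice.
SDXL_VAE_SCALING_FACTOR = 0.13025

image_latents = [[1.0, -2.0], [0.5, 4.0]]
scaled = scale_latents(image_latents, SDXL_VAE_SCALING_FACTOR)
print(scaled)
```

Without this step the image conditioning reaches the UNet at a much larger magnitude than the noise latents it is concatenated with, which would be consistent with the off-looking outputs reported above.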


Apr 10 '24 09:04 yiyixuxu

Nice finding. However, the SD Pix2Pix doesn't have it :o

Apr 10 '24 09:04 sayakpaul

Awesome! What's the best way to proceed here? Modify the pipeline to detect whether scaling is needed, or create a new one?

Apr 10 '24 12:04 apolinario

I think the following could work:

1. After introducing the sigma scheduling changes to the EDM schedulers (as discussed internally with Suraj), we serialise the pipeline in the diffusers format. This gives us the scheduler with all the right configurations.
2. In the pipeline code, we check whether the scheduler is an EDM type and, if so, we scale the latents.
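
The scheduler check in this proposal could look roughly like the sketch below. The two classes are empty stand-ins defined here so the snippet runs without diffusers, and `should_scale_image_latents` is a hypothetical helper; in the real pipeline the check would be made against `self.scheduler`.

```python
class EulerDiscreteScheduler:
    """Stand-in for a non-EDM scheduler."""


class EDMEulerScheduler:
    """Stand-in for an EDM-type scheduler."""


# Class names treated as EDM-type; the set is an assumption for illustration.
EDM_SCHEDULER_NAMES = {"EDMEulerScheduler", "EDMDPMSolverMultistepScheduler"}


def should_scale_image_latents(scheduler):
    """Return True when the attached scheduler is one of the EDM types."""
    return type(scheduler).__name__ in EDM_SCHEDULER_NAMES


print(should_scale_image_latents(EDMEulerScheduler()))
print(should_scale_image_latents(EulerDiscreteScheduler()))
```

This ties the scaling decision to whichever scheduler happens to be attached, so swapping schedulers on a loaded pipeline would silently change how the image latents are prepared.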

WDYT? @yiyixuxu would love your thoughts too.

Apr 10 '24 12:04 sayakpaul

I think we should modify the pipeline to detect if scaling is needed

Based on my understanding, how we scale the latents does not depend on the scheduler type; it is specific to how this model was trained. For example, in most of our pipelines the image_latents are scaled regardless of which scheduler you use: https://github.com/huggingface/diffusers/blob/8e14535708f6af0794148150f5c073c4723dbbae/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py#L946

so I think we should add a pipeline config e.g. something like is_cosxl, that the user can pass to from_single_file()
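
A rough sketch of what such a pipeline-level flag might look like. `ToyPipeline` and its method are hypothetical stand-ins rather than the diffusers classes, and `is_cosxl` is only the placeholder name suggested in this comment.

```python
class ToyPipeline:
    """Stand-in pipeline that stores a model-specific flag in its own config."""

    def __init__(self, vae_scaling_factor, is_cosxl=False):
        self.vae_scaling_factor = vae_scaling_factor
        self.is_cosxl = is_cosxl

    def prepare_image_latents(self, encoded_image):
        # Scale only for checkpoints trained on scaled image latents.
        if self.is_cosxl:
            return [value * self.vae_scaling_factor for value in encoded_image]
        return list(encoded_image)


encoded = [2.0, 4.0]
print(ToyPipeline(0.5, is_cosxl=True).prepare_image_latents(encoded))
print(ToyPipeline(0.5).prepare_image_latents(encoded))
```

Because the flag lives on the pipeline rather than on a model or scheduler, the scaling behavior stays the same no matter which scheduler is attached.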

cc @DN6 here

Apr 11 '24 19:04 yiyixuxu

> so I think we should add a pipeline config e.g. something like is_cosxl, that the user can pass to from_single_file(), with this flag, we can map it to the correct scheduler config too in from_single_file

If we introduce that only for from_single_file(), won't that introduce a discrepancy between from_pretrained() and from_single_file() methods of InstructPix2Pix then? I thought we were trying to reduce these kinds of discrepancies with Dhruv's refactor.

Apr 12 '24 03:04 sayakpaul

If the argument is added to the pipeline and is only a pipeline argument then that wouldn't be a discrepancy. What we want is to avoid configuring models via pipeline invocations

Apr 12 '24 04:04 DN6

> What we want is to avoid configuring models via pipeline invocations

Like this?

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "https://huggingface.co/stabilityai/cosxl/blob/main/cosxl.safetensors", num_in_channels=8,
)

Apr 12 '24 04:04 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

May 10 '24 15:05 github-actions[bot]
