# `StableDiffusionXLInstructPix2PixPipeline` doesn't work with cosxl_edit
### Describe the bug

CosXL Edit is an InstructPix2Pix-style model (https://huggingface.co/stabilityai/cosxl) released together with CosXL; however, trying to load it gives a size-mismatch error.
### Reproduction

```python
import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "cosxl_edit.safetensors"
)
```
### Logs

```
tokenizer_config.json: 100% 905/905 [00:00<00:00, 13.7kB/s]
vocab.json: 100% 961k/961k [00:00<00:00, 10.2MB/s]
merges.txt: 100% 525k/525k [00:00<00:00, 17.4MB/s]
special_tokens_map.json: 100% 389/389 [00:00<00:00, 20.1kB/s]
tokenizer.json: 100% 2.22M/2.22M [00:00<00:00, 16.0MB/s]
config.json: 100% 4.52k/4.52k [00:00<00:00, 250kB/s]
tokenizer_config.json: 100% 904/904 [00:00<00:00, 50.1kB/s]
vocab.json: 100% 862k/862k [00:00<00:00, 34.1MB/s]
merges.txt: 100% 525k/525k [00:00<00:00, 22.2MB/s]
special_tokens_map.json: 100% 389/389 [00:00<00:00, 21.6kB/s]
tokenizer.json: 100% 2.22M/2.22M [00:00<00:00, 16.5MB/s]
config.json: 100% 4.88k/4.88k [00:00<00:00, 253kB/s]

Some weights of the model checkpoint were not used when initializing CLIPTextModelWithProjection:
['text_model.embeddings.position_ids']

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-01c040bbaf7e> in <cell line: 5>()
      3 from diffusers.utils import load_image
      4
----> 5 pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
      6     file, torch_dtype=torch.float16
      7 )

4 frames
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
    116         kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
    117
--> 118         return fn(*args, **kwargs)
    119
    120     return _inner_fn  # type: ignore

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file.py in from_single_file(cls, pretrained_model_link_or_path, **kwargs)
    287                 init_kwargs[name] = passed_class_obj[name]
    288             else:
--> 289                 components = build_sub_model_components(
    290                     init_kwargs,
    291                     class_name,

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file.py in build_sub_model_components(pipeline_components, pipeline_class_name, component_name, original_config, checkpoint, local_files_only, load_safety_checker, model_type, image_size, torch_dtype, **kwargs)
     59     upcast_attention = kwargs.pop("upcast_attention", None)
     60
---> 61     unet_components = create_diffusers_unet_model_from_ldm(
     62         pipeline_class_name,
     63         original_config,

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file_utils.py in create_diffusers_unet_model_from_ldm(pipeline_class_name, original_config, checkpoint, num_in_channels, upcast_attention, extract_ema, image_size, torch_dtype, model_type)
   1320     from ..models.modeling_utils import load_model_dict_into_meta
   1321
-> 1322     unexpected_keys = load_model_dict_into_meta(unet, diffusers_format_unet_checkpoint, dtype=torch_dtype)
   1323     if unet._keys_to_ignore_on_load_unexpected is not None:
   1324         for pat in unet._keys_to_ignore_on_load_unexpected:

/usr/local/lib/python3.10/dist-packages/diffusers/models/modeling_utils.py in load_model_dict_into_meta(model, state_dict, device, dtype, model_name_or_path)
    150         if empty_state_dict[param_name].shape != param.shape:
    151             model_name_or_path_str = f"{model_name_or_path} " if model_name_or_path is not None else ""
--> 152             raise ValueError(
    153                 f"Cannot load {model_name_or_path_str}because {param_name} expected shape {empty_state_dict[param_name]}, but got {param.shape}. If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example."
    154             )

ValueError: Cannot load because conv_in.weight expected shape tensor(..., device='meta', size=(320, 4, 3, 3)), but got torch.Size([320, 8, 3, 3]). If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example.
```
### System Info
diffusers==0.27.2
### Who can help?
@sayakpaul , @yiyixuxu
You should be able to get the checkpoint in by passing `num_in_channels=8` (the edit UNet's `conv_in` takes 8 input channels, 4 for the noise latents plus 4 for the image latents, which is why the default 4-channel UNet config hits the shape mismatch above):

```python
import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "https://huggingface.co/stabilityai/cosxl/blob/main/cosxl.safetensors", num_in_channels=8,
)
```
cc @DN6 here, let's make sure to support SDXL InstructPix2Pix out of the box in https://github.com/huggingface/diffusers/pull/7496. We should support every model listed here: https://github.com/comfyanonymous/ComfyUI/blob/4201181b35402e0a992b861f8d2f0e0b267f52fa/comfy/supported_models.py#L479
This worked with `num_in_channels=8` (as in, it didn't error). However, perceptually it isn't behaving as it should.

Edit image: [image]

Edit prompt: "Turn sky into a cloudy one"
```python
import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline, EDMEulerScheduler
from diffusers.utils import load_image  # this import was missing from the original snippet

inst_file = "cosxl_edit.safetensors"
pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    inst_file, num_in_channels=8,
).to("cuda")
pipe.scheduler = EDMEulerScheduler(
    sigma_min=0.002, sigma_max=120.0, sigma_data=1.0, prediction_type="v_prediction"
)

resolution = 1024
image = load_image(
    "https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png"
).resize((resolution, resolution))
edit_instruction = "Turn sky into a cloudy one"

edited_image = pipe(
    prompt=edit_instruction,
    image=image,
    height=resolution,
    width=resolution,
    # guidance_scale=3.0,
    # image_guidance_scale=1.5,
    num_inference_steps=20,
).images[0]
```
Not sure if it's the exact guidance formulation that we have in the InstructPix2Pix pipeline though. That would matter a lot.
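For context, the diffusers InstructPix2Pix pipelines use the three-way classifier-free guidance from the InstructPix2Pix paper. A paraphrased, self-contained sketch of the per-step combination (the function name is illustrative):

```python
import torch

def combine_ip2p_guidance(
    noise_pred: torch.Tensor, guidance_scale: float, image_guidance_scale: float
) -> torch.Tensor:
    """Paraphrase of the 3-way CFG combination in the InstructPix2Pix
    pipelines. The UNet is run on a 3x batch ordered as:
    [text + image cond, image cond only, fully unconditional]."""
    noise_pred_text, noise_pred_image, noise_pred_uncond = noise_pred.chunk(3)
    return (
        noise_pred_uncond
        + guidance_scale * (noise_pred_text - noise_pred_image)
        + image_guidance_scale * (noise_pred_image - noise_pred_uncond)
    )
```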
If it's possible, could you try to initialize the StableDiffusionXLInstructPix2PixPipeline with each component initialized separately?

```python
unet = ...
text_encoder = ...
text_encoder_2 = ...
vae = ...
scheduler = ...
pipeline = ...
```
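A sketch of what that could look like, assuming a diffusers version where `UNet2DConditionModel` supports `from_single_file` (it did not in 0.27.2, as the reply below notes); the base-repo source for the non-UNet components and the checkpoint path are assumptions for illustration:

```python
import torch
from diffusers import (
    AutoencoderKL,
    EDMEulerScheduler,
    StableDiffusionXLInstructPix2PixPipeline,
    UNet2DConditionModel,
)
from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

base = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed source for non-UNet components

# The edit UNet comes from the single-file checkpoint; depending on the
# diffusers version, a config/num_in_channels override may still be needed.
unet = UNet2DConditionModel.from_single_file(
    "cosxl_edit.safetensors", torch_dtype=torch.float16
)

vae = AutoencoderKL.from_pretrained(base, subfolder="vae", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained(
    base, subfolder="text_encoder", torch_dtype=torch.float16
)
text_encoder_2 = CLIPTextModelWithProjection.from_pretrained(
    base, subfolder="text_encoder_2", torch_dtype=torch.float16
)
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
tokenizer_2 = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer_2")

# EDM scheduler config taken from the repro above.
scheduler = EDMEulerScheduler(
    sigma_min=0.002, sigma_max=120.0, sigma_data=1.0, prediction_type="v_prediction"
)

pipeline = StableDiffusionXLInstructPix2PixPipeline(
    vae=vae,
    text_encoder=text_encoder,
    text_encoder_2=text_encoder_2,
    tokenizer=tokenizer,
    tokenizer_2=tokenizer_2,
    unet=unet,
    scheduler=scheduler,
).to("cuda")
```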
> Not sure if it's the exact guidance formulation that we have in the InstructPix2Pix pipeline though. That would matter a lot.
ComfyUI uses the same InstructPix2PixConditioning node for it that they use for InstructPix2Pix itself. Overall, this is how Comfy added support for the CosXL models: https://github.com/comfyanonymous/ComfyUI/commit/1088d1850f9b13233e3cf4460ee077b15e4f712f. Once that was in, the nodes for supporting it look similar to vanilla InstructPix2Pix.

These are the nodes for the ComfyUI official edit workflow: [workflow screenshot]
> If it's possible, could you try to initialize the StableDiffusionXLInstructPix2PixPipeline with each component initialized separately?
As I'm using `from_single_file`, I don't think `UNet2DConditionModel` etc. have that method, afaik. How do you think that would help with debugging / making it work?
@apolinario you just have to scale the `image_latents` by adding this to the pipeline:
```python
# 6. Prepare Image latents
image_latents = self.prepare_image_latents(
    image,
    batch_size,
    num_images_per_prompt,
    prompt_embeds.dtype,
    device,
    do_classifier_free_guidance,
)
# Scale by the VAE scaling factor (the snippet originally multiplied
# `latents` instead of `image_latents`, which looks like a typo):
image_latents = image_latents * self.vae.config.scaling_factor
```
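One quick way to try this without editing the installed package is to shadow the bound method on the pipeline instance (a hypothetical sketch; it assumes `__call__` invokes `self.prepare_image_latents`, which it does in the current pipeline source):

```python
import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "cosxl_edit.safetensors", num_in_channels=8, torch_dtype=torch.float16
).to("cuda")

# Wrap prepare_image_latents so the scaled image latents are used.
_orig_prepare = pipe.prepare_image_latents

def _scaled_prepare(*args, **kwargs):
    return _orig_prepare(*args, **kwargs) * pipe.vae.config.scaling_factor

pipe.prepare_image_latents = _scaled_prepare
```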
Nice find! However, the SD Pix2Pix pipeline doesn't have it :o
Awesome! What's the best way to proceed here? Modify the pipeline to detect whether scaling is needed, or create a new one?
I think the following could work:
- after introducing the sigma scheduling changes to the EDM schedulers (as discussed internally with Suraj), we serialise the pipeline in the diffusers format. This gives us the scheduler with all the right configurations.
- in the pipelining code, we check if the scheduler has the EDM type and if so, we scale the latents.
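For illustration, that check might look something like this inside the pipeline (a hypothetical sketch, not existing diffusers code):

```python
from diffusers import EDMEulerScheduler

# Inside the pipeline's __call__, after preparing image_latents:
if isinstance(self.scheduler, EDMEulerScheduler):
    image_latents = image_latents * self.vae.config.scaling_factor
```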
WDYT? @yiyixuxu would love your thoughts too.
I think we should modify the pipeline to detect if scaling is needed. Based on my understanding, how we scale latents is not dependent on the scheduler type but is specific to how the model was trained; i.e., in most of our pipelines the `image_latents` are scaled regardless of which scheduler you use: https://github.com/huggingface/diffusers/blob/8e14535708f6af0794148150f5c073c4723dbbae/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py#L946
So I think we should add a pipeline config, e.g. something like `is_cosxl`, that the user can pass to `from_single_file()`; with this flag, we can map it to the correct scheduler config too in `from_single_file`. cc @DN6 here.
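Hypothetical usage of the proposed flag (illustrative only; `is_cosxl` was not an existing `from_single_file` argument at the time of this thread):

```python
pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "cosxl_edit.safetensors", num_in_channels=8, is_cosxl=True,
)
```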
If we introduce that only for `from_single_file()`, won't that introduce a discrepancy between the `from_pretrained()` and `from_single_file()` methods of InstructPix2Pix? I thought we were trying to reduce these kinds of discrepancies with Dhruv's refactor.
If the argument is added to the pipeline and is only a pipeline argument, then that wouldn't be a discrepancy. What we want is to avoid configuring models via pipeline invocations.
> What we want is to avoid configuring models via pipeline invocations

Like this?

```python
pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "https://huggingface.co/stabilityai/cosxl/blob/main/cosxl.safetensors", num_in_channels=8,
)
```
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.