Error(s) in initializing SD3ControlNetModel by from_transformer
Describe the bug
Reproduction
from diffusers.models.controlnet_sd3 import SD3ControlNetModel
from diffusers.models.transformers import SD3Transformer2DModel
transformer = SD3Transformer2DModel.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", subfolder="transformer")
controlnet = SD3ControlNetModel.from_transformer(transformer)
Logs
No response
System Info
- 🤗 Diffusers version: 0.29.2
- Platform: Linux-4.18.0
- Running on a notebook?: No
- Running on Google Colab?: No
- Python version: 3.8.3
- PyTorch version (GPU?): 2.0.1 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.23.4
- Transformers version: 4.41.2
- Accelerate version: 0.23.0
- PEFT version: installed
- Bitsandbytes version: installed
- xFormers version: not installed
- Accelerator: NVIDIA A100 80GB
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
No response
Cc: @haofanwang
Yeah, I think we reversed strict=False and strict=True. Can you try this? And if it works, would you be willing to open a PR?
https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/controlnet_sd3.py#L242
config = transformer.config
config["num_layers"] = num_layers or config.num_layers
controlnet = cls(**config)

if load_weights_from_transformer:
    controlnet.pos_embed.load_state_dict(transformer.pos_embed.state_dict())
    controlnet.time_text_embed.load_state_dict(transformer.time_text_embed.state_dict())
    controlnet.context_embedder.load_state_dict(transformer.context_embedder.state_dict())
    controlnet.transformer_blocks.load_state_dict(transformer.transformer_blocks.state_dict(), strict=False)

    controlnet.pos_embed_input = zero_module(controlnet.pos_embed_input)

return controlnet
It still doesn't work and reports the same bug.
controlnet.transformer_blocks and transformer.transformer_blocks are different at the last block, and I'm trying to find out why. I think that is the reason. So I set the last layer of controlnet.transformer_blocks to

JointTransformerBlock(
    dim=self.inner_dim,
    num_attention_heads=num_attention_heads,
    attention_head_dim=self.inner_dim,
    context_pre_only=True,
)

and this time it works. But I'm not very sure it is correct to do so.
SD3's transformer blocks are:

self.transformer_blocks = nn.ModuleList(
    [
        JointTransformerBlock(
            dim=self.inner_dim,
            num_attention_heads=self.config.num_attention_heads,
            attention_head_dim=self.inner_dim,
            context_pre_only=i == num_layers - 1,
        )
        for i in range(self.config.num_layers)
    ]
)

while SD3ControlNet's transformer blocks are:

self.transformer_blocks = nn.ModuleList(
    [
        JointTransformerBlock(
            dim=self.inner_dim,
            num_attention_heads=self.config.num_attention_heads,
            attention_head_dim=self.inner_dim,
            context_pre_only=False,
        )
        for i in range(self.config.num_layers)
    ]
)

Maybe this is the problem?
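A minimal plain-dict sketch (no diffusers required, and the parameter names are purely illustrative) of why this mismatch breaks a strict load: with context_pre_only=True the base model's last block has no context feed-forward/norm parameters, so the ControlNet block expects keys the base block's state_dict cannot supply.

```python
# Hypothetical parameter names; the real blocks have many more entries.
# Base model's LAST block: context_pre_only=True, so no context FF/norm weights.
base_last_block = {
    "attn.to_q.weight": 1.0,
    "attn.add_q_proj.weight": 2.0,
}
# ControlNet block: context_pre_only=False, so it also owns context FF/norm weights.
controlnet_block = {
    "attn.to_q.weight": 0.0,
    "attn.add_q_proj.weight": 0.0,
    "norm2_context.weight": 0.0,
    "ff_context.proj.weight": 0.0,
}

# Keys the ControlNet block expects but the base block cannot provide.
missing = sorted(set(controlnet_block) - set(base_last_block))
print(missing)
```

With load_state_dict, strict=True raises on these missing keys, while strict=False skips them and leaves those parameters at their initialized values.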
@wangqixun Can you explain a bit here? The last block is different.
In fact, the example ControlNet (InstantX/SD3-Controlnet-Canny) can be loaded correctly, so it's a little confusing...
No worries, I'm testing. Will update soon.
Things are a bit different for a transformer-based ControlNet, because we don't fix the number of layers in the ControlNet, whereas a UNet-based ControlNet always uses down_blocks and mid_block.
(1) If we set context_pre_only=i == num_layers - 1 in SD3ControlNetModel, we have to set num_layers to the same value as the SD3 base model; otherwise there will be a size-mismatch error. But that is not recommended, because in that case the ControlNet is very heavy and behaves more like a ReferenceNet. As you can see from our released checkpoints, num_layers for the ControlNet is 6 or 12, i.e., a half copy of the base model.
(2) So our current solution is to set num_layers in from_transformer to 12 by default instead of None. Then you can freely load weights from the transformer, because we only use intermediate layers whose context_pre_only is False. The only obstacle is that we cannot set num_layers=24, as the last block is different.
In your usage, you can set it manually with controlnet = SD3ControlNetModel.from_transformer(transformer, num_layers=6).
@ChenhLiwnl
Sorry, but another question: I noticed that in the main branch the transformer blocks' attention_head_dim is set to self.config.attention_head_dim, while in the v0.29.2 release it is self.inner_dim. Currently, self.config.attention_head_dim of the released model seems to be 64 while self.inner_dim is 1536, so they are not the same value. Which one is right?
@ChenhLiwnl it is a bug we fixed; main is correct: https://github.com/huggingface/diffusers/pull/8608
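For context, the two numbers are related rather than interchangeable: attention_head_dim is the per-head width and inner_dim is heads times head width. A quick check with the values quoted above (the head count of 24 is an assumption about the SD3-medium config, not stated in this thread):

```python
num_attention_heads = 24   # assumed from the SD3-medium config
attention_head_dim = 64    # per-head width, as reported above
inner_dim = num_attention_heads * attention_head_dim
print(inner_dim)  # 1536, matching the value reported above
```

So passing inner_dim where a per-head width is expected inflates the attention dimension, which is why main switched to self.config.attention_head_dim.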
closing this now since the issue is resolved! :)
Hi there, when specifying num_layers in SD3ControlNetModel.from_transformer, the function raises an error about unexpected keys in the state_dict because of the remaining layers in the SD3 transformer. I think it can be fixed with a for loop that loads the weights of only the specified layers. I would be grateful if you could fix it, or I can fix it myself.
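A hedged sketch of the proposed for-loop fix, with plain dicts standing in for module state_dicts (load_first_layers and the key names are hypothetical, not the diffusers API): copying weights block-by-block for only the first num_layers blocks means the base model's extra layers never show up as unexpected keys.

```python
def load_first_layers(controlnet_blocks, transformer_blocks, num_layers):
    """Copy weights block-by-block for the first `num_layers` blocks only."""
    for i in range(num_layers):
        controlnet_blocks[i].update(transformer_blocks[i])

# 24 base-model blocks vs. a 6-layer ControlNet (tensors elided to one value each).
transformer_blocks = [{"attn.to_q.weight": float(i)} for i in range(24)]
controlnet_blocks = [{"attn.to_q.weight": 0.0} for _ in range(6)]

load_first_layers(controlnet_blocks, transformer_blocks, num_layers=6)
print(controlnet_blocks[5])  # weights copied from base block 5; blocks 6..23 untouched
```

In the real module-based code, the loop body would be something like controlnet.transformer_blocks[i].load_state_dict(transformer.transformer_blocks[i].state_dict()) per block instead of a dict update.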
Could you open a new issue with a minimal reproducer? Cc: @haofanwang
Sure. The issue is opened.