Z-Image-Turbo ControlNet
What does this PR do?
In the original code this is not a typical ControlNet: it is integrated into the transformer and relies on operations performed in the transformer's forward. This PR implements it as a typical ControlNet by duplicating the necessary operations from the transformer's forward into the ControlNet's forward, and by passing the transformer to ZImageControlNetModel's forward so it can access the required transformer modules. As a result this is perhaps a little slower than the original implementation, but it keeps things clean and in style.
ZImageTransformer2DModel has minimal changes: controlnet_block_samples is introduced, a Dict[int, torch.Tensor] returned from ZImageControlNetModel, where the int is the index of the ZImageTransformer2DModel layer it applies to. This is another difference from a typical ControlNet, where every block has the ControlNet output applied.
ZImageControlNetPipeline has minimal changes compared to ZImagePipeline: it adds a prepare_image function and the control_image and controlnet_conditioning_scale parameters, prepares and encodes control_image, and calls the controlnet to obtain controlnet_block_samples, which are passed to the transformer. control_guidance_start/control_guidance_end is not yet implemented.
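The gist of how the per-layer dict is consumed on the transformer side, as a rough sketch (illustrative only; the function name, the standalone layer loop and the exact call signature are assumptions, not the actual ZImageTransformer2DModel code):

from typing import Dict, Optional

import torch
import torch.nn as nn


def apply_controlnet_block_samples(
    layers: nn.ModuleList,
    hidden_states: torch.Tensor,
    controlnet_block_samples: Optional[Dict[int, torch.Tensor]] = None,
) -> torch.Tensor:
    # Unlike a typical ControlNet, where every block receives a residual, only
    # the layer indices present in the dict get the ControlNet output added.
    for index, layer in enumerate(layers):
        hidden_states = layer(hidden_states)
        if controlnet_block_samples is not None and index in controlnet_block_samples:
            hidden_states = hidden_states + controlnet_block_samples[index]
    return hidden_states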
Test code
python scripts/convert_z_image_controlnet_to_diffusers.py --original_controlnet_repo_id "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union" --filename "Z-Image-Turbo-Fun-Controlnet-Union.safetensors" --output_path "z-image-controlnet-hf"
import torch
from diffusers import ZImageControlNetPipeline
from diffusers import ZImageControlNetModel
from diffusers.utils import load_image
controlnet_model = "z-image-controlnet-hf"
controlnet = ZImageControlNetModel.from_pretrained(
    controlnet_model, torch_dtype=torch.bfloat16
)
pipe = ZImageControlNetPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")
control_image = load_image("https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union/resolve/main/asset/pose.jpg?download=true")
prompt = "一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。"
image = pipe(
    prompt,
    control_image=control_image,
    controlnet_conditioning_scale=0.75,
    height=1728,
    width=992,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(43),
).images[0]
image.save("zimage.png")
Output
| PR | Original |
|---|---|
Fixes #12769
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
How do we support control modes other than Canny? Other control modes still produce poor results—here’s what I got when I tried the HED example from the demo with prompt “A man holding a bottle” and input image.
Hi, I ran a test with your image and got the following result:
GT:
HED:
Result 1:
Steps: 9
CFG: 0
Control Scale: 0.7
Prompt: A man holding a bottle
Result 2:
Steps: 9
CFG: 2.5
Control Scale: 0.75
Prompt: raw photo, portrait of a handsome Asian man sitting at a wooden table, holding a green glass bottle, wearing a black sweater, wristwatch, highly detailed skin texture, realistic pores, serious gaze, soft cinematic lighting, rim lighting, balanced exposure, 8k uhd, dslr, sharp focus, wood grain texture.
Negative prompt: underexposed, crushed blacks, too dark, heavy shadows, makeup, smooth skin, plastic, wax, cartoon, illustration, distorted hands, bad anatomy, blur, haze, flat lighting.
To achieve a realistic effect, you will need to apply the Hires.Fix technique to the image after it has been generated, like this:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ohh, I thought about this a bit more. I think we should try an Option 3 that's a middle ground between Options 1 and 2: instead of combining everything into one model, the controlnet only loads the shared layers from the transformer. So your from_transformer would look something like this:
class ZImageControlNetModel:
    @classmethod
    def from_transformer(cls, controlnet, transformer):
        ....
        controlnet.t_embedder = transformer.t_embedder
        controlnet.all_x_embedder = transformer.all_x_embedder
        controlnet.cap_embedder = transformer.cap_embedder
        return controlnet
In the pipeline, we still have both controlnet and transformer components, and it should work similarly to our other controlnet pipelines.
what do you think?
@yiyixuxu Option 3 sounds good to me, I've made those changes here: a00f104. Let me know if you have any further comments; I will add support for from_single_file next.
from_single_file:
import torch
from diffusers import ZImageControlNetModel
from huggingface_hub import hf_hub_download
controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
        filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
@hlky If you need a ControlNet GGUF to test the loading as well, I can try to generate one for you. I'm currently testing this model and trying to build an image restoration pipeline. Here is the link to the unified model I mentioned earlier if you want to check it out: https://huggingface.co/elismasilva/z-image-control-turbo-unified. I'll switch to using your implementation later.
cc @DN6 can you take a look for the single file?
I've added # Copied from statements and removed the imports from the Z-Image transformer.
alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0 is released. https://github.com/huggingface/diffusers/pull/12792/commits/04388f4698b303785b26bec6179a55aea652a388 should be ok for the modeling changes; I will add the inpaint pipeline and test inference later.
Loading the v2 checkpoint is tested:
import torch
from diffusers import ZImageControlNetModel
from huggingface_hub import hf_hub_download
controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
        filename="Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
Note: This demonstrates expanded usage of the introduced config_create_fn. create_z_image_controlnet_config now uses the checkpoint to find the shape of a specific layer and returns the appropriate fixed config. In this case a fixed config makes sense because we cannot reliably determine the exact control_layers_places or control_refiner_layers_places, but in other cases config_create_fn functions could produce the entire config dynamically by checking for the existence of certain keys, the dimensions of certain weights, the number of layers with a certain prefix, etc.
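For illustration, a sketch of what a config_create_fn-style helper could look like; the key name, shape threshold and config fields below are made up for demonstration and are not the actual create_z_image_controlnet_config:

def example_create_controlnet_config(checkpoint: dict) -> dict:
    # Return a fixed config selected by the shape of a known layer; the key
    # name and the threshold here are hypothetical.
    if checkpoint["control_layers.0.linear.weight"].shape[0] == 3072:
        config = {"variant": "large"}
    else:
        config = {"variant": "base"}
    # A config_create_fn could also derive parts of the config dynamically,
    # e.g. by counting layers with a certain prefix.
    config["num_control_layers"] = len(
        {key.split(".")[1] for key in checkpoint if key.startswith("control_layers.")}
    )
    return config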
1.0
import torch
from diffusers import ZImageControlNetPipeline
from diffusers import ZImageControlNetModel
from diffusers.utils import load_image
from huggingface_hub import hf_hub_download
controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
        filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
pipe = ZImageControlNetPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")
control_image = load_image("https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union/resolve/main/asset/pose.jpg?download=true")
prompt = "一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。"
image = pipe(
    prompt,
    control_image=control_image,
    controlnet_conditioning_scale=0.75,
    height=1728,
    width=992,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(43),
).images[0]
image.save("z-image_controlnet-1.png")
2.0 T2I
Note: 2.0 requires more inference steps. The same prompt as 1.0 is used here, so this differs from the official example, which changed the prompt between 1.0 and 2.0.
import torch
from diffusers import ZImageControlNetPipeline
from diffusers import ZImageControlNetModel
from diffusers.utils import load_image
from huggingface_hub import hf_hub_download
controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
        filename="Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
pipe = ZImageControlNetPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")
control_image = load_image("https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/pose.jpg?download=true")
prompt = "一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。"
image = pipe(
    prompt,
    control_image=control_image,
    controlnet_conditioning_scale=0.75,
    height=1728,
    width=992,
    num_inference_steps=25,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(43),
).images[0]
image.save("z-image_controlnet-2.png")
2.0 Inpaint
Note: Using the same prompt as the official example, which is different from the prompt in the examples above.
import torch
from diffusers import ZImageControlNetInpaintPipeline
from diffusers import ZImageControlNetModel
from diffusers.utils import load_image
from huggingface_hub import hf_hub_download
controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
        filename="Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
pipe = ZImageControlNetInpaintPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
image = load_image(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/inpaint.jpg?download=true"
)
mask_image = load_image(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/mask.jpg?download=true"
)
control_image = load_image(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/pose.jpg?download=true"
)
prompt = "一位年轻女子站在阳光明媚的海岸线上,画面为全身竖构图,身体微微侧向右侧,左手自然下垂,右臂弯曲扶在腰间,她的手指清晰可见,站姿放松而略带羞涩。她身穿轻盈的白色连衣裙,裙摆在海风中轻轻飘动,布料半透、质感柔软。女子拥有一头鲜艳的及腰紫色长发,被海风吹起,在身侧轻盈飞舞,发间系着一个精致的黑色蝴蝶结,与发色形成对比。她面容清秀,眉目精致,肤色白皙细腻,表情温柔略显羞涩,微微低头,眼神静静望向远处的海平线,流露出甜美的青春气息与若有所思的神情。背景是辽阔无垠的海洋与蔚蓝天空,阳光从侧前方洒下,海面波光粼粼,泛着温暖的金色光晕,天空清澈明亮,云朵稀薄,整体色调清新唯美。"
image = pipe(
    prompt,
    image=image,
    mask_image=mask_image,
    control_image=control_image,
    controlnet_conditioning_scale=0.75,
    height=1728,
    width=992,
    num_inference_steps=25,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(43),
).images[0]
image.save("zimage-inpaint.png")
| 1.0 | 2.0 | Inpaint |
|---|---|---|
All working.
We can remove the control_noise_refiner. keys from the 2.0 state_dict and model, as they are unused, to save some VRAM. This did require changing _should_convert_state_dict_to_diffusers, because set(model_state_dict.keys()) is a subset of set(checkpoint_state_dict.keys()); this doesn't usually happen, as from_single_file checkpoints typically have different keys, so I added a check for an exact match of set(model_state_dict.keys()) and set(checkpoint_state_dict.keys()) as well. I don't think this will cause any issues with other checkpoint/model types.
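For reference, a rough sketch of the gist of that check, not the actual diffusers helper:

def should_convert_state_dict(model_state_dict: dict, checkpoint_state_dict: dict) -> bool:
    model_keys = set(model_state_dict.keys())
    checkpoint_keys = set(checkpoint_state_dict.keys())
    # Only an exact key match means the checkpoint is already in diffusers
    # format; a checkpoint whose keys are a strict superset of the model keys
    # (as with the removed control_noise_refiner. weights) still gets converted.
    return model_keys != checkpoint_keys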
It seems that this version 2.0 of Z-Image's controlnet will soon be replaced by another version, since they made a mistake training the model without using the model's control_noise_refiner. So I think it's not worth investing much in this version; they are training it again.
See this note for more details.
Thanks @elismasilva @iwr-redmond, I saw that. We will have to see what happens; for now 2.0 works. If another version is released I'll update this PR or make another.
Also, if you're curious about the potential speed difference of using control_refiner_layers: revert dd9775c and edit self.control_layers to self.control_refiner_layers (https://github.com/huggingface/diffusers/blob/dd9775caf0906c6727fcbe2797cf6dc60cc38f45/src/diffusers/models/controlnets/controlnet_z_image.py#L690-L697).
In my testing it was not much faster, so unless they change anything else, don't expect a big improvement to inference times.
2.1 is up: https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/tree/main
I've changed add_control_noise_refiner from bool to Literal["control_layers", "control_noise_refiner"], where control_layers is 2.0 and control_noise_refiner is 2.1.
The keys in the weights are the same between 2.0 and 2.1, so we have no way to distinguish between them other than passing a config. I've therefore created hlky/Z-Image-Turbo-Fun-Controlnet-Union, hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0 and hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.1. These include the weights, so we can also use from_pretrained. For from_single_file, 1.0 vs 2.x is detected, and passing config="hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0" is only required for 2.0; the default config is for 2.1. The config_create_fn-related code is removed.
import torch
from diffusers import ZImageControlNetModel
controlnet = ZImageControlNetModel.from_single_file(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union/blob/main/Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    torch_dtype=torch.bfloat16,
)
controlnet = ZImageControlNetModel.from_single_file(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/blob/main/Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
    torch_dtype=torch.bfloat16,
    config="hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
)
controlnet = ZImageControlNetModel.from_single_file(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/blob/main/Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors",
    torch_dtype=torch.bfloat16,
)
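Since those Hub repos include the weights, from_pretrained also works; a minimal example with the 2.1 repo mentioned above:

import torch
from diffusers import ZImageControlNetModel

controlnet = ZImageControlNetModel.from_pretrained(
    "hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.1", torch_dtype=torch.bfloat16
)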
I've left a comment here: https://github.com/aigc-apps/VideoX-Fun/pull/404.
I think he's willing to make the model available correctly according to the standard; if I'm wrong, you can help out there.
@elismasilva I don't think there's any need for them to make changes to their own repo. At most, we could make a PR adding only the config to their Hub repos, but there would still need to be another Hub repo for the 2.0 vs 2.1 config.
For now I think it is fine: we have Hub repos for all versions which can be used with from_pretrained, and the original weights can be used with from_single_file. The only caveat is that users need to pass config="hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0" for 2.0.
Edit: We could detect 2.0 vs 2.1 based on the contents of the weights, then PR a config to the original Hub repo and switch between the 2.1 and 2.0 config based on that detection, but I think it's fine the way it is, just uploading configs to the Hub for each version; we get from_pretrained support this way too.
@hlky Ah yes, actually I only suggested it in case he also wants to align with that standard. Locally I did what you said and removed the 2.0 refiner weights just to make it lighter. But in his case I don't see the need for a 2.1, just an updated 2.0 with the correct weights.
Prompt: 一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。
Control image
Note: Parameters (and seed) match the original code, but the same prompt is used here; in the original code examples the prompt is changed between versions.
1.0: 9 steps
2.x: 25 steps
Scale: 0.75
A40 (runpod)
| 1.0 | 2.0 | 2.1 |
|---|---|---|
| [00:18<00:00, 2.02s/it] | [01:21<00:00, 3.24s/it] | [01:04<00:00, 2.57s/it] |
Prompt: 一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。
Control image
Inpaint image
Mask image
| 2.0 | 2.1 |
|---|---|
| [01:22<00:00, 3.29s/it] | [01:05<00:00, 2.60s/it] |
Relatively, there is not much speed difference between 2.0 and 2.1. For the text-to-image case, both 2.0 and 2.1 seem to perform poorly with hands; the same applies to the inpaint case, but 2.1 does better colour matching.
Also, just to note, these results do not match the results from the Hub repo, but neither do the results when using the original code; I assume a different prompt, scale or number of inference steps was used for the Hub results.
@bot /style
Style bot fixed some files and pushed the changes.
