
Z-Image-Turbo ControlNet

hlky opened this pull request 4 weeks ago

What does this PR do?

In the original code this is not a typical ControlNet: it is integrated into the transformer and relies on operations performed in the transformer's forward. In this PR we implement it as a typical ControlNet by duplicating the necessary operations from the transformer's forward into the ControlNet's forward, and by passing the transformer to ZImageControlNetModel's forward so that the necessary transformer modules can be accessed. As a result this is perhaps a little slower than the original implementation, but it keeps things clean and in style.

ZImageTransformer2DModel has minimal changes: controlnet_block_samples is introduced, a Dict[int, torch.Tensor] returned from ZImageControlNetModel in which the int is the index of a ZImageTransformer2DModel layer. This is another difference from a typical ControlNet, where every block has the ControlNet output applied.

ZImageControlNetPipeline has minimal changes compared to ZImagePipeline: it adds a prepare_image function, adds the control_image and controlnet_conditioning_scale parameters, prepares and encodes control_image, and calls the controlnet to obtain controlnet_block_samples, which are passed to the transformer. control_guidance_start/control_guidance_end is not yet implemented.
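For illustration, here is a minimal, self-contained sketch of how a Dict[int, torch.Tensor] of ControlNet residuals can be consumed inside a transformer's layer loop. The module and argument names mirror the description above, but the code is a simplified assumption, not the actual ZImageTransformer2DModel forward.

import torch
import torch.nn as nn
from typing import Dict, Optional

class TinyTransformer(nn.Module):
    def __init__(self, num_layers: int = 4, dim: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(
        self,
        hidden_states: torch.Tensor,
        controlnet_block_samples: Optional[Dict[int, torch.Tensor]] = None,
    ) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            hidden_states = layer(hidden_states)
            # Only layers whose index appears in the dict receive a residual,
            # unlike a typical ControlNet where every block gets one.
            if controlnet_block_samples is not None and i in controlnet_block_samples:
                hidden_states = hidden_states + controlnet_block_samples[i]
        return hidden_states

samples = {1: torch.zeros(2, 8), 3: torch.zeros(2, 8)}
out = TinyTransformer()(torch.randn(2, 8), controlnet_block_samples=samples)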

Test code

python scripts/convert_z_image_controlnet_to_diffusers.py --original_controlnet_repo_id "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union" --filename "Z-Image-Turbo-Fun-Controlnet-Union.safetensors" --output_path "z-image-controlnet-hf"
import torch
from diffusers import ZImageControlNetPipeline
from diffusers import ZImageControlNetModel
from diffusers.utils import load_image

controlnet_model = "z-image-controlnet-hf"
controlnet = ZImageControlNetModel.from_pretrained(
    controlnet_model, torch_dtype=torch.bfloat16
)
pipe = ZImageControlNetPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")
control_image = load_image("https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union/resolve/main/asset/pose.jpg?download=true")
prompt = "一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。"
image = pipe(
    prompt,
    control_image=control_image,
    controlnet_conditioning_scale=0.75,
    height=1728,
    width=992,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(43),
).images[0]
image.save("zimage.png")

Output

PR vs. original: [comparison images]

Fixes #12769

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

hlky · Dec 04 '25 21:12

How do we support control modes other than Canny? Other control modes still produce poor results. Here's what I got when I tried the HED example from the demo with prompt "A man holding a bottle" and input image: [result image]

e1ijah1 · Dec 09 '25 06:12

> How do we support control modes other than Canny? Other control modes still produce poor results. Here's what I got when I tried the HED example from the demo with prompt "A man holding a bottle" and input image: [result image]

Hi, I ran a test with your image and got the following result:

GT: [input image]

HED: [HED control image]

Result 1: Steps: 9, CFG: 0, Control Scale: 0.7, Prompt: A man holding a bottle [result image]

Result 2: Steps: 9, CFG: 2.5, Control Scale: 0.75, Prompt: raw photo, portrait of a handsome Asian man sitting at a wooden table, holding a green glass bottle, wearing a black sweater, wristwatch, highly detailed skin texture, realistic pores, serious gaze, soft cinematic lighting, rim lighting, balanced exposure, 8k uhd, dslr, sharp focus, wood grain texture. Negative prompt: underexposed, crushed blacks, too dark, heavy shadows, makeup, smooth skin, plastic, wax, cartoon, illustration, distorted hands, bad anatomy, blur, haze, flat lighting. [result image]

To achieve a realistic effect, you will need to apply the Hires.Fix technique to the image after it has been generated, like this: [upscaled result image]

elismasilva · Dec 09 '25 10:12


Hmm, I thought about this a bit more. I think we should try an Option 3 that's a middle ground between Option 1 and 2: instead of combining everything into one model, the controlnet only loads the shared layers from the transformer. So your from_transformer would look something like this:

class ZImageControlNetModel:
    @classmethod
    def from_transformer(cls, controlnet, transformer):
        ....
        controlnet.t_embedder = transformer.t_embedder
        controlnet.all_x_embedder = transformer.all_x_embedder
        controlnet.cap_embedder = transformer.cap_embedder
        return controlnet

In the pipeline we still have both controlnet and transformer components, and it should work similarly to our other ControlNet pipelines.

what do you think?

yiyixuxu · Dec 11 '25 00:12

@yiyixuxu Option 3 sounds good to me; I've made those changes in a00f104. Let me know if you have any further comments. I will add support for from_single_file next.

hlky · Dec 11 '25 01:12

from_single_file:

import torch
from diffusers import ZImageControlNetModel
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
        filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)

hlky · Dec 11 '25 03:12

> from_single_file: (quoting the code example from the previous comment)

@hlky If you need a ControlNet GGUF to test the loading as well, I can try to generate one for you. I'm currently testing this model and trying to build an image restoration pipeline. Here is the link to the unified model I mentioned earlier, if you want to check it out: https://huggingface.co/elismasilva/z-image-control-turbo-unified, but I'll switch to using your implementation later.
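For reference, if a ControlNet GGUF does get published and single-file GGUF loading ends up working for this model, it would presumably follow the usual diffusers GGUF pattern, roughly like the sketch below. The repo id and filename here are hypothetical.

import torch
from diffusers import GGUFQuantizationConfig, ZImageControlNetModel

# Hypothetical repo id and filename; no such GGUF checkpoint exists yet.
controlnet = ZImageControlNetModel.from_single_file(
    "https://huggingface.co/some-user/Z-Image-Turbo-Fun-Controlnet-Union-GGUF/blob/main/Z-Image-Turbo-Fun-Controlnet-Union-Q4_K_M.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)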

elismasilva · Dec 11 '25 04:12

cc @DN6 can you take a look for the single file?

yiyixuxu · Dec 11 '25 19:12

I've added a # Copied from statement and removed the imports from the Z-Image transformer.

hlky · Dec 11 '25 21:12

alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0 is released. https://github.com/huggingface/diffusers/pull/12792/commits/04388f4698b303785b26bec6179a55aea652a388 should be ok for the modeling changes; I will add the inpaint pipeline and test inference later.

Loading the v2 checkpoint has been tested:

import torch
from diffusers import ZImageControlNetModel
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
        filename="Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)

Note: This demonstrates expanded usage of the newly introduced config_create_fn. create_z_image_controlnet_config now uses the checkpoint to find the shape of a specific layer and returns the appropriate fixed config. In this case a fixed config makes sense, as we cannot reliably determine the exact control_layers_places or control_refiner_layers_places; in other cases, config_create_fn functions could produce the entire config dynamically by checking for the existence of certain keys, the dimensions of certain weights, the number of layers with a certain prefix, etc.
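As a toy illustration of that kind of checkpoint inspection: the key names, shapes and config fields below are invented for the example and do not match the real create_z_image_controlnet_config.

import torch

def create_example_controlnet_config(checkpoint: dict) -> dict:
    # Pick part of the config from the shape of a known layer...
    in_features = checkpoint["control_layers.0.weight"].shape[1]
    # ...and derive other fields dynamically, e.g. by counting layers with a prefix.
    layer_ids = {k.split(".")[1] for k in checkpoint if k.startswith("control_layers.")}
    return {"in_channels": in_features, "num_control_layers": len(layer_ids)}

checkpoint = {f"control_layers.{i}.weight": torch.zeros(16, 8) for i in range(6)}
print(create_example_controlnet_config(checkpoint))  # {'in_channels': 8, 'num_control_layers': 6}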

hlky · Dec 13 '25 11:12

1.0

import torch
from diffusers import ZImageControlNetPipeline
from diffusers import ZImageControlNetModel
from diffusers.utils import load_image
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
        filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
pipe = ZImageControlNetPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")
control_image = load_image("https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union/resolve/main/asset/pose.jpg?download=true")
prompt = "一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。"
image = pipe(
    prompt,
    control_image=control_image,
    controlnet_conditioning_scale=0.75,
    height=1728,
    width=992,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(43),
).images[0]
image.save("z-image_controlnet-1.png")

2.0 T2I

Note: 2.0 requires more inference steps. The same prompt as 1.0 is used here, so this differs from the official example, which changed the prompt between 1.0 and 2.0.

import torch
from diffusers import ZImageControlNetPipeline
from diffusers import ZImageControlNetModel
from diffusers.utils import load_image
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
        filename="Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
pipe = ZImageControlNetPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")
control_image = load_image("https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/pose.jpg?download=true")
prompt = "一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。"
image = pipe(
    prompt,
    control_image=control_image,
    controlnet_conditioning_scale=0.75,
    height=1728,
    width=992,
    num_inference_steps=25,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(43),
).images[0]
image.save("z-image_controlnet-2.png")

2.0 Inpaint

Note: This uses the same prompt as the official example, which is different from the prompt in the examples above.

import torch
from diffusers import ZImageControlNetInpaintPipeline
from diffusers import ZImageControlNetModel
from diffusers.utils import load_image
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
        filename="Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
pipe = ZImageControlNetInpaintPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
image = load_image(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/inpaint.jpg?download=true"
)
mask_image = load_image(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/mask.jpg?download=true"
)
control_image = load_image(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/pose.jpg?download=true"
)
prompt = "一位年轻女子站在阳光明媚的海岸线上,画面为全身竖构图,身体微微侧向右侧,左手自然下垂,右臂弯曲扶在腰间,她的手指清晰可见,站姿放松而略带羞涩。她身穿轻盈的白色连衣裙,裙摆在海风中轻轻飘动,布料半透、质感柔软。女子拥有一头鲜艳的及腰紫色长发,被海风吹起,在身侧轻盈飞舞,发间系着一个精致的黑色蝴蝶结,与发色形成对比。她面容清秀,眉目精致,肤色白皙细腻,表情温柔略显羞涩,微微低头,眼神静静望向远处的海平线,流露出甜美的青春气息与若有所思的神情。背景是辽阔无垠的海洋与蔚蓝天空,阳光从侧前方洒下,海面波光粼粼,泛着温暖的金色光晕,天空清澈明亮,云朵稀薄,整体色调清新唯美。"
image = pipe(
    prompt,
    image=image,
    mask_image=mask_image,
    control_image=control_image,
    controlnet_conditioning_scale=0.75,
    height=1728,
    width=992,
    num_inference_steps=25,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(43),
).images[0]
image.save("zimage-inpaint.png")

Outputs: 1.0 [image], 2.0 [image], 2.0 Inpaint [image]

All working.

hlky · Dec 13 '25 13:12

We can remove the control_noise_refiner. keys from the 2.0 state_dict and model, as they are unused, to save some VRAM. This did require changing _should_convert_state_dict_to_diffusers, though, because set(model_state_dict.keys()) becomes a subset of set(checkpoint_state_dict.keys()). This doesn't usually happen, as from_single_file checkpoints typically have different keys, so I added a check for an exact match of set(model_state_dict.keys()) and set(checkpoint_state_dict.keys()) as well. I don't think this will cause any issues with other checkpoint/model types.
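To make the subset-versus-exact-match distinction concrete, here is a self-contained toy example. The keys are made up, and this is not the actual _should_convert_state_dict_to_diffusers implementation.

# Toy key sets: the trimmed model no longer has the control_noise_refiner. keys,
# so its keys are a strict subset of the checkpoint's keys rather than an exact match.
model_state_dict = {"control_layers.0.weight": 0, "control_layers.0.bias": 0}
checkpoint_state_dict = {
    "control_layers.0.weight": 0,
    "control_layers.0.bias": 0,
    "control_noise_refiner.0.weight": 0,  # present in the checkpoint, unused by the model
}
model_keys = set(model_state_dict)
checkpoint_keys = set(checkpoint_state_dict)
print(model_keys.issubset(checkpoint_keys))  # True: subset relationship
print(model_keys == checkpoint_keys)         # False: the new exact-match check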

hlky · Dec 13 '25 14:12

> We can remove control_noise_refiner. from the 2.0 state_dict and model as they are unused to save some VRAM [...]

It seems that version 2.0 of Z-Image's ControlNet will soon be replaced by another version, since they made a mistake and trained the model without using the control_noise_refiner. So I think it's not worth investing much in this version; they are training it again.

elismasilva · Dec 16 '25 23:12

See this note for more details.

iwr-redmond · Dec 16 '25 23:12

Thanks @elismasilva @iwr-redmond, I saw that. We will have to see what happens; for now 2.0 works. If another version is released I'll update this PR or make another.

Also, if you're curious about the potential speed difference of using control_refiner_layers: revert dd9775c and change self.control_layers to self.control_refiner_layers at https://github.com/huggingface/diffusers/blob/dd9775caf0906c6727fcbe2797cf6dc60cc38f45/src/diffusers/models/controlnets/controlnet_z_image.py#L690-L697

In my testing it was not much faster, so unless they change anything else don't expect a big improvement to inference times.

hlky · Dec 17 '25 09:12

2.1 is up: https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/tree/main

tin2tin · Dec 17 '25 10:12

24f454c

I've changed add_control_noise_refiner from bool to Literal["control_layers", "control_noise_refiner"] where control_layers is 2.0 and control_noise_refiner is 2.1.
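As a small, self-contained illustration of that Literal-based switch (the branch bodies are placeholders, not the actual ZImageControlNetModel logic):

from typing import Literal

ControlRefinerMode = Literal["control_layers", "control_noise_refiner"]

def describe_mode(add_control_noise_refiner: ControlRefinerMode) -> str:
    if add_control_noise_refiner == "control_layers":
        return "2.0-style checkpoint: use the control_layers path"
    if add_control_noise_refiner == "control_noise_refiner":
        return "2.1-style checkpoint: use the control_noise_refiner path"
    raise ValueError(f"unknown mode: {add_control_noise_refiner}")

print(describe_mode("control_noise_refiner"))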

The keys in the weights are the same between 2.0 and 2.1, so we have no way to distinguish between them other than passing a config. I've therefore created hlky/Z-Image-Turbo-Fun-Controlnet-Union, hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0 and hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.1. These include the weights, so we can also use from_pretrained. For from_single_file, 1.0 vs 2.x is detected, and passing config="hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0" is only required for 2.0; the default config is for 2.1. The config_create_fn related code is removed.

import torch
from diffusers import ZImageControlNetModel

# 1.0
controlnet = ZImageControlNetModel.from_single_file(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union/blob/main/Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    torch_dtype=torch.bfloat16,
)

# 2.0: the config must be passed, since the weight keys match 2.1
controlnet = ZImageControlNetModel.from_single_file(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/blob/main/Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
    torch_dtype=torch.bfloat16,
    config="hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
)

# 2.1 (the default config)
controlnet = ZImageControlNetModel.from_single_file(
    "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/blob/main/Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors",
    torch_dtype=torch.bfloat16,
)

hlky · Dec 17 '25 11:12

> I've changed add_control_noise_refiner from bool to Literal["control_layers", "control_noise_refiner"] where control_layers is 2.0 and control_noise_refiner is 2.1. [...]

I've made a comment here: https://github.com/aigc-apps/VideoX-Fun/pull/404.

I think he's willing to make the model available correctly according to the standard; if I'm wrong, you can help out there.

elismasilva · Dec 17 '25 13:12

@elismasilva I don't think there's any need for them to make changes to their own repo. At most we could make a PR adding only the config to their Hub repos, but there would still need to be another Hub repo for the 2.0 vs 2.1 config.

For now, I think it is fine: we have Hub repos for all versions, which can be used with from_pretrained, and the original weights can be used with from_single_file. The only caveat is that users need to pass config="hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0" for 2.0.

Edit: We could detect 2.0 vs 2.1 based on the contents of the weights; then we could PR a config to the original Hub repo and switch between the 2.1 and 2.0 config based on that detection. But I think it's ok the way it is, just uploading configs to the Hub for each version; we get from_pretrained support this way too.

hlky · Dec 17 '25 13:12

@hlky Ah yes, actually I only suggested it in case he also wants to align with that standard. Locally I did what you said: I removed the weights from the 2.0 refiner just to make it lighter. But in his case I don't see the need for a 2.1, just an updated 2.0 with the correct weights.

elismasilva · Dec 17 '25 13:12

Prompt: 一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。
Control image: [pose image]

Note: Parameters (and seed) match the original code, but the same prompt is used here; in the original code examples the prompt is changed between versions.

1.0: 9 steps

2.x: 25 steps

Scale: 0.75

A40 (runpod)

1.0: [output image] [00:18<00:00, 2.02s/it]
2.0: [output image] [01:21<00:00, 3.24s/it]
2.1: [output image] [01:04<00:00, 2.57s/it]

Prompt: 一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。
Control image: [pose image], Inpaint image: [source image], Mask image: [mask]

2.0: [inpaint output image] [01:22<00:00, 3.29s/it]
2.1: [inpaint output image] [01:05<00:00, 2.60s/it]

Relatively speaking, there is not much speed difference between 2.0 and 2.1. In the text-to-image case both 2.0 and 2.1 seem to perform poorly with hands; the same holds for the inpaint case, but 2.1 does better colour matching.

Also, just to note, these results do not match the results from the Hub repo, but neither do the results when using the original code; I assume a different prompt, scale or number of inference steps was used for the Hub results.

hlky · Dec 17 '25 13:12

@bot /style

yiyixuxu · Dec 17 '25 18:12

Style bot fixed some files and pushed the changes.

github-actions[bot] · Dec 17 '25 18:12