
[z-image Turbo] Concept-slider : error "Batch size of latents must be the same or half the batch size of text embeddings"

Open ZeTofZone opened this issue 2 weeks ago • 18 comments

This is for bugs only

Did you already ask in the discord?

No - I'm not using discord.

You verified that this is a bug and not a feature request or question by asking in the discord?

No - I'm not using discord.

Describe the bug

The error message comes from "/toolkit/models/base_model.py", lines 817-818.

I spent some hours debugging and had to change line 265 in "/toolkit/prompt_utils.py" to make it work: replace text_embeds = embed_list with text_embeds = padded.

Like this:

# text_embeds = embed_list
text_embeds = padded

Seems to work in my first test, but I don't know what I may have broken elsewhere...
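For context, here is a minimal sketch of what that change likely amounts to (the names embed_list and padded come from the report; the surrounding helper is assumed, not the toolkit's actual prompt_utils.py): per-prompt embeddings with different sequence lengths are padded to the longest one and stacked into a single batched tensor, instead of being returned as a plain list.

```python
import torch
import torch.nn.functional as F

def stack_text_embeds(embed_list):
    """Hypothetical helper illustrating the reported change: pad per-prompt
    embeddings of shape [seq_len_i, dim] to a common length and stack them
    into one [batch, max_len, dim] tensor."""
    max_len = max(e.shape[0] for e in embed_list)
    padded = torch.stack(
        [F.pad(e, (0, 0, 0, max_len - e.shape[0])) for e in embed_list], dim=0
    )
    # text_embeds = embed_list   # old line: a plain list, which downstream batch checks misread
    text_embeds = padded         # new line: one tensor with a proper batch dimension
    return text_embeds

# Example: two prompts encoded to different sequence lengths.
embeds = [torch.randn(12, 2048), torch.randn(30, 2048)]
print(stack_text_embeds(embeds).shape)  # torch.Size([2, 30, 2048])
```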

EDIT: given the answers below, I should add that I only have 8 GB of VRAM, so I'm not using a batch size > 1, and I should be using "Unload TE".

ZeTofZone avatar Dec 02 '25 00:12 ZeTofZone

Saw this as well when batch size / gradient_accumulation was set > 1.

siraxe avatar Dec 02 '25 02:12 siraxe

Did you already ask in the discord?

Yes, but did not find any conclusion on that issue

You verified that this is a bug and not a feature request or question by asking in the discord?

yes.

After adding a few logs, I found that latents.shape[0] * 2 == 4 while te_batch_size == 1.
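For reference, the guard that produces this error presumably looks something like the following (a sketch reconstructed from the error message and the values above, not the verbatim code in toolkit/models/base_model.py): with latents batched to 2 and text embeddings batched to 1, neither the "same" nor the "half" condition holds.

```python
import torch

def check_batch_sizes(latents: torch.Tensor, text_embeddings: torch.Tensor) -> None:
    """Sketch of the kind of check behind this error (assumed, not verbatim)."""
    latent_bs = latents.shape[0]
    te_bs = text_embeddings.shape[0]
    # Latents may match the text-embedding batch exactly, or be half of it
    # (the case where cond/uncond embeddings were concatenated together).
    if latent_bs != te_bs and latent_bs * 2 != te_bs:
        raise ValueError(
            "Batch size of latents must be the same or half the batch size of text embeddings"
        )

# The failing case reported above: latent batch 2, text-embedding batch 1.
try:
    check_batch_sizes(torch.zeros(2, 16, 64, 64), torch.zeros(1, 77, 2048))
except ValueError as e:
    print(e)
```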

zmq175 avatar Dec 02 '25 12:12 zmq175

For the batch size > 1 case, I noticed this seems to only occur if text embedding caching is on.

Flipping off Cache Text Embeddings fixes the issue for me (even with batch size > 1) but results in higher VRAM usage.
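For anyone looking for where this lives in the job config, the relevant keys sit under train: (the same keys appear in the full YAML posted below); the values shown here are only illustrative:

```yaml
train:
  batch_size: 2                 # batch > 1 is where the error tends to show up
  gradient_accumulation: 1
  cache_text_embeddings: false  # turning the cache off avoids the error, at the cost of VRAM
  unload_text_encoder: false    # "Unload TE" in the UI
```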

isaac-mcfadyen avatar Dec 02 '25 13:12 isaac-mcfadyen

For the batch size > 1 case, I noticed this seems to only occur if text embedding caching is on.

Flipping off Cache Text Embeddings fixes the issue for me (even with batch size > 1) but results in higher VRAM usage.

Still fails to train a concept LoRA. My configuration:

---
job: "extension"
config:
  name: "Breast_Slider"
  process:
    - type: "concept_slider"
      training_folder: "/root/ai-toolkit/output"
      sqlite_db_path: "./aitk_db.db"
      device: "cuda"
      trigger_word: null
      performance_log_every: 10
      network:
        type: "lora"
        linear: 4
        linear_alpha: 4
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: "bf16"
        save_every: 25
        max_step_saves_to_keep: 4
        save_format: "diffusers"
        push_to_hub: false
      datasets:
        - folder_path: "/root/ai-toolkit/datasets/us"
          mask_path: null
          mask_min_value: 0.1
          default_caption: ""
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: false
          is_reg: false
          network_weight: 1
          resolution:
            - 512
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 300
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        timestep_type: "weighted"
        content_or_style: "balanced"
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: false
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: false
        force_first_sample: false
        disable_sampling: false
        dtype: "bf16"
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: "person"
        switch_boundary_every: 1
        loss_type: "mse"
      model:
        name_or_path: "Tongyi-MAI/Z-Image-Turbo"
        quantize: true
        qtype: "qfloat8"
        quantize_te: true
        qtype_te: "qfloat8"
        arch: "zimage:turbo"
        low_vram: false
        model_kwargs: {}
        layer_offloading: false
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
        assistant_lora_path: "ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v2.safetensors"
      sample:
        sampler: "flowmatch"
        sample_every: 25
        width: 1024
        height: 1024
        samples:
          - prompt: "woman with red hair, playing chess at the park, bomb going off in the background"
            seed: 42
            network_multiplier: "-2.0"
          - prompt: "woman with red hair, playing chess at the park, bomb going off in the background"
            seed: 42
            network_multiplier: "-1.0"
          - prompt: "woman with red hair, playing chess at the park, bomb going off in the background"
            seed: 42
            network_multiplier: "-0.5"
          - prompt: "woman with red hair, playing chess at the park, bomb going off in the background"
            seed: 42
            network_multiplier: "0.5"
          - prompt: "woman with red hair, playing chess at the park, bomb going off in the background"
            seed: 42
            network_multiplier: "1.0"
          - prompt: "woman with red hair, playing chess at the park, bomb going off in the background"
            seed: 42
            network_multiplier: "2.0"
        neg: ""
        seed: 42
        walk_seed: true
        guidance_scale: 1
        sample_steps: 8
        num_frames: 1
        fps: 1
      slider:
        guidance_strength: 3
        anchor_strength: 1
        positive_prompt: "person who has gigantic breasts"
        negative_prompt: "person who has flat chest"
        target_class: "person"
        anchor_class: ""
meta:
  name: "[name]"
  version: "1.0"
Error running job: Batch size of latents must be the same or half the batch size of text embeddings
========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "/root/ai-toolkit/run.py", line 120, in <module>
    main()
    ~~~~^^
  File "/root/ai-toolkit/run.py", line 108, in main
    raise e
  File "/root/ai-toolkit/run.py", line 96, in main
    job.run()
    ~~~~~~~^^
  File "/root/ai-toolkit/jobs/ExtensionJob.py", line 22, in run
    process.run()
    ~~~~~~~~~~~^^
  File "/root/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 2162, in run
    loss_dict = self.hook_train_loop(batch_list)
  File "/root/ai-toolkit/extensions_built_in/sd_trainer/SDTrainer.py", line 2055, in hook_train_loop
    loss = self.train_single_accumulation(batch)
  File "/root/ai-toolkit/extensions_built_in/sd_trainer/SDTrainer.py", line 1922, in train_single_accumulation
    loss = self.get_guided_loss(
        noisy_latents=noisy_latents,
    ...<9 lines>...
        prior_pred=prior_pred,
    )
  File "/root/ai-toolkit/extensions_built_in/concept_slider/ConceptSliderTrainer.py", line 157, in get_guided_loss
    combo_pred = self.sd.predict_noise(
        latents=torch.cat([noisy_latents] * num_embeds, dim=0),
    ...<4 lines>...
        batch=batch,
    )
  File "/root/ai-toolkit/toolkit/models/base_model.py", line 817, in predict_noise
    raise ValueError(
        "Batch size of latents must be the same or half the batch size of text embeddings")
ValueError: Batch size of latents must be the same or half the batch size of text embeddings
Breast_Slider:   0%|          | 0/300 [00:00<?, ?it/s]

zmq175 avatar Dec 02 '25 14:12 zmq175

This is for bugs only

Did you already ask in the discord?

No - I'm not using discord.

You verified that this is a bug and not a feature request or question by asking in the discord?

No - I'm not using discord.

Describe the bug

The error message comes from "/toolkit/models/base_model.py", lines 817-818.

I spent some hours debugging and had to change line 265 in "/toolkit/prompt_utils.py" to make it work: replace text_embeds = embed_list with text_embeds = padded.

Like this:

# text_embeds = embed_list
text_embeds = padded

Seems to work in my first test, but I didn't know what I have broken elsewhere......

This did work for me. huge thanks!

zmq175 avatar Dec 02 '25 14:12 zmq175

This is for bugs only

Did you already ask in the discord? No - I'm not using discord. You verified that this is a bug and not a feature request or question by asking in the discord? No - I'm not using discord.

Describe the bug

The error message comes from "/toolkit/models/base_model.py", lines 817-818. I spent some hours debugging and had to change line 265 in "/toolkit/prompt_utils.py" to make it work: replace text_embeds = embed_list with text_embeds = padded. Like this:

# text_embeds = embed_list
text_embeds = padded

Seems to work in my first test, but I didn't know what I have broken elsewhere......

This did work for me. huge thanks!

Great 👍

Take care, as this has almost certainly broken something else! So if you get another error when training with another checkpoint or a normal LoRA (or the LoRA doesn't seem to work, or works poorly), don't forget to roll back this line.

ZeTofZone avatar Dec 02 '25 16:12 ZeTofZone

isaac-mcfadyen was right, it's the Cache Text Embeddings setting; it works properly without it.

siraxe avatar Dec 02 '25 17:12 siraxe

I've edited my post :

As I have only 8 GB of VRAM, I'm using a batch size of 1 and I should use "Unload TE", as Ostris explained in his "lora slider training" video. I'm not using "Cache text embeddings".

ZeTofZone avatar Dec 02 '25 19:12 ZeTofZone

I'm running into the exact same issue. I've tried a bunch of different setups, but I still can't get a concept slider to train with Z-Image; it keeps throwing the same exception every time.

nonom avatar Dec 03 '25 00:12 nonom

I spent some hours debugging and had to change line 265 in "/toolkit/prompt_utils.py" to make it work: replace text_embeds = embed_list with text_embeds = padded.

Like this:

# text_embeds = embed_list
text_embeds = padded

Seems to work in my first test, but I didn't know what I have broken elsewhere......

This also fixes batch > 1 in regular training. Thanks.

mookiexl avatar Dec 03 '25 01:12 mookiexl

I'm also running into this issue. RTX 3090 on Z-Image training a Slider lora. I've been trying every single possible combination of batch sizes, and everything else. I even tried changing hidden settings when LLMs were certain they knew the problem. I saw the fix here, but I don't want my lora to be broken or just made semi-crappy by a weird setting. Has this issue been recognized as a common issue?

Jellybit avatar Dec 04 '25 05:12 Jellybit

The bug isn’t in the training itself. The issue is that there’s a condition that throws an exception, preventing it from even starting. It's probably just a temporary fix, but after trying it out I was able to start playing with the concept slider. Thank you.

nonom avatar Dec 05 '25 23:12 nonom

This issue is caused by the captions not being auto-padded. What it's referring to is the fact that it can't make a batch (more than 1 image in a training step) because the captions don't all have the exact same token count for the text encoder. This is something that is handled automatically for all model trainings, but I believe Qwen 3 4b VL makes it a lot more complicated because of how LLMs encode tokens compared to CLIP.

This is also why the issue doesn't happen if you use no captions, or just a trigger word, because all of the images then have the same text encoder latent size.
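As a standalone illustration of that point (this uses a generic Hugging Face tokenizer purely as an example, not the toolkit's actual Z-Image / Qwen encoding path): captions of different lengths produce different token counts, and they only batch cleanly once padded to the longest item.

```python
from transformers import AutoTokenizer

# Generic tokenizer, used only to illustrate variable-length token counts.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

captions = [
    "a photo of a person",
    "a photo of a person standing in a park at sunset, wearing a red coat",
]

# Without padding, each caption has its own token count, so the per-caption
# embeddings cannot be stacked into a single batch tensor.
print([len(tok(c)["input_ids"]) for c in captions])  # e.g. [7, 18] -- unequal

# Padding to the longest caption gives every item the same sequence length,
# which is what batching (and the latent/text-embed size check) expects.
batch = tok(captions, padding="longest", return_tensors="pt")
print(batch["input_ids"].shape)  # [2, max_len]
```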

SytanSD avatar Dec 06 '25 13:12 SytanSD

This is for bugs only

Did you already ask in the discord? No - I'm not using discord. You verified that this is a bug and not a feature request or question by asking in the discord? No - I'm not using discord.

Describe the bug

The error message comes from "/toolkit/models/base_model.py", lines 817-818. I spent some hours debugging and had to change line 265 in "/toolkit/prompt_utils.py" to make it work: replace text_embeds = embed_list with text_embeds = padded. Like this:

# text_embeds = embed_list
text_embeds = padded

Seems to work in my first test, but I didn't know what I have broken elsewhere......

This did work for me. huge thanks!

While this did make the error go away, it seemed to slow down training significantly.

shyt47 avatar Dec 09 '25 22:12 shyt47

Vibe debugging says the issue is in this part of SDTrainer.py:

[screenshot of the relevant SDTrainer.py code]

duplicates embedding to use when caching
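Since the screenshot didn't survive, here is roughly the kind of change being described, as I understand it (an assumption based on the comment above, not the actual SDTrainer.py code): when a cached text embedding comes back with batch size 1 but the latents in the step have a larger batch, repeat it along the batch dimension so the sizes line up.

```python
import torch

def match_text_embed_batch(text_embeds: torch.Tensor, latents: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: duplicate a cached text embedding so its batch
    dimension matches the latent batch dimension (or a clean multiple of it)."""
    latent_bs, te_bs = latents.shape[0], text_embeds.shape[0]
    if te_bs < latent_bs and latent_bs % te_bs == 0:
        text_embeds = text_embeds.repeat_interleave(latent_bs // te_bs, dim=0)
    return text_embeds

# Cached embedding for one prompt, reused for a latent batch of 2.
print(match_text_embed_batch(torch.randn(1, 77, 2048), torch.randn(2, 16, 64, 64)).shape)
# torch.Size([2, 77, 2048])
```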

siraxe avatar Dec 16 '25 17:12 siraxe

vibe debugging says issue is this SDTrainer.py

[screenshot] duplicates embedding to use when caching

I can say that using that fix, along with batch = 1, "Unload TE", and changing line 265 in prompt_utils.py to text_embeds = padded, worked for me. BUT it's 10x slower: I went from 2 s/it to 20 s/it...

Poukpalaova avatar Dec 16 '25 23:12 Poukpalaova

Chiming in to say this issue is happening to me as well. Default settings for a ZIT lora w/ adapter. Have no problem with other models, but ZIT with adapter and ZIT De-Turbo throw this exception. Seems like it is a ZIT / Qwen 3 4b VL issue, like SytanSD mentioned.

JetsonFlyers avatar Dec 17 '25 00:12 JetsonFlyers