
Tensor size mismatch while training Qwen Image Edit 2509 with batch size > 1

Open pft-JoeyYang opened this issue 1 month ago • 2 comments

This is for bugs only

Did you already ask in the discord?

Yes/No

You verified that this is a bug and not a feature request or question by asking in the discord?

Yes/No

Describe the bug

Got the following error while training Qwen-Image-Edit-2509 with batch_size > 1:

========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "/data/ai-toolkit/run.py", line 120, in <module>
    main()
  File "/data/ai-toolkit/run.py", line 108, in main
    raise e
  File "/data/ai-toolkit/run.py", line 96, in main
    job.run()
  File "/data/ai-toolkit/jobs/ExtensionJob.py", line 22, in run
    process.run()
  File "/data/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 2208, in run
    batch = next(dataloader_iterator)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 733, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
    return self._process_data(data, worker_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
    data.reraise()
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/_utils.py", line 750, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/toolkit/data_loader.py", line 642, in dto_collation
    batch = DataLoaderBatchDTO(
            ^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/toolkit/data_transfer_object/data_loader.py", line 306, in __init__
    raise e
  File "/data/ai-toolkit/toolkit/data_transfer_object/data_loader.py", line 180, in __init__
    self.control_tensor = torch.cat([x.unsqueeze(0) for x in control_tensors])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1851 but got size 1819 for tensor number 1 in the list.
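The final error is just torch.cat refusing to stack tensors whose non-batch dimensions differ. A minimal sketch reproducing it outside the toolkit (shapes are arbitrary, chosen to mirror the 1851 vs. 1819 mismatch; this is not the toolkit's code):

import torch

# Two control tensors whose last dimension differs, as happens when two
# control images in the same batch end up with different resolutions.
a = torch.randn(3, 64, 1851)
b = torch.randn(3, 64, 1819)

# Same operation as data_transfer_object/data_loader.py line 180; raises
# "Sizes of tensors must match except in dimension 0 ..."
torch.cat([x.unsqueeze(0) for x in (a, b)])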

The following is my training config:

job: "extension"
config:
  name: "qwen_image_edit_v3.7"
  process:
    - type: "diffusion_trainer"
      training_folder: "/data/ai-toolkit/output"
      sqlite_db_path: "./aitk_db.db"
      device: "cuda"
      trigger_word: null
      performance_log_every: 10
      network:
        type: "lora"
        linear: 32
        linear_alpha: 32
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: "bf16"
        save_every: 250
        max_step_saves_to_keep: 4
        save_format: "diffusers"
        push_to_hub: false
      datasets:
        - folder_path: "/data/ai-toolkit/datasets/target_v2_0"
          mask_path: null
          mask_min_value: 0.1
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: false
          is_reg: false
          network_weight: 1
          resolution:
            - 1024
            - 512
            - 768
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false
          control_path_1: "/data/ai-toolkit/datasets/masked_v2_0"
      train:
        batch_size: 8
        bypass_guidance_embedding: false
        steps: 6000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        timestep_type: "weighted"
        content_or_style: "balanced"
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: false
        force_first_sample: false
        disable_sampling: false
        dtype: "bf16"
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: "person"
        switch_boundary_every: 1
        loss_type: "mse"
      model:
        name_or_path: "Qwen/Qwen-Image-Edit-2509"
        quantize: true
        qtype: "qfloat8"
        quantize_te: true
        qtype_te: "qfloat8"
        arch: "qwen_image_edit_plus"
        low_vram: false
        model_kwargs:
          match_target_res: false
        layer_offloading: false
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
        neg: ""
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 25
        num_frames: 1
        fps: 1
meta:
  name: "[name]"
  version: "1.0"

pft-JoeyYang avatar Oct 29 '25 05:10 pft-JoeyYang

same error

Keith-Hon avatar Nov 06 '25 23:11 Keith-Hon

Hi, I got the same error and figured out the cause: the control images in your batch of 8 have different resolutions, so the control tensors have mismatched spatial dimensions when the collate function tries to concatenate them. You either need to use batch size 1 or resize the control images to a shared resolution, for example with a preprocessing pass like the one sketched below.
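
A minimal preprocessing sketch that forces every control image to one shared resolution (the folder path is control_path_1 from the config above; the 1024x1024 target size, the file-extension filter, and overwriting in place are all assumptions you may want to change):

import os
from PIL import Image

control_dir = "/data/ai-toolkit/datasets/masked_v2_0"  # control_path_1 from the config
shared_size = (1024, 1024)                              # arbitrary shared resolution

for fname in os.listdir(control_dir):
    if not fname.lower().endswith((".png", ".jpg", ".jpeg", ".webp")):
        continue
    path = os.path.join(control_dir, fname)
    img = Image.open(path)
    if img.size != shared_size:
        # Resize in place so every collated control tensor has the same shape
        img.resize(shared_size, Image.LANCZOS).save(path)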

aminamazlin avatar Dec 06 '25 15:12 aminamazlin