
[Qwen-image-edit] Batch Inference Issue / Feature Request

Open Django-Jiang opened this issue 2 months ago • 13 comments

Describe the bug

I am trying to run batch inference with 'Qwen-Image-Edit-2509', but it raises the following error:

AttributeError                            Traceback (most recent call last)
.venv/lib/python3.10/site-packages/diffusers/pipelines/qwenimage/pipeline_qwenimage_edit_plus.py:633
    height = height or calculated_height

AttributeError: 'tuple' object has no attribute 'size'

Reproduction

import os
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline
from diffusers.utils import load_image

pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16)
print("pipeline loaded")

pipeline.to('cuda')
pipeline.set_progress_bar_config(disable=None)
image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"),
image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"),
prompts = [
    "cinematic photo of a beautiful sunset over mountains, 35mm photograph, film, professional, 4k, highly detailed",
    "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
]

with torch.inference_mode():
    output = pipeline(
        image=[image1, image2],
        prompt=prompts,
        generator=[torch.manual_seed(0), torch.manual_seed(0)],
        true_cfg_scale=4.0,
        negative_prompt=" ",
        num_inference_steps=40,
        guidance_scale=1.0,
    )
    output_image = output.images

Logs


System Info

  • 🤗 Diffusers version: 0.36.0.dev0
  • Platform: Linux-6.8.0-63-generic-x86_64-with-glibc2.39
  • Running on Google Colab?: No
  • Python version: 3.10.18
  • PyTorch version (GPU?): 2.8.0+cu128 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.35.3
  • Transformers version: 4.57.0
  • Accelerate version: 1.10.1
  • PEFT version: not installed
  • Bitsandbytes version: not installed
  • Safetensors version: 0.6.2
  • xFormers version: not installed
  • Accelerator: 8× NVIDIA H100 NVL, 95830 MiB each

Who can help?

@naykun

Django-Jiang · Oct 10 '25 03:10

Hey @Django-Jiang, I just opened a pull request. Have a look at it.

SahilCarterr · Oct 10 '25 23:10

Thank you very much for your help!

I went through your fix, but I am not sure it does what I expected. Since this pipeline can take multiple images as conditions for a single generation, it may be complicated to handle (1) one image input per sample, (2) multiple image inputs for a single sample, and (3) multiple image inputs per sample (and even other use cases).

Basically, I am trying to test whether this pipeline can work with parallel batch inference for a speedup, as I did in #12459. If that speedup is impossible (e.g. a sequential dependency inside the model), then a batch inference wrapper for the pipeline doesn't matter much. Any insight on this would be really appreciated!

Django-Jiang · Oct 11 '25 17:10

Hey @Django-Jiang, I think the Qwen pipelines don't work well with multiple prompts of varying lengths, but you should be able to use num_images_per_prompt > 1.
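
For illustration, a call along these lines should work (a sketch reusing the objects from the reproduction script, with the trailing commas removed so image1 is a PIL image; the argument values are simply the ones from that script):

# Several edited variants for one prompt, instead of batching multiple prompts
output = pipeline(
    image=image1,              # a single condition image
    prompt=prompts[0],         # a single prompt string
    num_images_per_prompt=4,   # generate four samples for this prompt
    true_cfg_scale=4.0,
    negative_prompt=" ",
    num_inference_steps=40,
    guidance_scale=1.0,
)
images = output.images  # list of 4 PIL images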

yiyixuxu · Oct 21 '25 21:10

Thank you very much @yiyixuxu. Maybe this can become a feature request, then, for potential future support.

Django-Jiang · Oct 21 '25 23:10

Hi @sayakpaul and @yiyixuxu,

I’d like to tackle this for the MVP program. I’ve analyzed the tracebacks and the previous discussion in PR #12467.

It seems the core issue is that the pipeline lacks a unified strategy for normalizing image inputs when batch_size > 1. The previous attempt didn't fully account for the ambiguity between "batch of inputs" vs "multiple condition images per single prompt."

My implementation plan is to standardize the inputs: I will add a utility method to the pipeline that explicitly normalizes image inputs into a (batch_size, num_images_per_prompt, c, h, w) structure before they are passed to the model, which should also resolve the AttributeError.
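
As a rough illustration of the normalization I have in mind (the helper name, the nested-list intermediate representation, and the exact branching are my assumptions, not the final implementation):

from typing import List, Union
from PIL import Image

def normalize_image_inputs(
    image: Union[Image.Image, List[Image.Image], List[List[Image.Image]]],
    batch_size: int,
) -> List[List[Image.Image]]:
    """Hypothetical helper: coerce the `image` argument into a nested list of
    shape [batch_size][num_condition_images] before tensor preprocessing."""
    if isinstance(image, Image.Image):
        # one image shared across the whole batch
        return [[image] for _ in range(batch_size)]
    if isinstance(image, (list, tuple)) and all(isinstance(i, Image.Image) for i in image):
        if len(image) == batch_size:
            # one condition image per prompt (parallel batch)
            return [[img] for img in image]
        # otherwise: the whole list is a joint condition for every prompt
        return [list(image) for _ in range(batch_size)]
    # already nested: multiple condition images per sample
    return [list(sample) for sample in image]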

I can start reproducing the script now.

akshan-main · Nov 21 '25 08:11

Hi @sayakpaul and @yiyixuxu,

I would like to reiterate my interest in this issue. I have been working extensively on fine-tuning Qwen-Image-Edit (for synthetic data production), and I believe I have identified the architectural root cause that is likely complicating the batch inference logic here.

Through debugging the training pipeline, I found that the Qwen-Image VAE (and the underlying Qwen2-VL architecture) treats all inputs as video, requiring 5D tensors even for static image editing tasks (where Frame=1).

I haven't reproduced this exact AttributeError, but while fine-tuning this model, I discovered the underlying architecture enforces a strict 5D Video-like input format (B, C, T, H, W). The current pipeline likely confuses 'Batch of Images' with 'Sequence of Frames/Conditions', leading to it passing a tuple where a PIL image is expected. I can fix the batch handling logic to ensure inputs are correctly normalized to the 5D format the VAE requires.
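
To make the 5D requirement concrete, here is a minimal sketch of the kind of normalization I mean (assuming preprocessed image tensors of shape (B, C, H, W); the helper name is illustrative):

import torch

def to_video_like(images: torch.Tensor) -> torch.Tensor:
    """Add a singleton frame dimension so a batch of static images
    (B, C, H, W) matches the (B, C, T, H, W) layout the VAE expects."""
    if images.ndim == 4:
        images = images.unsqueeze(2)  # (B, C, 1, H, W): T=1 for a single frame
    return images

batch = torch.randn(2, 3, 1024, 1024)  # e.g. two RGB condition images
print(to_video_like(batch).shape)      # torch.Size([2, 3, 1, 1024, 1024])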

I have already implemented manual normalization logic for this 5D requirement in my training scripts. I can port this logic to the pipeline's preprocess method to handle the batch/condition ambiguity and ensure the VAE always receives the correct 5D shape.

I am ready to open a PR for this immediately.

akshan-main · Nov 21 '25 15:11

Hi @Django-Jiang,

I dug into the reproduction script and found the cause of the specific AttributeError: 'tuple' object has no attribute 'size'. It is a small syntax typo in the image loading lines:

The trailing comma at the end of each image1 = load_image("..."), line turns the loaded image into a single-item tuple. Removing those commas resolves the immediate crash; the corrected lines are shown below. However, getting true batch inference working correctly requires more than that.
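
Corrected loading lines from the reproduction script:

# No trailing commas: each variable is now a PIL.Image, not a 1-tuple
image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")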

As I mentioned in my previous comment, the underlying Qwen-Image-Edit architecture treats all inputs as video (requiring 5D tensors: B, C, F, H, W), even for static images. Standardizing the pipeline to handle a batch of images vs a sequence of frames is the real fix needed here to make batching robust.

I have a working implementation of this normalization logic ready. @sayakpaul & @yiyixuxu

akshan-main · Nov 21 '25 16:11

Hi @Django-Jiang,

I have opened a fix for this in PR #12698.

It resolves the crash you were seeing by:

  1. Fixing the Tuple Bug: The pipeline now handles the single-item tuple produced by the trailing commas in your load_image lines.
  2. Enabling True Batching: I implemented the 5D tensor reshaping required by the Qwen2-VL architecture, so you can now pass image=[img1, img2] and prompt=[p1, p2] for parallel inference.

I verified this on an A100, and it achieves the speedup you were looking for. Let me know if you run into any other issues with it!

Note on Batching Logic

To resolve the ambiguity between "Multi-Image Conditioning" and "Batch Inference", I implemented the following routing logic in encode_prompt (a usage sketch follows the list):

  1. Single String Prompt (prompt="string"):

    • Behavior: Joint Condition. The pipeline treats all provided images as a single context for one generation task.
    • Use Case: Style transfer or merging elements from multiple reference images.
  2. List of Prompts (prompt=["s1", "s2"]):

    • Behavior: Parallel Batch. The pipeline maps images to prompts 1-to-1.
    • Use Case: Processing a dataset (e.g., editing 50 different images with 50 different instructions at once).
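
Usage sketch of the two modes described above (assuming image1, image2, and prompts are defined as in the reproduction script, without the trailing commas; the described behavior is what the PR implements):

# Mode 1: single prompt string -> all images act as one joint condition
joint = pipeline(
    image=[image1, image2],
    prompt="merge the cat into the sunset scene, cinematic lighting",
    num_inference_steps=40,
)

# Mode 2: list of prompts -> images are mapped to prompts 1-to-1 (parallel batch)
batched = pipeline(
    image=[image1, image2],
    prompt=prompts,
    num_inference_steps=40,
)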

akshan-main · Nov 22 '25 08:11

Update on PR #12698

Just pushed a logic update. While the 5D tensor fix handled the immediate architecture mismatch, I refined how mixed-resolution batches are handled to make it cleaner. Overall, this adds a feature that lets Qwen-Image-Edit users batch their inputs to the model.

Instead of naive padding (I implemented that first, but it degrades the canvas and generation quality), I switched to a proper resizing strategy (see the sketch after this list):

  • Uniform/Single inputs: We preserve the original aspect ratio and just snap to the nearest 32px (or prioritize user input for height/width).
  • Mixed batches: We resize to a standard target (or user height/width) to guarantee the tensors stack cleanly.
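
Roughly, the size-selection logic looks like this (helper names and the 1024×1024 fallback target are illustrative, not the exact code in the PR):

def snap_to_multiple(size: int, multiple: int = 32) -> int:
    """Round a single dimension to the nearest multiple of 32 px."""
    return max(multiple, round(size / multiple) * multiple)

def plan_batch_size(sizes, target=(1024, 1024)):
    """Hypothetical helper: choose one (width, height) for the whole batch,
    given a list of per-image (width, height) tuples. Uniform batches keep
    their aspect ratio and just snap to 32 px; mixed-resolution batches fall
    back to a common target so the tensors stack cleanly."""
    if len(set(sizes)) == 1:
        width, height = sizes[0]
        return snap_to_multiple(width), snap_to_multiple(height)
    return target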

This keeps the pipeline efficient for small inputs (no unnecessary upscaling) while handling large batch variations robustly. I have thoroughly tested these functionalities and they work as intended.

Would say this has been my most productive weekend! Thanks for the opportunity!

akshan-main · Nov 24 '25 06:11

Hey @sayakpaul, it would be great if my PR could be reviewed! Also, can I consider this issue as having been assigned to me?

akshan-main · Dec 08 '25 05:12

Done! It will be reviewed soon. Apologies for the delay.

sayakpaul · Dec 08 '25 05:12

Hey @akshan-main, sorry I didn't look at the conversation here sooner. The reason we don't support batching is that the Qwen model does not handle variable prompt lengths (i.e., it doesn't pass the correct mask), so the PR is not at all the direction we need.

yiyixuxu · Dec 08 '25 08:12

Hi @yiyixuxu, I've fixed the variable prompt length issue by padding in this PR, and batching now works correctly. I also validated it in a notebook and can share the link if that helps.

In my PR, the pipeline pads both prompt_embeds and prompt_embeds_mask to a common length for batched prompts. For each prompt in the list, I compute its embeddings and mask separately, then right-pad them to the maximum sequence length in the batch before concatenation. The Qwen model therefore always receives uniform (batch_size, max_len, hidden_dim) embeddings and a matching attention mask, so variable prompt lengths no longer cause masking issues during batch inference. (In other words, fixing variable prompt lengths was necessary but not sufficient on its own; the rest of the changes in the PR address the remaining batching and input-routing issues so that batch inference works correctly in practice.)
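
A minimal sketch of that padding step (the function name and tensor shapes are illustrative; in the PR the equivalent logic lives inside encode_prompt):

import torch
import torch.nn.functional as F

def pad_and_stack(embeds_list, masks_list):
    """Right-pad each prompt's embeddings (1, seq_len, hidden_dim) and attention
    mask (1, seq_len) to the longest sequence in the batch, then concatenate
    along the batch dimension."""
    max_len = max(e.shape[1] for e in embeds_list)
    padded_embeds, padded_masks = [], []
    for e, m in zip(embeds_list, masks_list):
        pad = max_len - e.shape[1]
        padded_embeds.append(F.pad(e, (0, 0, 0, pad)))    # pad the sequence dim
        padded_masks.append(F.pad(m, (0, pad), value=0))  # padded tokens are masked out
    return torch.cat(padded_embeds, dim=0), torch.cat(padded_masks, dim=0)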

I have also mentioned this in my PR description.

akshan-main · Dec 08 '25 08:12