[Qwen-image-edit] Batch Inference Issue / Feature Request
Describe the bug
I am trying to run batch inference with 'Qwen-Image-Edit-2509', but it raises an error during inference:
AttributeError Traceback (most recent call last)
.venv/lib/python3.10/site-packages/diffusers/pipelines/qwenimage/pipeline_qwenimage_edit_plus.py:633
    height = height or calculated_height
AttributeError: 'tuple' object has no attribute 'size'
Reproduction
import os
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline
from diffusers.utils import load_image
pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16)
print("pipeline loaded")
pipeline.to('cuda')
pipeline.set_progress_bar_config(disable=None)
image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"),
image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"),
prompts = [
    "cinematic photo of a beautiful sunset over mountains, 35mm photograph, film, professional, 4k, highly detailed",
    "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
]
with torch.inference_mode():
    output = pipeline(
        image=[image1, image2],
        prompt=prompts,
        generator=[torch.manual_seed(0), torch.manual_seed(0)],
        true_cfg_scale=4.0,
        negative_prompt=" ",
        num_inference_steps=40,
        guidance_scale=1.0,
    )
output_image = output.images
Logs
System Info
- 🤗 Diffusers version: 0.36.0.dev0
- Platform: Linux-6.8.0-63-generic-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.10.18
- PyTorch version (GPU?): 2.8.0+cu128 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.35.3
- Transformers version: 4.57.0
- Accelerate version: 1.10.1
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.6.2
- xFormers version: not installed
- Accelerator: NVIDIA H100 NVL, 95830 MiB (×8)
Who can help?
@naykun
Hey @Django-Jiang, I just opened a pull request. Have a look at it.
Thank you very much for your help!
I went through your fix, but I am not sure it is what I expected. Since this pipeline supports multiple condition images per generation, it may be complicated to handle (1) one image input per sample, (2) multiple image inputs for one sample, and (3) multiple image inputs per sample (and even other use cases).
Basically, I am trying to test whether this pipeline can do parallel batch inference for a speedup, as I did in #12459. But if the speedup is impossible anyway (e.g., a sequential dependency inside the model), then a batch-inference wrapper for the pipeline doesn't matter much. Any insight about this would be really appreciated!
Hey @Django-Jiang, I think the Qwen pipelines don't work well with multiple prompts of varying lengths, but you should be able to use num_images_per_prompt > 1.
Thank you very much @yiyixuxu. Maybe this can be turned into a feature request for potential future support.
Hi @sayakpaul and @yiyixuxu,
I’d like to tackle this for the MVP program. I’ve analyzed the tracebacks and the previous discussion in PR #12467.
It seems the core issue is that the pipeline lacks a unified strategy for normalizing image inputs when batch_size > 1. The previous attempt didn't fully account for the ambiguity between "batch of inputs" vs "multiple condition images per single prompt."
My implementation plan is to standardize inputs: I will implement a utility method in the pipeline that explicitly normalizes inputs into a (batch_size, num_images_per_prompt, c, h, w) tensor structure before passing them to the model, and I will ensure the AttributeError is resolved.
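For context, a minimal sketch of the kind of normalization helper I have in mind (the helper name and the exact routing rules are my own illustration, not the final pipeline API):

from PIL import Image

def normalize_image_inputs(image, batch_size):
    # Hypothetical helper: coerce the accepted `image` layouts into a nested list
    # with one inner list of condition images per prompt in the batch.
    if isinstance(image, Image.Image):
        # A single image shared by every prompt in the batch.
        return [[image] for _ in range(batch_size)]
    if isinstance(image, (list, tuple)) and all(isinstance(i, Image.Image) for i in image):
        if len(image) == batch_size:
            # Case (1): one condition image per prompt.
            return [[i] for i in image]
        # Case (2): several condition images conditioning a single prompt.
        return [list(image) for _ in range(batch_size)]
    if isinstance(image, (list, tuple)) and all(isinstance(i, (list, tuple)) for i in image):
        # Case (3): an explicit per-prompt list of condition images.
        return [list(i) for i in image]
    raise ValueError("Unsupported `image` layout for batched inference.")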
I can start reproducing the script now.
Hi @sayakpaul and @yiyixuxu,
I would like to reiterate my interest in this issue. I have been working extensively on fine-tuning Qwen-Image-Edit (for synthetic data production), and I believe I have identified the architectural root cause that is likely complicating the batch inference logic here.
Through debugging the training pipeline, I found that the Qwen-Image VAE (and the underlying Qwen2-VL architecture) treats all inputs as video, requiring 5D tensors even for static image editing tasks (where Frame=1).
I haven't reproduced this exact AttributeError, but while fine-tuning this model, I discovered the underlying architecture enforces a strict 5D video-like input format (B, C, T, H, W). The current pipeline likely confuses 'Batch of Images' with 'Sequence of Frames/Conditions', causing it to pass a tuple where a PIL image is expected. I can fix the batch handling logic to ensure inputs are correctly normalized to the 5D format the VAE requires.
I have already implemented manual normalization logic for this 5D requirement in my training scripts. I can port this logic to the pipeline's preprocess method to handle the batch/condition ambiguity and ensure the VAE always receives the correct 5D shape.
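As an illustration, this is roughly the normalization I use in my training scripts (my own standalone helper, assuming images have already been preprocessed into tensors with matching spatial sizes):

import torch

def to_video_like(pixel_values: torch.Tensor) -> torch.Tensor:
    # The Qwen-Image VAE expects video-like 5D input (B, C, F, H, W),
    # so a batch of static images needs an explicit frame axis with F=1.
    if pixel_values.ndim == 3:        # (C, H, W): a single image without a batch dim
        pixel_values = pixel_values.unsqueeze(0)
    if pixel_values.ndim == 4:        # (B, C, H, W): insert the frame dimension
        pixel_values = pixel_values.unsqueeze(2)
    assert pixel_values.ndim == 5, "expected (B, C, F, H, W)"
    return pixel_values

batch = torch.randn(2, 3, 512, 512)   # two RGB images
print(to_video_like(batch).shape)     # torch.Size([2, 3, 1, 512, 512])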
I am ready to open a PR for this immediately.
Hi @Django-Jiang,
I dug into the reproduction script and found the cause of the specific AttributeError: 'tuple' object has no attribute 'size'. It is a small syntax typo in the image loading lines:
The trailing comma in `image1 = load_image("..."),` converts the image object into a single-item tuple. Removing those commas resolves the immediate crash. However, getting true batch inference working correctly requires more than just that.
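For reference, the corrected loading lines (no trailing commas):

image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")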
As I mentioned in my previous comment, the underlying Qwen-Image-Edit architecture treats all inputs as video (requiring 5D tensors: B, C, F, H, W), even for static images. Standardizing the pipeline to handle a batch of images vs a sequence of frames is the real fix needed here to make batching robust.
I have a working implementation of this normalization logic ready. @sayakpaul & @yiyixuxu
Hi @Django-Jiang,
I have opened a fix for this in PR #12698.
It resolves the crash you were seeing by:
- Fixing the Tuple Bug: the pipeline now properly handles the trailing comma in your `load_image` line.
- Enabling True Batching: I implemented the 5D tensor reshaping required by the Qwen2-VL architecture, so you can now pass `image=[img1, img2]` and `prompt=[p1, p2]` for parallel inference.
I verified this on an A100, and it achieves the speedup you were looking for. Let me know if you run into any other issues with it!
Note on Batching Logic
To resolve the ambiguity between "Multi-Image Conditioning" and "Batch Inference", I implemented the following routing logic in encode_prompt:
- Single String Prompt (`prompt="string"`):
  - Behavior: Joint Condition. The pipeline treats all provided images as a single context for one generation task.
  - Use Case: Style transfer or merging elements from multiple reference images.
- List of Prompts (`prompt=["s1", "s2"]`):
  - Behavior: Parallel Batch. The pipeline maps images to prompts 1-to-1.
  - Use Case: Processing a dataset (e.g., editing 50 different images with 50 different instructions at once).
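To make the two modes concrete, here is a usage sketch under the semantics proposed in the PR (the image variables and prompts are placeholders):

# Joint condition: a single prompt, all provided images condition one generation.
merged = pipeline(
    image=[img_a, img_b],
    prompt="place the subject of the first image into the scene of the second",
    num_inference_steps=40,
).images

# Parallel batch: prompts and images are mapped 1-to-1 for independent edits.
batched = pipeline(
    image=[img_a, img_b],
    prompt=["make it a sunset scene", "give the cat a tiny hat"],
    num_inference_steps=40,
).images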
Update on PR #12698
Just pushed a logic update. While the 5D tensor fix handled the immediate architecture mismatch, I refined how mixed-resolution batches are handled to make it cleaner. Overall: I added a feature through which Qwen-Image-Edit users can now batch inputs to the model.
Instead of naive padding, which messes with the canvas/generation quality (I implemented padding as well, but thought I could do better), I switched to a proper resizing strategy, sketched after the list below:
- Uniform/single inputs: preserve the original aspect ratio and just snap to the nearest 32 px (or prioritize the user-supplied `height`/`width`).
- Mixed batches: resize to a standard target (or the user-supplied `height`/`width`) to guarantee the tensors stack cleanly.
This keeps the pipeline efficient for small inputs (no unnecessary upscaling) while handling large batch variations robustly. I have thoroughly tested these functionalities and they work as intended.
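A minimal sketch of the snap-to-nearest-32 idea (a standalone helper of my own for illustration; the PR integrates this into the pipeline's preprocessing):

from PIL import Image

def snap_resize(image: Image.Image, multiple: int = 32) -> Image.Image:
    # Round each side to the nearest multiple of `multiple`, keeping the aspect
    # ratio close to the original so batched tensors stack without padding.
    w, h = image.size
    new_w = max(multiple, round(w / multiple) * multiple)
    new_h = max(multiple, round(h / multiple) * multiple)
    return image.resize((new_w, new_h), Image.LANCZOS)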
Would say this has been my most productive weekend! Thanks for the opportunity!
Hey @sayakpaul, it would be great if my PR could be reviewed! Also, can I consider this as having been assigned to me?
Done! It will be reviewed soon. Apologies for the delay.
Hey @akshan-main, sorry I didn't look at the conversation here sooner. The reason we don't support batching is that the Qwen model does not handle varying prompt lengths (i.e., it doesn't pass the correct mask), so the PR is not at all the direction we need.
Hi @yiyixuxu, I've fixed the varying-prompt-length issue by padding in this PR, and batching now works correctly. I also validated it in a notebook and can share the link if that helps.
In my PR, the pipeline pads both prompt_embeds and prompt_embeds_mask to a common length for batched prompts. For each prompt in the list, I compute its embeddings and mask separately, then right-pad them to the maximum sequence length in the batch before concatenation. Now, the Qwen model always receives uniform (batch_size, max_len, hidden_dim) embeddings and a matching attention mask, so variant prompt lengths no longer cause masking issues during batch inference. (In other words, fixing variant prompt lengths was necessary but not sufficient on its own. The rest of the changes in this PR address the remaining batching and input-routing issues so that batch inference now works correctly in practice.)
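Roughly, the padding step looks like this (a standalone sketch with hypothetical names; in the PR it lives inside encode_prompt):

import torch
import torch.nn.functional as F

def pad_and_stack(embeds_list, masks_list):
    # embeds_list: per-prompt tensors of shape (1, seq_len_i, hidden_dim)
    # masks_list:  per-prompt tensors of shape (1, seq_len_i)
    max_len = max(e.shape[1] for e in embeds_list)
    padded_embeds, padded_masks = [], []
    for e, m in zip(embeds_list, masks_list):
        pad = max_len - e.shape[1]
        padded_embeds.append(F.pad(e, (0, 0, 0, pad)))  # right-pad the sequence dim
        padded_masks.append(F.pad(m, (0, pad)))          # padded positions get mask 0
    # Returns (batch_size, max_len, hidden_dim) embeddings and a matching mask.
    return torch.cat(padded_embeds, dim=0), torch.cat(padded_masks, dim=0)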
I have also mentioned this in my PR description.