Fix qwen encoder hidden states mask
What does this PR do?
Fixes the QwenImage encoder to properly apply `encoder_hidden_states_mask` when passed to the model. Previously, the mask parameter was accepted but ignored, causing padding tokens to incorrectly influence attention computation.
Changes
- Attention mask application: Modified `QwenDoubleStreamAttnProcessor2_0` to create a 2D attention mask from the 1D `encoder_hidden_states_mask`, properly masking text padding tokens while keeping all image tokens unmasked (see the sketch after this list)
- RoPE adjustment: Updated positional embedding computation to use the full padded sequence length when a mask is present, ensuring correct position indices
- Tests: Added comprehensive tests validating that:
- Padding tokens are properly isolated and don't affect outputs
- Masked outputs differ significantly from unmasked outputs
- Benchmarks: Included performance analysis showing acceptable overhead (<20% for inference, ~19% for training scenarios)
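A rough illustration of the attention-mask change (a minimal sketch; the tensor names, dimensions, and text/image ordering are assumptions, not the actual processor code):

```python
import torch

batch, txt_len, img_len = 2, 77, 1024

# Hypothetical 1D text mask: True for real tokens, False for padding.
encoder_hidden_states_mask = torch.ones(batch, txt_len, dtype=torch.bool)
encoder_hidden_states_mask[0, 50:] = False  # first prompt is shorter

# Image tokens are never padded, so they stay fully unmasked.
image_mask = torch.ones(batch, img_len, dtype=torch.bool)

# Joint key mask over the concatenated [text, image] sequence, expanded to
# the 2D (query x key) form that the attention computation consumes.
joint_key_mask = torch.cat([encoder_hidden_states_mask, image_mask], dim=1)
attn_mask = joint_key_mask[:, None, :].expand(-1, txt_len + img_len, -1)
print(attn_mask.shape)  # torch.Size([2, 1101, 1101])
```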
Impact
This fix enables proper Classifier-Free Guidance (CFG) batching with variable-length text sequences, which is common when batching conditional and unconditional prompts together.
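For instance, in a CFG batch the conditional and unconditional prompts typically have different token counts, so the shorter one is padded and the mask marks the padding (a hedged sketch; the lengths are made up):

```python
import torch

cond_len, uncond_len = 14, 7
max_len = max(cond_len, uncond_len)

# True = real token, False = padding token that should be ignored by attention.
encoder_hidden_states_mask = torch.zeros(2, max_len, dtype=torch.bool)
encoder_hidden_states_mask[0, :cond_len] = True    # conditional prompt
encoder_hidden_states_mask[1, :uncond_len] = True  # unconditional prompt, padded
```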
Benchmark Results
| Scenario | Latency (ms) | Peak Memory (MB) | Throughput (iter/s) |
|---|---|---|---|
| Baseline (no mask) | 11.68 ± 0.23 | 301.5 | 84.70 |
| Mask all-ones (no padding) | 12.01 ± 0.26 | 301.5 | 82.34 |
| Mask with padding (CFG) | 13.86 ± 0.24 | 301.5 | 71.42 |
Overhead: +2.8% for mask processing without padding, +18.7% with actual padding (realistic CFG scenario)
The higher overhead with padding is expected and acceptable, as it represents the cost of properly handling variable-length sequences in batched inference. This is a necessary correctness fix rather than an optimization. Tests were run on an RTX 4070 12GB.
Fixes #12294
Before submitting
- [x] This PR fixes a bug in the code
- [x] This PR adds tests that verify the fix
- [x] This PR includes benchmarks demonstrating performance impact
- [x] Did you write any new necessary tests?
Who can review?
@yiyixuxu @sayakpaul - Would appreciate your review, especially regarding the benchmarking approach. I used a custom benchmark rather than BenchmarkMixin because:
- This tests a specific bug fix (mask application) rather than optimization strategies
- The fix uses synthetic models to isolate the mask handling logic
- Standard benchmarks focus on pretrained model performance with different quantization/offloading strategies
- The metrics needed are different (latency distribution, throughput) vs standard format (compile/plain time comparison)
Note: The benchmark file is named `benchmarking_qwenimage_mask.py` (with a "benchmarking" prefix) rather than `benchmark_qwenimage_mask.py` to prevent it from being picked up by `run_all.py`, since it doesn't use `BenchmarkMixin` and produces a different CSV schema. If you prefer, I can adapt it to use the standard format instead.
Happy to adjust the approach if you have suggestions!
@cdutr it's great that you have also included the benchmarking script for full transparency. But we can remove that from this PR and instead have it as a GitHub gist.
The benchmark numbers make sense to me. Some comments:
- Could we also check the performance with `torch.compile`?
- Could we also see some image outputs with and without the changes introduced in this PR?
Also, I think a natural next step would be to see how well this performs when combined with FA varlen. WDYT?
@naykun what do you think about the changes?
Thanks @sayakpaul! I removed the benchmark script and moved all tests to this gist.
`torch.compile` test
I also tested the performance with `torch.compile`; the results were similar, and the details are below.
Tested on NVIDIA A100 80GB PCIe:
- Without compile: 4.70 ms per iteration
- With compile: 1.93 ms per iteration
- Speedup: 2.43x
- Compilation overhead: 7.25 s (one-time cost)
Also validated on RTX 4050 6GB (laptop) with similar results (2.38x speedup). The mask implementation is fully compatible with `torch.compile`.
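For context, the comparison can be timed with a small harness like the following (a minimal, self-contained sketch; a tiny stand-in module replaces the real transformer, and the actual script is in the gist):

```python
import time
import torch

# Stand-in module instead of the QwenImage transformer, to keep the sketch runnable.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
x = torch.randn(2, 192, 64)
compiled = torch.compile(model)

def ms_per_iter(fn, n=50):
    fn(x)  # warmup (triggers compilation for the compiled variant)
    # On GPU, call torch.cuda.synchronize() before reading the clock.
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n):
            fn(x)
    return (time.perf_counter() - start) / n * 1e3

print(f"eager:    {ms_per_iter(model):.3f} ms/iter")
print(f"compiled: {ms_per_iter(compiled):.3f} ms/iter")
```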
Image outputs
Tested end-to-end image generation: successfully generated images using `QwenImagePipeline`, and the pipeline runs without errors. Here is the generated output:
FA Varlen
FA varlen is the natural next step, yes! I'm interested in working on it. Should I keep iterating in this PR, or should we merge it and create a new issue?
The mask infrastructure this PR adds would translate well to varlen: instead of masking padding tokens, we'd pack only valid tokens using the same sequence-length information.
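A rough sketch of what that packing could look like (a hypothetical helper; `pack_valid_tokens`, its names, and shapes are assumptions, not code from this PR or from flash-attn):

```python
import torch

def pack_valid_tokens(hidden_states, mask):
    """Pack only the valid (unmasked) tokens and build cu_seqlens,
    the cumulative sequence-length tensor that varlen attention kernels expect."""
    # hidden_states: (batch, seq_len, dim); mask: (batch, seq_len), bool
    seqlens = mask.sum(dim=1)                               # valid tokens per sample
    cu_seqlens = torch.nn.functional.pad(
        seqlens.cumsum(dim=0, dtype=torch.int32), (1, 0)
    )                                                       # (batch + 1,)
    packed = hidden_states[mask]                            # (total_valid, dim)
    return packed, cu_seqlens, int(seqlens.max())

# Example: two prompts of 7 and 14 tokens padded to length 14.
mask = torch.zeros(2, 14, dtype=torch.bool)
mask[0, :7] = True
mask[1, :] = True
packed, cu_seqlens, max_len = pack_valid_tokens(torch.randn(2, 14, 64), mask)
print(packed.shape, cu_seqlens.tolist(), max_len)  # torch.Size([21, 64]) [0, 7, 21] 14
```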
Thanks for the results! Looks quite nice.
> FA varlen is the natural next step, yes! I'm interested in working on it. Should I keep iterating in this PR, or should we merge it and create a new issue?
I think it's fine to first merge this PR and then we work on it afterwards. We're adding easier support for Sage and FA2 in this PR: https://github.com/huggingface/diffusers/pull/12439, so after that's merged, it will be quite easy to work on that (thanks to the kernels lib).
Could we also check if the outputs deviate with and without the masks, i.e., the outputs we get on main and this PR branch?
@dxqb would you maybe be interested in checking this PR out as well?
@naykun WDYT about this PR?
Hey. I would like to work on the latest inputs and comments. I was off last week attending a conference but should get back to it tomorrow. Any additional feedback or requests are appreciated, of course.
Hi, I'm back; I'll reply to the points below:
@dxqb
Comment 1: `encoder_hidden_states_mask` Type Hints
Updated to use a `torch.BoolTensor` type hint and added documentation clarifying that only boolean masks are supported. Float masks have different semantics in PyTorch SDPA (for a float mask, 0.0 means "not masked", whereas for a bool mask, False means "masked"), and supporting both would require additional conversion logic. Bool masks cover the standard use case of excluding padding tokens.
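A minimal sketch of the SDPA semantics referred to above (bool masks select which keys to attend to; float masks are added to the attention logits):

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 4, 64)  # (batch, heads, seq_len, head_dim)

# Bool mask: True = attend, False = masked out.
bool_mask = torch.tensor([True, True, True, False]).view(1, 1, 1, 4)

# Float mask: values are *added* to the attention logits,
# so 0.0 = attend and -inf = masked out.
float_mask = torch.zeros(1, 1, 1, 4)
float_mask[..., 3] = float("-inf")

out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
out_float = F.scaled_dot_product_attention(q, k, v, attn_mask=float_mask)
print(torch.allclose(out_bool, out_float))  # same masking, different conventions
```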
Comment 2: 2D Attention Mask Optimization
Implemented. Changed from generating full `[batch, seq_len, seq_len]` masks to a broadcasting-friendly `[batch, 1, 1, seq_len]` shape. This significantly reduces memory usage (e.g., from 73,728 to 384 elements for batch=2, seq_len=192) while maintaining correctness through broadcasting in SDPA.
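A small sketch reproducing the element-count comparison (the shapes and padding amount are illustrative):

```python
import torch

batch, seq_len = 2, 192

# Hypothetical 1D bool mask: last 32 tokens of each sequence are padding.
key_mask = torch.ones(batch, seq_len, dtype=torch.bool)
key_mask[:, -32:] = False

# Previous approach: materialize a full (batch, seq_len, seq_len) mask.
full_mask = key_mask[:, None, :].expand(batch, seq_len, seq_len)
print(full_mask.numel())   # 73728 elements

# New approach: keep a broadcastable (batch, 1, 1, seq_len) key mask and let
# SDPA broadcast it over heads and query positions.
bcast_mask = key_mask[:, None, None, :]
print(bcast_mask.numel())  # 384 elements
```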
@sayakpaul
`torch.compile` Performance
I've tested with `torch.compile()` and the results show no performance regression:
Without compile:
- With mask: 0.0439s per iteration
- Without mask: 0.0438s per iteration
With compile:
- With mask: 0.0437s per iteration
- Without mask: 0.0436s per iteration
The mask implementation is compile-friendly and maintains performance. See full results in the gist.
Output Comparison (main vs PR branch)
I ran comprehensive tests comparing outputs between main and this PR branch:
1. Numerical Comparison:
- PR branch: Mean difference = 1.08e-04 (mask functional)
- Main branch: Mean difference = 0.00e+00 (mask ignored)
2. Image Generation Comparison: Using batched prompts with different lengths (natural padding scenario):
- Prompt: "A cat" (7 tokens) batched with longer prompt (14 tokens) creates 7 padding tokens
- With mask (PR): Padding properly excluded from attention
- Without mask (simulating bug): Padding treated as real tokens
- Result: 87.17% of pixels differ (mean absolute difference: 9.35/255)
(Image outputs: with mask, without mask, and the pixel difference map.)
3. Main Branch Status: Main branch crashes when using masks with batched variable-length prompts:
`RuntimeError: The size of tensor a (14) must match the size of tensor b (7)`
This is the exact bug this PR fixes.
All test scripts and results available in the gist: https://gist.github.com/cdutr/bea337e4680268168550292d7819dc2f
Thanks for your work here!
Do you think there might some overlap between https://github.com/huggingface/diffusers/pull/12702 and this PR?
@cdutr yes, it's somewhat duplicate work to what I have. I can remove that part from my PR, and then you could point your PR to mine with these changes; that way we can merge both things. What do you think?
Hey @kashif, the comparisons on your PR are really nice! Yes, I can point my PR to yours. Let me know once you've removed the overlapping parts and I'll rebase on top of your branch.
@kashif Is it OK if we just go with your PR and make @cdutr a co-author there instead? That PR is already under review by the Qwen team; it is a big PR and will take some work for them to review and test. Let's not make it harder than necessary.
@cdutr sorry this happened, but we are really impressed with the work here and will grant the MVP rewards accordingly :)
Adding on top of @yiyixuxu,
@cdutr since you have already set up the benchmarking for this setup, maybe you could contribute that to @kashif's PR as well? Maybe we could document all of that in https://huggingface.co/docs/diffusers/main/en/api/pipelines/qwenimage?
Again, we're sorry that this happened. But regardless, we would like to honor your contributions and grant you the MVP rewards as Yiyi mentioned (after the PR is merged).
Thanks a lot for the MVP rewards, I really appreciate it and I'm really enjoying working on this project.
It makes total sense to move forward with merging the other PR since it's already in review. For sure, I can contribute the benchmarks and documentation there; I'll join the discussion.
@cdutr you should be able to commit directly to my branch.
@sayakpaul, I will close this PR since we are working on the other one, OK?
Yes, fine. Once again, thanks for your contributions! I will get back to you regarding the MVP rewards shortly. We will collaborate more, for sure :)