Fix QwenImage txt_seq_lens handling
What does this PR do?
- Removes the redundant `txt_seq_lens` plumbing from all QwenImage pipelines and modular steps; the transformer now infers the text length from encoder inputs/masks and validates optional overrides.
- Builds a lightweight, broadcastable attention mask from `encoder_hidden_states_mask` inside the double-stream attention, avoiding full `seq_len²` masks while keeping padding tokens masked (see the sketch below).
- Adjusts QwenImage Transformer/ControlNet RoPE to take a single text length and documents the fallback behavior.
- Adds regression tests to ensure short `txt_seq_lens` values and encoder masks are handled safely.
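A rough sketch of the mask construction described above, assuming text tokens precede image tokens in the joint sequence (helper name and ordering are illustrative, not the exact diffusers implementation):

```python
# Illustrative helper (names and joint-sequence ordering are assumptions):
# derive the text length from the encoder mask and build a key-padding mask
# that broadcasts over heads/queries instead of a dense seq_len x seq_len mask.
import torch


def build_joint_attention_mask(encoder_hidden_states_mask: torch.Tensor, image_seq_len: int):
    # encoder_hidden_states_mask: (batch, txt_seq_len); 1 for real tokens, 0 for padding.
    batch_size, txt_seq_len = encoder_hidden_states_mask.shape

    # Image tokens are never padded, so they are always attendable.
    image_mask = encoder_hidden_states_mask.new_ones(batch_size, image_seq_len)

    # Joint sequence assumed to be [text, image]; keep a (batch, 1, 1, txt+img)
    # boolean key-padding mask that broadcasts over heads and query positions.
    joint_mask = torch.cat([encoder_hidden_states_mask, image_mask], dim=1)
    attention_mask = joint_mask[:, None, None, :].bool()
    return attention_mask, txt_seq_len
```

A boolean mask like this can be passed directly to `torch.nn.functional.scaled_dot_product_attention`, which interprets `True` as "attend".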
Fixes #12344
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline?
- [ ] Did you read our philosophy doc (important for complex PRs)?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
just a few comments, not a full review:
- there is some overlap with #12655
- this code has the same issue mentioned in #12655 (expecting boolean semantics in a FloatTensor - but float attention masks are interpreted differently)
- Could you clarify what the purpose of this PR is?
If the purpose is to remove the `txt_seq_lens` parameters and infer the sequence lengths from the attention mask: why is it still a parameter of the transformer model? If the purpose is to pass sequence lengths to the attention dispatch (see https://github.com/huggingface/diffusers/issues/12344#issuecomment-3516800940), then the sequence lengths for each batch sample must be inferred from the mask and passed to the transformer blocks, not only the maximum sequence length across all batch samples for RoPE.
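On the boolean-vs-float point above: PyTorch's `scaled_dot_product_attention` treats a floating-point mask as an additive bias and a boolean mask as "True = attend", so a 0/1 `FloatTensor` silently fails to mask padding. A minimal, self-contained illustration (not code from this PR):

```python
# Sketch of the pitfall: torch SDPA adds a float mask to the attention logits,
# while a bool mask means "True = attend". A 0/1 FloatTensor is therefore not
# equivalent to a boolean padding mask.
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 16, 64)   # (batch, heads, seq, head_dim)
pad_mask = torch.ones(1, 1, 1, 16)
pad_mask[..., 12:] = 0.0                 # last 4 tokens are padding

# Wrong: 0/1 floats just add 0 or +1 to the logits; padding is still attended.
out_wrong = F.scaled_dot_product_attention(q, k, v, attn_mask=pad_mask)

# Right: either cast to bool ...
out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=pad_mask.bool())
# ... or convert to an explicit large-negative additive bias.
additive = (1.0 - pad_mask) * torch.finfo(q.dtype).min
out_additive = F.scaled_dot_product_attention(q, k, v, attn_mask=additive)
```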
thanks @dxqb, the idea was to remove `txt_seq_lens` altogether and work with any mask pattern.
https://github.com/huggingface/diffusers/pull/12655 provides some speed benchmarks as well. Would it be possible to provide them here too? @kashif
cc @naykun can you take a look here too?
Hey @kashif! I've prepared a documentation update with a new Performance section covering:
- Attention backend benchmarks (from your tests)
- torch.compile speedup (~2.4x)
- Variable-length prompt handling with CFG
I also mention the gist with the scripts I used.
Can you double-check?
Is there anything missing, or something else I can help with?
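For reference, a minimal sketch of the kind of compile setup that would produce such a speedup number (model id, dtype, and settings are assumptions, not taken from the linked gist):

```python
# Hypothetical benchmark setup; the real scripts live in the linked gist.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

# The first call pays the compilation cost; time subsequent calls to measure the speedup.
image = pipe("a cat wearing sunglasses", num_inference_steps=20).images[0]
```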
Thank you so much for this excellent PR! It’s clean, well-structured, and addresses several long-standing issues. I’ve left a few questions; we can discuss them further.
thanks @dxqb can you test now?
Sure, will do. Could you have a look at this PR please? https://github.com/huggingface/diffusers/pull/12870 It is based on yours; I've made minor changes to your code.
@dxqb you can also send a PR to this branch or fix this PR directly
I'll send the small changes to your PR that I needed (like support for per-sample text sequence lengths), but keep the PR overall separate; it's a separate feature.
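For completeness, per-sample text lengths can be read straight off the mask; a hypothetical helper (assuming right-padded masks with 1 for real tokens):

```python
import torch


def per_sample_txt_seq_lens(encoder_hidden_states_mask: torch.Tensor) -> torch.Tensor:
    # (batch, txt_seq_len) mask, 1 for real tokens and 0 for padding;
    # with right padding, the token count equals the valid length per sample.
    return encoder_hidden_states_mask.to(torch.int32).sum(dim=-1)
```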