Fix QwenImage txt_seq_lens handling
What does this PR do?
- Removes the redundant `txt_seq_lens` plumbing from all QwenImage pipelines and modular steps; the transformer now infers the text length from encoder inputs/masks and validates optional overrides.
- Builds a lightweight, broadcastable attention mask from `encoder_hidden_states_mask` inside the double-stream attention, avoiding full `seq_len²` masks while keeping padding tokens masked (see the sketch below).
- Adjusts QwenImage Transformer/ControlNet RoPE to take a single text length and documents the fallback behavior.
- Adds regression tests to ensure short `txt_seq_lens` values and encoder masks are handled safely.
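A rough sketch of the mask construction described above, assuming text tokens precede image tokens in the joint sequence (helper name and ordering are illustrative, not the exact diffusers implementation):

```python
# Illustrative helper (names and joint-sequence ordering are assumptions):
# derive the text length from the encoder mask and build a key-padding mask
# that broadcasts over heads/queries instead of a dense seq_len x seq_len mask.
import torch


def build_joint_attention_mask(encoder_hidden_states_mask: torch.Tensor, image_seq_len: int):
    # encoder_hidden_states_mask: (batch, txt_seq_len); 1 for real tokens, 0 for padding.
    batch_size, txt_seq_len = encoder_hidden_states_mask.shape

    # Image tokens are never padded, so they are always attendable.
    image_mask = encoder_hidden_states_mask.new_ones(batch_size, image_seq_len)

    # Joint sequence assumed to be [text, image]; keep a (batch, 1, 1, txt+img)
    # boolean key-padding mask that broadcasts over heads and query positions.
    joint_mask = torch.cat([encoder_hidden_states_mask, image_mask], dim=1)
    attention_mask = joint_mask[:, None, None, :].bool()
    return attention_mask, txt_seq_len
```

A boolean mask like this can be passed directly to `torch.nn.functional.scaled_dot_product_attention`, which interprets `True` as "attend".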
Fixes #12344
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline?
- [ ] Did you read our philosophy doc (important for complex PRs)?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
just a few comments, not a full review:
- there is some overlap with #12655
- this code has the same issue mentioned in #12655 (expecting boolean semantics in a FloatTensor - but float attention masks are interpreted differently)
- Could you clarify what the purpose of this PR is?
If the purpose is to remove the `txt_seq_lens` parameters and infer the sequence lengths from the attention mask: why is it still a parameter of the transformer model? If the purpose is to pass sequence lengths to the attention dispatch (see https://github.com/huggingface/diffusers/issues/12344#issuecomment-3516800940), then the sequence lengths for each batch sample must be inferred from the mask and passed to the transformer blocks, not only the maximum sequence length across all batch samples for RoPE.
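On the boolean-vs-float point above: PyTorch's `scaled_dot_product_attention` treats a floating-point mask as an additive bias and a boolean mask as "True = attend", so a 0/1 `FloatTensor` silently fails to mask padding. A minimal, self-contained illustration (not code from this PR):

```python
# Sketch of the pitfall: torch SDPA adds a float mask to the attention logits,
# while a bool mask means "True = attend". A 0/1 FloatTensor is therefore not
# equivalent to a boolean padding mask.
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 16, 64)   # (batch, heads, seq, head_dim)
pad_mask = torch.ones(1, 1, 1, 16)
pad_mask[..., 12:] = 0.0                 # last 4 tokens are padding

# Wrong: 0/1 floats just add 0 or +1 to the logits; padding is still attended.
out_wrong = F.scaled_dot_product_attention(q, k, v, attn_mask=pad_mask)

# Right: either cast to bool ...
out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=pad_mask.bool())
# ... or convert to an explicit large-negative additive bias.
additive = (1.0 - pad_mask) * torch.finfo(q.dtype).min
out_additive = F.scaled_dot_product_attention(q, k, v, attn_mask=additive)
```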
thanks @dxqb, the idea was to remove `txt_seq_lens` altogether and work with any mask pattern.
https://github.com/huggingface/diffusers/pull/12655 provides some speed benchmarks as well. Would it be possible to provide them here too? @kashif
cc @naykun can you take a look here too?
Hey @kashif! I've prepared a documentation update with a new Performance section covering:
- Attention backend benchmarks (from your tests)
- torch.compile speedup (~2.4x)
- Variable-length prompt handling with CFG
I also mention the gist with the scripts I used.
Can you double-check?
Is there anything missing, or something else I can help with?
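For reference, a minimal sketch of the kind of compile setup that would produce such a speedup number (model id, dtype, and settings are assumptions, not taken from the linked gist):

```python
# Hypothetical benchmark setup; the real scripts live in the linked gist.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

# The first call pays the compilation cost; time subsequent calls to measure the speedup.
image = pipe("a cat wearing sunglasses", num_inference_steps=20).images[0]
```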
Thank you so much for this excellent PR! It’s clean, well-structured, and addresses several long-standing issues. I’ve left a few questions; we can discuss them further.
thanks @dxqb can you test now?
Sure, will do. Could you have a look at this PR please? https://github.com/huggingface/diffusers/pull/12870 It is based on yours; I've made minor changes to your code.
@dxqb you can also send a PR to this branch or fix this PR directly
I'll send the small changes to your PR that I needed (like support for per-sample text sequence lengths), but keep the PR overall separate; it's a separate feature.
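For completeness, per-sample text lengths can be read straight off the mask; a hypothetical helper (assuming right-padded masks with 1 for real tokens):

```python
import torch


def per_sample_txt_seq_lens(encoder_hidden_states_mask: torch.Tensor) -> torch.Tensor:
    # (batch, txt_seq_len) mask, 1 for real tokens and 0 for padding;
    # with right padding, the token count equals the valid length per sample.
    return encoder_hidden_states_mask.to(torch.int32).sum(dim=-1)
```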