
Fix QwenImage txt_seq_lens handling

Open · kashif opened this pull request 1 month ago • 8 comments

What does this PR do?

  • Removes the redundant txt_seq_lens plumbing from all QwenImage pipelines and modular steps; the transformer now infers text length from encoder inputs/masks and validates optional overrides.
  • Builds a lightweight broadcastable attention mask from encoder_hidden_states_mask inside the double-stream attention, avoiding full seq_len² masks while keeping padding tokens masked (a sketch follows this list).
  • Adjusts QwenImage Transformer/ControlNet RoPE to take a single text length and documents the fallback behavior.
  • Adds regression tests to ensure short txt_seq_lens values and encoder masks are handled safely.
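
A minimal sketch of the broadcastable-mask idea referenced above (illustrative, not the exact implementation: it simplifies the double-stream attention to masking only the text keys/values, and assumes a boolean encoder_hidden_states_mask of shape (batch, txt_len)):

```python
import torch
import torch.nn.functional as F

def build_broadcastable_mask(encoder_hidden_states_mask: torch.Tensor) -> torch.Tensor:
    # (B, L_txt) bool -> (B, 1, 1, L_txt): SDPA broadcasts this over heads and
    # query positions, so no full (seq_len, seq_len) mask is ever materialized.
    return encoder_hidden_states_mask.to(torch.bool)[:, None, None, :]

B, H, L_img, L_txt, D = 2, 8, 1024, 77, 64
q = torch.randn(B, H, L_img, D)        # image-stream queries (simplified)
k = torch.randn(B, H, L_txt, D)        # text-stream keys
v = torch.randn(B, H, L_txt, D)
mask = torch.ones(B, L_txt, dtype=torch.bool)
mask[1, 50:] = False                   # second prompt is shorter (right-padded)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=build_broadcastable_mask(mask))
```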

Fixes #12344

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline?
  • [ ] Did you read our philosophy doc (important for complex PRs)?
  • [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

kashif avatar Nov 23 '25 18:11 kashif


just a few comments, not a full review:

  • there is some overlap with #12655
  • this code has the same issue mentioned in #12655: it expects boolean semantics in a FloatTensor, but float attention masks are interpreted differently (see the repro sketch after this list)
  • Could you clarify the purpose of this PR? If it is to remove the txt_seq_lens parameters and infer the sequence lengths from the attention mask, why is txt_seq_lens still a parameter of the transformer model? If it is a step towards passing sequence lengths to the attention dispatch (see https://github.com/huggingface/diffusers/issues/12344#issuecomment-3516800940), then the sequence length of each batch sample must be inferred from the mask and passed to the transformer blocks, not only the maximum sequence length across all batch samples for RoPE
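
To make that concrete, a small repro of the boolean-vs-float distinction, using torch's documented scaled_dot_product_attention semantics (the tensors are made up):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 1, 4, 8)
k = v = torch.randn(1, 1, 4, 8)

bool_mask = torch.tensor([[[[True, True, False, False]]]])
float_mask = bool_mask.float()  # WRONG: a float mask is *added* to the scores,
                                # so this biases kept positions by +1.0 and
                                # leaves padding positions fully attended

out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
out_float = F.scaled_dot_product_attention(q, k, v, attn_mask=float_mask)
print(torch.allclose(out_bool, out_float))  # False: padding was not masked

# a correct additive float mask uses -inf on masked positions
additive = torch.zeros_like(float_mask).masked_fill(~bool_mask, float("-inf"))
out_fixed = F.scaled_dot_product_attention(q, k, v, attn_mask=additive)
print(torch.allclose(out_bool, out_fixed))  # True
```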

dxqb avatar Nov 29 '25 06:11 dxqb

thanks @dxqb, the idea was to remove txt_seq_lens altogether and work with any mask pattern

kashif avatar Nov 29 '25 15:11 kashif

https://github.com/huggingface/diffusers/pull/12655 provides some speed benchmarks as well. Could you provide them here too? @kashif

sayakpaul avatar Dec 08 '25 05:12 sayakpaul

some benchmarks with various backends:

code: benchmark_backends_qwen.py

[figure: backend_benchmark_complete, comparing QwenImage inference speed across attention backends]
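
For readers who can't open the script, a minimal sketch of this kind of backend timing loop; the backend names and the set_attention_backend call follow diffusers' attention-dispatch API as I understand it, and the model id, prompt, and step counts are illustrative:

```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

for backend in ["native", "flash", "sage"]:  # names assumed; see the attention dispatcher docs
    pipe.transformer.set_attention_backend(backend)
    _ = pipe("warmup", num_inference_steps=2)   # warm up kernels/caches first
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    _ = pipe("a photo of a cat", num_inference_steps=20)
    end.record()
    torch.cuda.synchronize()
    print(f"{backend}: {start.elapsed_time(end):.1f} ms")
```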

kashif avatar Dec 08 '25 10:12 kashif

cc @naykun can you take a look here too?

yiyixuxu avatar Dec 09 '25 18:12 yiyixuxu

Hey @kashif! I've prepared a documentation update with a new Performance section covering:

  • Attention backend benchmarks (from your tests)
  • torch.compile speedup (~2.4x; a sketch follows this list)
  • Variable-length prompt handling with CFG
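
A sketch of the compile setup that kind of speedup usually refers to; the flags, model id, and prompts here are my assumptions, not necessarily what was benchmarked:

```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
# compile only the transformer; the first call pays the compilation cost
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)

_ = pipe("warmup", num_inference_steps=2)
image = pipe("a photo of a cat", num_inference_steps=20).images[0]
```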

I also link the gist with the scripts I used.

Can you double-check?

Is there anything missing, or something else I can help with?

cdutr avatar Dec 11 '25 22:12 cdutr

Thank you so much for this excellent PR! It's clean, well-structured, and addresses several long-standing issues. I've left a few questions; we can discuss them further.

naykun avatar Dec 12 '25 06:12 naykun

thanks @dxqb, can you test now?

kashif avatar Dec 21 '25 16:12 kashif

> thanks @dxqb, can you test now?

Sure, will do. Could you have a look at this PR, please? https://github.com/huggingface/diffusers/pull/12870 It is based on yours; I've made minor changes to your code.

dxqb avatar Dec 21 '25 20:12 dxqb

@dxqb you can also send a PR to this branch or fix this PR directly

kashif avatar Dec 21 '25 22:12 kashif

> @dxqb you can also send a PR to this branch or fix this PR directly

I'll send the small changes I needed in your PR (like support for per-sample text sequence lengths), but keep my PR separate overall; it's a separate feature.
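
For concreteness, a sketch of the per-sample inference described here, assuming a right-padded boolean mask (the names are illustrative):

```python
import torch

# boolean encoder mask, shape (batch, txt_len), right-padded
encoder_hidden_states_mask = torch.cat([
    torch.ones(1, 77, dtype=torch.bool),                       # full-length prompt
    torch.cat([torch.ones(1, 50, dtype=torch.bool),
               torch.zeros(1, 27, dtype=torch.bool)], dim=1),  # shorter prompt
])
# one length per batch sample, for variable-length attention dispatch
txt_seq_lens = encoder_hidden_states_mask.sum(dim=-1)  # tensor([77, 50])
# the single max length across the batch, what RoPE currently uses
max_txt_len = int(txt_seq_lens.max())                  # 77
```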

dxqb avatar Dec 21 '25 23:12 dxqb