Qwen Image prompt encoding is not padding to max seq len
Describe the bug
The pipeline method QwenImagePipeline.encode_prompts is not padding correctly; it pads to the longest sequence length in the batch, which leaves very short embeds that are out of distribution relative to the model's training set.
The padding should remain at 1024 tokens even after the system prompt is dropped. The attention mask has to be expanded too.
Reproduction
Execute QwenImagePipeline.encode_prompts() and check the resulting shape.
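For convenience, a minimal reproduction sketch, assuming the public `encode_prompt` method and the `Qwen/Qwen-Image` checkpoint (the exact method name/kwargs may differ on your diffusers revision):

```python
import torch
from diffusers import QwenImagePipeline

# Checkpoint name assumed; adjust to your local path.
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# Encode a very short prompt and inspect the resulting shapes.
prompt_embeds, prompt_embeds_mask = pipe.encode_prompt(
    prompt="minecraft",
    device="cuda",
)
print(f"{prompt_embeds.shape=}, {prompt_embeds_mask.shape=}")
# Observed: the sequence dim tracks the longest prompt in the batch
# rather than a fixed max_sequence_length.
```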
Logs
prompt_embeds.shape=torch.Size([1, 1024, 3584]), prompt_embeds_mask.shape=torch.Size([1, 1024])
^ after fixing.
prompt_embeds.shape=torch.Size([1, 5, 3584]) prompt_embeds_mask.shape=torch.Size([1, 5])
^ before.
the prompt was simply `minecraft`
this leads to extremely high loss at training time unless very long prompts are used.
at inference time, it causes patch embed artifacts because the RoPE is not accustomed to these positions.
System Info
Latest git main.
Who can help?
No response
A fix should be easy to add quickly if this is a problem, but I'll cc @naykun from the Qwen team to take a look at whether this is intended or unexpected. I do see that we're not making use of max_sequence_length, so it's most likely unexpected.
~~well, loss is at 3.6 without the fix, and normal range with it 🤔~~
the loss is due to my use of the VAE, not the text embed sequence length.
the text embed sequence length merely promotes visual artifacts and burnt learning.
the problem is that because the text positions and image positions share the same RoPE (tricky), i can't pad the text embeds all the way to the max sequence length, because then there are no positions left for the image tokens. it definitely needs some answers from the Qwen team; maybe we just can't use short training captions on this model at all.
if we run the pipeline with a 1024x1024 image and a very long prompt, it ~~simply crashes with the same kind of position errors.~~ silently truncates.
Hi @bghira @a-r-r-o-w , apologies for the delayed response.
If this is solely for inference, it's a known limitation: the current implementation doesn't support variable-length text prompts in batched mode. As a result, padding isn't critical since the sequences will be truncated in the MMDiT anyway.
For training, the situation is more complex. Internally, we have a Megatron-based implementation that supports cu_seqlens (variable sequence lengths) with Flash Attention. However, integrating this may require further discussion and planning.
Let me know your thoughts or use case, and we can explore potential solutions.
can i allow users to decide how long their batch tokens pad to? so they can pick 512 tokens and the whole batch pads to that? i am implementing training in simpletuner and want to provide the most in-distribution options for my users
This might be related: https://github.com/huggingface/diffusers/issues/12294 @bghira you are saying that padding is done wrong. But padding should be masked anyway, unless it is a known fact that Qwen was trained with unmasked padding (like Flux).
If it should be masked, as was clearly the intention of the diffusers code, then the padding length is not relevant. But as the other issue shows: it is not masked currently.
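For reference, this is roughly how a padding mask would be folded into SDPA so padded key positions are ignored (an illustrative sketch, not the actual diffusers attention processor):

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 1024, 64
q = k = v = torch.randn(batch, heads, seq_len, head_dim)

# prompt_embeds_mask: (batch, seq_len), True for real tokens, False for padding.
prompt_embeds_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
prompt_embeds_mask[:, :5] = True  # e.g. a 5-token prompt padded to 1024

# Expand to (batch, 1, 1, seq_len) so the boolean mask broadcasts over heads/queries
# and padded *keys* are excluded from attention.
attn_mask = prompt_embeds_mask[:, None, None, :]

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```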
they used megatron-lm and transformerengine to train qwen-image which, i suppose, does support arbitrary (non-causal) attention masks, but not in the usual manner that we do with SDPA; the transformer engine docs show:
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/attention/attention.html#3.2-Attention-Mask
> Padding masks: For `padding`, `padding_causal`, `padding_causal_bottom_right` mask types, users need to provide sequence length information to help Transformer Engine figure out where each sequence ends in a batch. As of Transformer Engine 2.0, there are two options to do so in PyTorch and one in JAX.
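For context, the "sequence length information" referred to above is typically a cu_seqlens tensor (cumulative sequence lengths), which can be derived from an ordinary padding mask; a rough sketch, not the actual Qwen training code:

```python
import torch

# prompt_embeds_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
prompt_embeds_mask = torch.tensor(
    [[1] * 5 + [0] * 1019,    # 5-token prompt padded to 1024
     [1] * 42 + [0] * 982]    # 42-token prompt padded to 1024
)

seqlens = prompt_embeds_mask.sum(dim=1, dtype=torch.int32)  # per-sample lengths
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
# cu_seqlens == tensor([0, 5, 47]): offsets of each sequence in the packed batch,
# which is what variable-length attention kernels consume instead of a dense mask.
```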
it should be noted that applying this option slows down training to about half speed, and the qwen-image team was very careful about resource efficiency during pretraining, which leans against the idea of them applying the mask.
they mention something about variable-length sequence handling in another thread that i couldn't find after about 20 minutes of searching, but i think that they could be using the tiling approach for masking.
all of that is to say, tl;dr we don't know how to correctly apply it for downstream finetuning unless the modelscope trainer is the canonical example.
it's actually a bit more complicated than my original post made it out to be: a max sequence length of 1024 will cause an error if you provide text inputs of that size, because the overall token count then exceeds the total RoPE canvas size. there was an issue / pull request opened which auto-expands the RoPE frequency range as needed, but i've not been following along with that development.
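To make the position-budget concern concrete, a back-of-the-envelope sketch, assuming an 8x VAE downsampling factor and 2x2 latent patchification (the exact RoPE layout in the Qwen-Image transformer may differ):

```python
# Rough token accounting for a 1024x1024 image under assumed factors.
height = width = 1024
vae_scale = 8    # assumed spatial downsampling of the VAE
patch_size = 2   # assumed packing of 2x2 latent patches per token

image_tokens = (height // vae_scale // patch_size) * (width // vae_scale // patch_size)
text_tokens = 1024  # if prompts were always padded to the full max_sequence_length

# 4096 image tokens + 1024 text tokens = 5120 positions that must all fit
# within whatever position range the RoPE tables were built for.
print(image_tokens, text_tokens, image_tokens + text_tokens)
```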
Hi @sayakpaul, I would like to work on this as a part of the Diffusers MVP program. Thank you :)
hi @sayakpaul, while exploring this issue more closely, I ended up creating a PR for it! This would be my first contribution to diffusers, though I've been a regular contributor to transformers and peft.
would love your insights on how things are usually done in diffusers so I can align accordingly . thanks for the collaborative opportunity — excited to hear your thoughts!
also huge thank you @bghira for opening this issue .
@sambhavnoobcoder could you first discuss the plan with which you intend to tackle this before opening a PR? This helps us reduce back and forth.
Hi @sayakpaul , thank you for the feedback! I apologize for jumping ahead with the PR - I should have discussed the plan first.
Let me outline the approach I took to fix this issue:
Problem Analysis
The root cause was that prompt embeddings were being padded dynamically to the batch maximum length instead of to a fixed length. This caused the same prompt to receive different RoPE position assignments depending on what other prompts were in the batch, leading to non-deterministic outputs.
Proposed Solution
I implemented fixed-length padding where all prompt embeddings are padded to max_sequence_length (default 512, configurable up to 1024) regardless of batch composition; a rough sketch follows the list below. This ensures:
- Consistency: Same prompt always gets same padding and RoPE positions
- Backward compatibility: Default remains 512, matching current behavior
- Flexibility: Users can still configure via max_sequence_length parameter
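A minimal sketch of the padding step being proposed, using a hypothetical pad_prompt_embeds helper (the actual change integrates this logic directly into the pipelines' prompt-encoding methods):

```python
import torch
import torch.nn.functional as F

def pad_prompt_embeds(prompt_embeds, prompt_embeds_mask, max_sequence_length=512):
    """Pad embeddings and mask from the batch-longest length up to a fixed length."""
    pad = max_sequence_length - prompt_embeds.shape[1]
    if pad < 0:
        raise ValueError("prompt is longer than max_sequence_length")
    # Zero-pad the sequence dimension; the mask marks the padded positions as inactive.
    prompt_embeds = F.pad(prompt_embeds, (0, 0, 0, pad))
    prompt_embeds_mask = F.pad(prompt_embeds_mask, (0, pad))
    return prompt_embeds, prompt_embeds_mask

embeds = torch.randn(1, 5, 3584)               # e.g. the "minecraft" prompt
mask = torch.ones(1, 5, dtype=torch.long)
embeds, mask = pad_prompt_embeds(embeds, mask, max_sequence_length=512)
print(embeds.shape, mask.shape)  # torch.Size([1, 512, 3584]) torch.Size([1, 512])
```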
Implementation Scope
The fix required changes across:
- All 8 QwenImage pipeline variants (main, img2img, inpaint, edit, edit_inpaint, edit_plus, controlnet, controlnet_inpaint)
- Modular encoder functions
- Special handling for Edit pipelines (which process image tokens through text encoder)
- RoPE sequence length computation to use padded length instead of actual token count
Testing Strategy
Created test_prompt_embeds_padding() to verify (a rough sketch follows the list):
- Short prompts pad to full max_sequence_length
- Mixed-length batches all pad to same length
- Custom max_sequence_length values work correctly
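A rough sketch of what such a test could look like, assuming pipe is a loaded QwenImagePipeline and that encode_prompt accepts max_sequence_length (exact kwargs may differ):

```python
def test_prompt_embeds_padding(pipe, max_sequence_length=512):
    # Sketch only: encodes a short prompt and checks the fixed-length padding.
    prompt_embeds, prompt_embeds_mask = pipe.encode_prompt(
        prompt="minecraft",
        max_sequence_length=max_sequence_length,
    )
    # Short prompts should still be padded to the full fixed length.
    assert prompt_embeds.shape[1] == max_sequence_length
    assert prompt_embeds_mask.shape[1] == max_sequence_length
    # Only the real tokens should be active in the mask.
    assert prompt_embeds_mask.sum().item() < max_sequence_length
```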
I'm happy to discuss any concerns or alternative approaches you might suggest. Should I have taken a different direction with this fix?
@sayakpaul if this proposal seems fine, could you give my PR a look as well? I was hoping you could review it so we can move forward.
Hi @yiyixuxu @sayakpaul, I have opened a PR for this in line with my proposed fix; I was wondering if I could get some reviews on it.
is there a reason this is being ignored? @DN6
It's not being ignored. The PRs will be reviewed one by one. There's been a delay because of the flurry of releases we have had in a very short amount of time.
The PRs, especially those related to the MVP program, can take a long time because the majority of them tackle critical things. So, we will take our time.
cool, thank you for the clarification @sayakpaul. I'll pick up something else in the meantime and await your reviews on the PR as well. Also thank you @bghira for bumping this up and for raising this issue in the first place.
@sayakpaul the last diffusers release was Oct 14th; all of the new models aren't really usable by packages relying on PyPI releases (they can't use Git tree dependencies)