Qwen Image prompt encoding is not padding to max seq len
Describe the bug
The pipeline method QwenImagePipeline.encode_prompts is not padding correctly; it pads to the longest sequence length in the batch, which leaves very short embeds that are out of distribution relative to the model's training set.
The padding should remain at 1024 tokens even after the system prompt is dropped. The attention mask has to be expanded too.
Reproduction
Execute QwenImagePipeline.encode_prompts() and check the resulting shape.
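For convenience, a minimal reproduction sketch, assuming the public `encode_prompt` method and the `Qwen/Qwen-Image` checkpoint (the exact method name/kwargs may differ on your diffusers revision):

```python
import torch
from diffusers import QwenImagePipeline

# Checkpoint name assumed; adjust to your local path.
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# Encode a very short prompt and inspect the resulting shapes.
prompt_embeds, prompt_embeds_mask = pipe.encode_prompt(
    prompt="minecraft",
    device="cuda",
)
print(f"{prompt_embeds.shape=}, {prompt_embeds_mask.shape=}")
# Observed: the sequence dim tracks the longest prompt in the batch
# rather than a fixed max_sequence_length.
```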
Logs
prompt_embeds.shape=torch.Size([1, 1024, 3584]), prompt_embeds_mask.shape=torch.Size([1, 1024])
^ after fixing.
prompt_embeds.shape=torch.Size([1, 5, 3584]) prompt_embeds_mask.shape=torch.Size([1, 5])
^ before.
the prompt was simply `minecraft`
this leads to extremely high loss at training time unless very long prompts are used.
at inference time, it causes patch embed artifacts because the RoPE is not accustomed to these positions.
System Info
Latest git main.
Who can help?
No response
A fix should be easy to add quickly if this is a problem, but I'll cc @naykun from the Qwen team to take a look at whether this is intended or unexpected. I do see that we're not making use of max_sequence_length, so it's most likely unexpected.
~~well, loss is at 3.6 without the fix, and normal range with it 🤔~~
the loss is due to my use of the VAE, not the text embed sequence length.
the text embed sequence length merely promotes visual artifacts and burnt learning.
the problem is that because the text positions and image positions share the same RoPE (tricky), i can't pad the text embeds all the way to the max sequence length, because then there are no positions left for the image tokens. it definitely needs some answers from the Qwen team; maybe we just can't use short training captions on this model at all.
if we run the pipeline with a 1024x1024 image and a very long prompt, it ~~simply crashes with the same kind of position errors.~~ silently truncates.
Hi @bghira @a-r-r-o-w , apologies for the delayed response.
If this is solely for inference, it's a known limitation: the current implementation doesn't support variable-length text prompts in batched mode. As a result, padding isn't critical since the sequences will be truncated in the MMDiT anyway.
For training, the situation is more complex. Internally, we have a Megatron-based implementation that supports cu_seqlens (variable sequence lengths) with Flash Attention. However, integrating this may require further discussion and planning.
Let me know your thoughts or use case, and we can explore potential solutions.
can i allow users to decide how long their batch tokens pad to? so they can pick 512 tokens and the whole batch pads to that? i am implementing training in simpletuner and want to provide the most in-distribution options for my users
This might be related: https://github.com/huggingface/diffusers/issues/12294 @bghira you are saying that padding is done wrong. But padding should be masked anyway, unless it is a known fact that Qwen was trained with unmasked padding (like Flux).
If it should be masked, as was clearly the intention of the diffusers code, then the padding length is not relevant. But as the other issue shows: it is not masked currently.
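For reference, this is roughly how a padding mask would be folded into SDPA so padded key positions are ignored (an illustrative sketch, not the actual diffusers attention processor):

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 1024, 64
q = k = v = torch.randn(batch, heads, seq_len, head_dim)

# prompt_embeds_mask: (batch, seq_len), True for real tokens, False for padding.
prompt_embeds_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
prompt_embeds_mask[:, :5] = True  # e.g. a 5-token prompt padded to 1024

# Expand to (batch, 1, 1, seq_len) so the boolean mask broadcasts over heads/queries
# and padded *keys* are excluded from attention.
attn_mask = prompt_embeds_mask[:, None, None, :]

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```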
they used megatron-lm and transformerengine to train qwen-image which, i suppose, does support arbitrary (non-causal) attention masks, but not in the usual manner that we do with SDPA; the transformer engine docs show:
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/attention/attention.html#3.2-Attention-Mask
> Padding masks: For `padding`, `padding_causal`, `padding_causal_bottom_right` mask types, users need to provide sequence length information to help Transformer Engine figure out where each sequence ends in a batch. As of Transformer Engine 2.0, there are two options to do so in PyTorch and one in JAX.
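For context, the "sequence length information" referred to above is typically a cu_seqlens tensor (cumulative sequence lengths), which can be derived from an ordinary padding mask; a rough sketch, not the actual Qwen training code:

```python
import torch

# prompt_embeds_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
prompt_embeds_mask = torch.tensor(
    [[1] * 5 + [0] * 1019,    # 5-token prompt padded to 1024
     [1] * 42 + [0] * 982]    # 42-token prompt padded to 1024
)

seqlens = prompt_embeds_mask.sum(dim=1, dtype=torch.int32)  # per-sample lengths
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
# cu_seqlens == tensor([0, 5, 47]): offsets of each sequence in the packed batch,
# which is what variable-length attention kernels consume instead of a dense mask.
```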
it should be noted that applying this option slows down training to about half speed, and the qwen-image team was very careful about resource efficiency during pretraining, which leans against the idea of them applying the mask.
they mention something about variable-length sequence handling in another thread that i couldn't find after about 20 minutes of searching, but i think that they could be using the tiling approach for masking.
all of that is to say, tl;dr we don't know how to correctly apply it for downstream finetuning unless the modelscope trainer is the canonical example.
it's actually a bit more complicated than my original post made it out to be: a max sequence length of 1024 will cause an error if you provide text inputs of that size, because the overall token count then exceeds the total RoPE canvas size. there was an issue / pull request opened which auto-expands the RoPE frequency range as needed, but i've not been following along with that development.
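To make the position-budget concern concrete, a back-of-the-envelope sketch, assuming an 8x VAE downsampling factor and 2x2 latent patchification (the exact RoPE layout in the Qwen-Image transformer may differ):

```python
# Rough token accounting for a 1024x1024 image under assumed factors.
height = width = 1024
vae_scale = 8    # assumed spatial downsampling of the VAE
patch_size = 2   # assumed packing of 2x2 latent patches per token

image_tokens = (height // vae_scale // patch_size) * (width // vae_scale // patch_size)
text_tokens = 1024  # if prompts were always padded to the full max_sequence_length

# 4096 image tokens + 1024 text tokens = 5120 positions that must all fit
# within whatever position range the RoPE tables were built for.
print(image_tokens, text_tokens, image_tokens + text_tokens)
```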
Hi @sayakpaul, I would like to work on this as a part of the Diffusers MVP program. Thank you :)
hi @sayakpaul, while exploring this issue more closely, I ended up creating a PR for it! This would be my first contribution to diffusers, though I've been a regular contributor to transformers and peft.
would love your insights on how things are usually done in diffusers so I can align accordingly . thanks for the collaborative opportunity — excited to hear your thoughts!
also huge thank you @bghira for opening this issue .
@sambhavnoobcoder could you first discuss the plan with which you intend to tackle this before opening a PR? This helps us reduce back and forth.
Hi @sayakpaul , thank you for the feedback! I apologize for jumping ahead with the PR - I should have discussed the plan first.
Let me outline the approach I took to fix this issue:
Problem Analysis
The root cause was that prompt embeddings were being padded dynamically to the batch maximum length instead of to a fixed length. This caused the same prompt to receive different RoPE position assignments depending on what other prompts were in the batch, leading to non-deterministic outputs.
Proposed Solution
I implemented fixed-length padding where all prompt embeddings are padded to max_sequence_length (default 512, configurable up to 1024) regardless of batch composition; a rough sketch follows the list below. This ensures:
- Consistency: Same prompt always gets same padding and RoPE positions
- Backward compatibility: Default remains 512, matching current behavior
- Flexibility: Users can still configure via max_sequence_length parameter
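A minimal sketch of the padding step being proposed, using a hypothetical pad_prompt_embeds helper (the actual change integrates this logic directly into the pipelines' prompt-encoding methods):

```python
import torch
import torch.nn.functional as F

def pad_prompt_embeds(prompt_embeds, prompt_embeds_mask, max_sequence_length=512):
    """Pad embeddings and mask from the batch-longest length up to a fixed length."""
    pad = max_sequence_length - prompt_embeds.shape[1]
    if pad < 0:
        raise ValueError("prompt is longer than max_sequence_length")
    # Zero-pad the sequence dimension; the mask marks the padded positions as inactive.
    prompt_embeds = F.pad(prompt_embeds, (0, 0, 0, pad))
    prompt_embeds_mask = F.pad(prompt_embeds_mask, (0, pad))
    return prompt_embeds, prompt_embeds_mask

embeds = torch.randn(1, 5, 3584)               # e.g. the "minecraft" prompt
mask = torch.ones(1, 5, dtype=torch.long)
embeds, mask = pad_prompt_embeds(embeds, mask, max_sequence_length=512)
print(embeds.shape, mask.shape)  # torch.Size([1, 512, 3584]) torch.Size([1, 512])
```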
Implementation Scope
The fix required changes across:
- All 8 QwenImage pipeline variants (main, img2img, inpaint, edit, edit_inpaint, edit_plus, controlnet, controlnet_inpaint)
- Modular encoder functions
- Special handling for Edit pipelines (which process image tokens through text encoder)
- RoPE sequence length computation to use padded length instead of actual token count
Testing Strategy
Created test_prompt_embeds_padding() to verify (a rough sketch follows the list):
- Short prompts pad to full max_sequence_length
- Mixed-length batches all pad to same length
- Custom max_sequence_length values work correctly
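A rough sketch of what such a test could look like, assuming pipe is a loaded QwenImagePipeline and that encode_prompt accepts max_sequence_length (exact kwargs may differ):

```python
def test_prompt_embeds_padding(pipe, max_sequence_length=512):
    # Sketch only: encodes a short prompt and checks the fixed-length padding.
    prompt_embeds, prompt_embeds_mask = pipe.encode_prompt(
        prompt="minecraft",
        max_sequence_length=max_sequence_length,
    )
    # Short prompts should still be padded to the full fixed length.
    assert prompt_embeds.shape[1] == max_sequence_length
    assert prompt_embeds_mask.shape[1] == max_sequence_length
    # Only the real tokens should be active in the mask.
    assert prompt_embeds_mask.sum().item() < max_sequence_length
```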
I'm happy to discuss any concerns or alternative approaches you might suggest. Should I have taken a different direction with this fix?
@sayakpaul if this proposal seems fine, could you give my PR a look as well? I was hoping you could review it so we can move forward.
Hi @yiyixuxu @sayakpaul, I have opened a PR for this in line with my proposed fix; I was wondering if I could get some reviews on it.
is there a reason this is being ignored? @DN6
It's not being ignored. The PRs will be reviewed one by one. There's been a delay because of the flurry of releases we have had in a very short amount of time.
The PRs, especially those related to the MVP program, can take a long time because the majority of them tackle critical things. So, we will take our time.
cool, thank you for the clarification @sayakpaul. I'll pick up something else in the meantime and await your reviews on the PR as well. Also thank you @bghira for bumping this up and for raising this issue in the first place.
@sayakpaul the last diffusers release was Oct 14th; all of the new models aren't really usable by packages relying on PyPI releases (they can't use Git tree dependencies)