[Feature] Ulysses Attention for any sequence length w/o padding
🤖UAA: Ulysses Anything Attention
We have implemented 📚UAA (Ulysses Anything Attention): a variant of Ulysses Attention that supports arbitrary sequence lengths with ✅zero padding and nearly ✅zero theoretical communication overhead. The default Ulysses Attention requires that the sequence length of the hidden states be divisible by the number of devices, which imposes significant limitations on the practical application of Ulysses.
```python
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

cache_dit.enable_cache(
    pipe_or_adapter,  # your diffusers pipeline (or adapter)
    cache_config=DBCacheConfig(...),
    # Set `experimental_ulysses_anything` to True to enable UAA
    parallelism_config=ParallelismConfig(
        ulysses_size=2,
        parallel_kwargs={
            "experimental_ulysses_anything": True,
        },
    ),
)
# torchrun --nproc_per_node=2 parallel_cache_ulysses_anything.py
```
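To illustrate the constraint that UAA removes (the numbers below are only an example, and the exact sharding scheme used by cache-dit may differ), a sequence whose length is not divisible by the Ulysses group size cannot be split evenly, so vanilla Ulysses would need padding, while UAA simply shards it unevenly:

```python
seq_len, ulysses_size = 4097, 2
assert seq_len % ulysses_size != 0  # vanilla Ulysses Attention would require padding here

# Uneven sharding: the first (seq_len % ulysses_size) ranks get one extra token.
shard_sizes = [
    seq_len // ulysses_size + (1 if rank < seq_len % ulysses_size else 0)
    for rank in range(ulysses_size)
]
print(shard_sizes)  # [2049, 2048] -> shard shapes UAA can exchange directly, no padding
```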
For example, in T2I and I2V tasks, the length of user-provided prompts is often variable, and it is difficult to guarantee that it is divisible by the number of devices. To address this issue, we have developed a ✅padding-free Ulysses Attention (UAA) for arbitrary sequence lengths, which enhances the versatility of Ulysses.
Compared to standard Ulysses Attention, UAA only adds one extra all-gather op on scalar values to collect each rank's seq_len. To avoid the repeated forced CUDA synchronizations caused by H2D and D2H transfers, please add the ✅gloo backend in `init_process_group`; this significantly reduces communication latency.

```python
import torch.distributed as dist

dist.init_process_group(backend="cpu:gloo,cuda:nccl")
```
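As a minimal sketch of that extra step (not the actual cache-dit internals; `gather_seq_lens` is only an illustrative helper name), gathering the per-rank sequence lengths on CPU tensors lets the collective run over Gloo, so no H2D/D2H copy or CUDA synchronization is forced:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(backend="cpu:gloo,cuda:nccl") was called, so
# collectives on CPU tensors are routed to the Gloo backend.
def gather_seq_lens(local_seq_len: int) -> list[int]:
    """All-gather each rank's sequence length as a scalar CPU tensor."""
    world_size = dist.get_world_size()
    local = torch.tensor([local_seq_len], dtype=torch.int64)  # CPU tensor -> Gloo
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    return [int(t.item()) for t in gathered]
```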
U*: Ulysses Attention; UAA: Ulysses Anything Attention; UAA*: UAA + Gloo; Device: NVIDIA L20.
FLUX.1-Dev w/o CPU offload, 28 steps; Qwen-Image w/ CPU offload, 50 steps; Gloo: extra all-gather w/ Gloo.
| CP2 w/ U* | CP2 w/ UAA* | CP2 w/ UAA | L20x1 | CP2 w/ UAA* | CP2 w/ U* | L20x1 | CP2 w/ UAA* |
|---|---|---|---|---|---|---|---|
| FLUX, 13.87s | 🎉13.88s | 14.75s | 23.25s | 🎉13.75s | Qwen, 132s | 181s | 🎉133s |
| 1024x1024 | 1024x1024 | 1024x1024 | 1008x1008 | 1008x1008 | 1312x1312 | 1328x1328 | 1328x1328 |
| ✔️U* ✔️UAA | ✔️U* ✔️UAA | ✔️U* ✔️UAA | NO CP | ❌U* ✔️UAA | ✔️U* ✔️UAA | NO CP | ❌U* ✔️UAA |
> [!IMPORTANT]
> Please note that Ulysses Anything Attention (UAA) is currently an experimental feature. It has not undergone large-scale testing and may introduce a slight performance degradation when the `cpu:gloo` communication backend is not available.
@sayakpaul @DN6 Please let me know if you want to have UAA in diffusers. I'd be more than happy to submit a PR to support it. The implementation of UAA is here: https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/parallelism/backends/native_diffusers/context_parallelism/attention/_templated_ulysses_anything.py
How does this fare against unified attention?
I believe UAA would be a better implementation of Ulysses attention for arbitrary sequence lengths. It offers zero padding, near-zero communication overhead, minimal extra I/O access, and is PyTorch-native (leveraging `all_to_all_single` with uneven input/output split sizes is all that's required), which significantly enhances the versatility of Ulysses attention. Tests conducted on FLUX.1 and Qwen-Image (exclusively on NVIDIA L20 hardware, as I do not currently have access to H100/H200/B200) demonstrate that, compared to standard Ulysses, UAA introduces only a slight latency overhead (~1%), while handling arbitrary resolutions and prompt token lengths in scenarios where standard Ulysses often fails.
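For reference, here is a minimal sketch of that core primitive (not the cache-dit implementation; `uneven_all_to_all` is an illustrative name, and head splitting is omitted): `torch.distributed.all_to_all_single` already accepts uneven `input_split_sizes`/`output_split_sizes`, which is what makes a padding-free token exchange possible:

```python
import torch
import torch.distributed as dist

def uneven_all_to_all(x: torch.Tensor, in_splits: list[int], out_splits: list[int]) -> torch.Tensor:
    """Exchange token shards of different sizes across ranks without padding.

    x: [sum(in_splits), dim] local tokens; in_splits/out_splits are per-rank token counts.
    """
    out = x.new_empty((sum(out_splits), x.shape[-1]))
    dist.all_to_all_single(
        out,
        x.contiguous(),
        output_split_sizes=out_splits,
        input_split_sizes=in_splits,
    )
    return out
```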
Sounds like a good option to me. @DefTruth would you like to work on adding it?
My pleasure. However, I've been quite busy lately; I'll submit the implementation of UAA to diffusers when I have some free time.