[Feature] Ulysses Attention for any sequence length w/o padding
🤖UAA: Ulysses Anything Attention
We have implemented 📚UAA (Ulysses Anything Attention): a variant of Ulysses Attention that supports arbitrary sequence lengths with ✅zero padding and nearly ✅zero theoretical communication overhead. The default Ulysses Attention requires that the sequence length of the hidden states be divisible by the number of devices, which imposes significant limitations on the practical application of Ulysses.
```python
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

cache_dit.enable_cache(
    pipe_or_adapter,  # your diffusers pipeline (or adapter)
    cache_config=DBCacheConfig(...),
    # Set `experimental_ulysses_anything` to True to enable UAA
    parallelism_config=ParallelismConfig(
        ulysses_size=2,
        parallel_kwargs={
            "experimental_ulysses_anything": True,
        },
    ),
)
# torchrun --nproc_per_node=2 parallel_cache_ulysses_anything.py
```
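To illustrate the constraint that UAA removes (the numbers below are only an example, and the exact sharding scheme used by cache-dit may differ), a sequence whose length is not divisible by the Ulysses group size cannot be split evenly, so vanilla Ulysses would need padding, while UAA simply shards it unevenly:

```python
seq_len, ulysses_size = 4097, 2
assert seq_len % ulysses_size != 0  # vanilla Ulysses Attention would require padding here

# Uneven sharding: the first (seq_len % ulysses_size) ranks get one extra token.
shard_sizes = [
    seq_len // ulysses_size + (1 if rank < seq_len % ulysses_size else 0)
    for rank in range(ulysses_size)
]
print(shard_sizes)  # [2049, 2048] -> shard shapes UAA can exchange directly, no padding
```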
For example, in T2I and I2V tasks, the length of user-provided prompts is often variable, and it is difficult to guarantee that it is divisible by the number of devices. To address this issue, we have developed a ✅padding-free Ulysses Attention (UAA) for arbitrary sequence lengths, which enhances the versatility of Ulysses.
Compared to standard Ulysses Attention, UAA only adds one extra all-gather op on scalar values to collect each rank's seq_len. To avoid the repeated forced CUDA synchronizations caused by H2D and D2H transfers, please add the ✅gloo backend in `init_process_group`; this significantly reduces communication latency.

```python
import torch.distributed as dist

dist.init_process_group(backend="cpu:gloo,cuda:nccl")
```
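As a minimal sketch of that extra step (not the actual cache-dit internals; `gather_seq_lens` is only an illustrative helper name), gathering the per-rank sequence lengths on CPU tensors lets the collective run over Gloo, so no H2D/D2H copy or CUDA synchronization is forced:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(backend="cpu:gloo,cuda:nccl") was called, so
# collectives on CPU tensors are routed to the Gloo backend.
def gather_seq_lens(local_seq_len: int) -> list[int]:
    """All-gather each rank's sequence length as a scalar CPU tensor."""
    world_size = dist.get_world_size()
    local = torch.tensor([local_seq_len], dtype=torch.int64)  # CPU tensor -> Gloo
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    return [int(t.item()) for t in gathered]
```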
U*: Ulysses Attention; UAA: Ulysses Anything Attention; UAA*: UAA + Gloo; Device: NVIDIA L20.
FLUX.1-Dev w/o CPU offload, 28 steps; Qwen-Image w/ CPU offload, 50 steps; Gloo: extra all-gather w/ Gloo.
| CP2 w/ U* | CP2 w/ UAA* | CP2 w/ UAA | L20x1 | CP2 w/ UAA* | CP2 w/ U* | L20x1 | CP2 w/ UAA* |
|---|---|---|---|---|---|---|---|
| FLUX, 13.87s | 🎉13.88s | 14.75s | 23.25s | 🎉13.75s | Qwen, 132s | 181s | 🎉133s |
| 1024x1024 | 1024x1024 | 1024x1024 | 1008x1008 | 1008x1008 | 1312x1312 | 1328x1328 | 1328x1328 |
| ✔️U* ✔️UAA | ✔️U* ✔️UAA | ✔️U* ✔️UAA | NO CP | ❌U* ✔️UAA | ✔️U* ✔️UAA | NO CP | ❌U* ✔️UAA |
> [!IMPORTANT]
> Please note that Ulysses Anything Attention (UAA) is currently an experimental feature. It has not undergone large-scale testing and may introduce a slight performance degradation when the `cpu:gloo` communication backend is not available.
@sayakpaul @DN6 Please let me know if you want to have UAA in diffusers. I'd be more than happy to submit a PR to support it. The implementation of UAA is here: https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/parallelism/backends/native_diffusers/context_parallelism/attention/_templated_ulysses_anything.py
How does this fare against unified attention?
I believe UAA would be a better implementation of Ulysses attention for arbitrary sequence lengths. It offers zero padding, near-zero communication overhead, minimal extra I/O access, and is PyTorch-native (leveraging `all_to_all_single` with uneven input/output split sizes is all that's required), which significantly enhances the versatility of Ulysses attention. Tests conducted on FLUX.1 and Qwen-Image (exclusively on NVIDIA L20 hardware, as I do not currently have access to H100/H200/B200) demonstrate that, compared to standard Ulysses, UAA introduces only a slight latency overhead (~1%), while handling arbitrary resolutions and prompt token lengths in scenarios where standard Ulysses often fails.
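For reference, here is a minimal sketch of that core primitive (not the cache-dit implementation; `uneven_all_to_all` is an illustrative name, and head splitting is omitted): `torch.distributed.all_to_all_single` already accepts uneven `input_split_sizes`/`output_split_sizes`, which is what makes a padding-free token exchange possible:

```python
import torch
import torch.distributed as dist

def uneven_all_to_all(x: torch.Tensor, in_splits: list[int], out_splits: list[int]) -> torch.Tensor:
    """Exchange token shards of different sizes across ranks without padding.

    x: [sum(in_splits), dim] local tokens; in_splits/out_splits are per-rank token counts.
    """
    out = x.new_empty((sum(out_splits), x.shape[-1]))
    dist.all_to_all_single(
        out,
        x.contiguous(),
        output_split_sizes=out_splits,
        input_split_sizes=in_splits,
    )
    return out
```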
Sounds like a good option to me. @DefTruth would you like to work on adding it?
My pleasure. However, I've been quite busy lately; I'll submit the implementation of UAA to diffusers when I have some free time.