[feat] LongSANA: a minute-length real-time video generation model
This PR adds support for LongSANA, a minute-length real-time video generation model.
Related links:
- Project: https://nvlabs.github.io/Sana/Video
- Code: https://github.com/NVlabs/Sana
- Paper: https://arxiv.org/pdf/2509.24695
PR feature:
LongSANA uses a Causal Linear Attention KV Cache during inference, which is crucial for long video generation (FlashAttention support may require a separate PR). This PR adds causal computation logic for both Linear Attention and Mix-FFN (the Conv in the MLP).
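For intuition, here is a minimal, self-contained sketch of how a linear attention KV cache can carry state across chunks. The feature map, class name, and shapes below are illustrative assumptions, not the actual implementation in this PR: linear attention keeps a running sum `S = Σ φ(k) vᵀ` and `z = Σ φ(k)`, so each new chunk attends causally to the entire past through a fixed-size state.

```python
import torch
import torch.nn.functional as F

def phi(x):
    # Non-negative feature map; ReLU is one common choice for linear attention.
    return F.relu(x)

class LinearAttnCache:
    """Running KV state carried across chunks: O(Dk*Dv) per head, independent of video length."""
    def __init__(self):
        self.S = None  # sum of phi(k) v^T over past tokens, (B, H, Dk, Dv)
        self.z = None  # sum of phi(k) over past tokens,     (B, H, Dk, 1)

def causal_linear_attention(q, k, v, cache):
    # q, k: (B, H, T, Dk); v: (B, H, T, Dv); tokens arrive in temporal order.
    q, k = phi(q), phi(k)
    # Prefix sums within the chunk enforce causality token by token.
    kv = torch.einsum("bhtd,bhte->bhtde", k, v).cumsum(dim=2)  # (B, H, T, Dk, Dv)
    zk = k.cumsum(dim=2).unsqueeze(-1)                         # (B, H, T, Dk, 1)
    if cache.S is not None:  # fold in the state from all previous chunks
        kv = kv + cache.S.unsqueeze(2)
        zk = zk + cache.z.unsqueeze(2)
    cache.S, cache.z = kv[:, :, -1], zk[:, :, -1]  # state after this chunk
    num = torch.einsum("bhtd,bhtde->bhte", q, kv)                  # (B, H, T, Dv)
    den = torch.einsum("bhtd,bhtde->bhte", q, zk).clamp(min=1e-6)  # (B, H, T, 1)
    return num / den
```

Carrying `(S, z)` instead of all past keys and values is what keeps memory flat over minute-long sequences.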
Added classes and functions
- Add `SanaVideoCausalTransformerBlock` and `SanaVideoCausalTransformer3DModel`;
- Add `LongSanaVideoPipeline` for the Linear Attention KV cache;
- Support converting LongSANA checkpoints from `.pth` to diffusers safetensors (a sketch follows this list).
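For reference, a minimal sketch of the `.pth`-to-safetensors step. The filenames are placeholders, and the real conversion script additionally remaps parameter names to the diffusers layout, which is omitted here:

```python
import torch
from safetensors.torch import save_file

# Load the original checkpoint (filename is a placeholder).
state_dict = torch.load("longsana.pth", map_location="cpu", weights_only=True)
# safetensors rejects non-contiguous or storage-sharing tensors, so copy them.
state_dict = {k: v.contiguous().clone() for k, v in state_dict.items()}
save_file(state_dict, "diffusion_pytorch_model.safetensors")
```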
Cc: @sayakpaul @dg845 Co-author: @HeliosZhao
Code snippet:
import torch
from diffusers import FlowMatchEulerDiscreteScheduler, LongSanaVideoPipeline
from diffusers.utils import export_to_video
pipe = LongSanaVideoPipeline.from_pretrained("Efficient-Large-Model/SANA-Video_2B_480p_LongLive_diffusers", torch_dtype=torch.bfloat16)
pipe.scheduler = FlowMatchEulerDiscreteScheduler()
pipe.vae.to(torch.float32)
pipe.text_encoder.to(torch.bfloat16)
pipe.to("cuda")
prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
video = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
height=480,
width=832,
frames=161,
guidance_scale=1.0,
timesteps=[1000, 960, 889, 727, 0], # Multi-step denoising per chunk
generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, "longsana.mp4", fps=16)
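Note that `timesteps=[1000, 960, 889, 727, 0]` is the multi-step denoising schedule applied to each chunk of frames (per the inline comment); a shorter list should trade some quality for lower per-chunk latency.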
> FlashAttention support may require a separate PR
We can actually leverage our attention backends: https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends
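For example, on a recent diffusers release (and with the flash-attn package installed), switching the transformer's backend is a one-liner; the exact backend string used here is an assumption and may differ by version, so check the linked docs:

```python
# Assumes a recent diffusers release with attention-backend support
# and flash-attn installed; see the docs linked above for valid names.
pipe.transformer.set_attention_backend("flash")  # FlashAttention-2
```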
Is the KV cache supported in any of the backends? Actually, the KV-cache part of this PR is not well organized, so we would appreciate your help restructuring it to better match the diffusers style.
Gentle ping @dg845
Hi @lawrence-cj, is the Efficient-Large-Model/SANA-Video_2B_480p_LongLive_diffusers model available on HF Hub? If I try the sample code above, I get an error when trying to load the checkpoint with LongSanaVideoPipeline.from_pretrained. On the hub, I see that there is a Efficient-Large-Model/SANA-Video_2B_480p_LongLive repo but it doesn't look like there is a diffusers variant.
There is an Efficient-Large-Model/SANA-Video_2B_480p_LongLive_diffusers repo for the diffusers pipeline, but it's currently private. Can you access it through an internal API?
Hi @lawrence-cj, I don't think I can access it unless I'm specifically given permission (for example, via a read access token).