Open-Sora
Use PyTorch scaled_dot_product_attention when Flash Attention is not available
In the current code, if Flash Attention is not available, models.layers.blocks.Attention falls back to a naive attention implementation that materializes the full attention matrix, which OOMs immediately even for modest generation lengths.
Could we fall back to torch.nn.functional.scaled_dot_product_attention instead (roughly along the lines of the sketch below), so that users on pre-Ampere GPUs can run inference as well?
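A minimal sketch of what the fallback could look like. This is not the actual Open-Sora `Attention` code: the `HAS_FLASH_ATTN` flag, the standalone `attention` helper, and the `(batch, heads, seq_len, head_dim)` layout are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # optional dependency
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False


def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
              dropout_p: float = 0.0) -> torch.Tensor:
    """Attention over tensors shaped (batch, heads, seq_len, head_dim)."""
    if HAS_FLASH_ATTN and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        # flash_attn_func expects (batch, seq_len, heads, head_dim)
        out = flash_attn_func(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
            dropout_p=dropout_p,
        )
        return out.transpose(1, 2)
    # Fallback: PyTorch's fused SDPA (available since 2.0). Its
    # memory-efficient backend avoids materializing the full
    # (seq_len x seq_len) attention matrix, unlike the naive
    # softmax(q @ k.T) path, and it runs on pre-Ampere GPUs.
    return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
```

With this kind of guard, flash_attn stays the fast path when it is installed and the dtype/device support it, while everyone else gets the fused PyTorch kernel instead of the quadratic-memory naive path.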
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.