Open-Sora
Use PyTorch scaled_dot_product_attention when Flash Attention is not available
In the current code, if Flash Attention is not available, models.layers.blocks.Attention falls back to a naive attention implementation that materializes the full attention matrix, which OOMs immediately even for modest generation lengths.
Could we fall back to torch.nn.functional.scaled_dot_product_attention instead (roughly along the lines of the sketch below), so that users on pre-Ampere GPUs can run inference as well?
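A minimal sketch of what the fallback could look like. This is not the actual Open-Sora `Attention` code: the `HAS_FLASH_ATTN` flag, the standalone `attention` helper, and the `(batch, heads, seq_len, head_dim)` layout are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # optional dependency
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False


def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
              dropout_p: float = 0.0) -> torch.Tensor:
    """Attention over tensors shaped (batch, heads, seq_len, head_dim)."""
    if HAS_FLASH_ATTN and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        # flash_attn_func expects (batch, seq_len, heads, head_dim)
        out = flash_attn_func(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
            dropout_p=dropout_p,
        )
        return out.transpose(1, 2)
    # Fallback: PyTorch's fused SDPA (available since 2.0). Its
    # memory-efficient backend avoids materializing the full
    # (seq_len x seq_len) attention matrix, unlike the naive
    # softmax(q @ k.T) path, and it runs on pre-Ampere GPUs.
    return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
```

With this kind of guard, flash_attn stays the fast path when it is installed and the dtype/device support it, while everyone else gets the fused PyTorch kernel instead of the quadratic-memory naive path.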
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.