
Inference on older microarchitectures: issue with FlashAttention

Open kldev2000 opened this issue 7 months ago • 2 comments

Hello everyone, thanks for this amazing work!

I'm trying to run inference with the NVILA-8B model on an NVIDIA V100 GPU but am facing an issue. I understand from the model requirements that NVILA only supports certain microarchitectures, but running inference on this NVIDIA V100 GPU is a strict requirement for me.

Any solution for running inference on an NVIDIA V100 would be of great help. Thanks in advance!

I'm getting this error: `RuntimeError: FlashAttention only supports Ampere GPUs or newer.`

kldev2000 · Apr 23 '25

We do not have V100 hardware to test on. You may try replacing FlashAttention with another implementation.

Lyken17 · Apr 27 '25
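For reference, when a model loads through Hugging Face transformers, FlashAttention can often be sidestepped by requesting a different attention backend at load time. A minimal sketch, assuming the checkpoint loads via `AutoModelForCausalLM` (the repo id below is assumed from the model name in this issue, and whether VILA's custom loading path honors `attn_implementation` is untested):

```python
# Sketch: load the model without FlashAttention by picking a different
# attention backend. Assumes the checkpoint loads through Hugging Face
# transformers; the repo id is an assumption based on the model name here.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Efficient-Large-Model/NVILA-8B",  # assumed Hugging Face repo id
    attn_implementation="sdpa",        # PyTorch scaled_dot_product_attention
    trust_remote_code=True,
)
```

`"sdpa"` uses PyTorch's built-in `scaled_dot_product_attention`, which runs on pre-Ampere GPUs such as the V100; `"eager"` is a plain PyTorch fallback if SDPA is also unavailable.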

Thanks for the quick and helpful response. Can you please help me find which code files FlashAttention is used in, so that I can replace it with an alternative implementation? Thanks in advance.

kldev2000 · May 06 '25
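As a starting point for that search, a short script can list every Python file that references FlashAttention. A sketch, assuming it is run from the VILA repository root:

```python
# Sketch: locate FlashAttention usages in the repo. Matches common
# spellings such as "flash_attn" and "FlashAttention".
import pathlib

for path in sorted(pathlib.Path(".").rglob("*.py")):
    low = path.read_text(encoding="utf-8", errors="ignore").lower()
    if "flash_attn" in low or "flashattention" in low:
        print(path)
```

The attention backend is typically selected where the language-model config is built, so the files this prints are the natural place to substitute an alternative implementation.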