Solution for 'FlashAttention only supports Ampere GPUs or newer' Error on V100 GPUs
Hi,
I encountered an issue while trying to run the `InternVL2-8B` model on an NVIDIA V100 GPU, where I received the error `FlashAttention only supports Ampere GPUs or newer`.
I found a solution in the issues section of the `Internvl-chat-1.2-plus` model and applied it successfully. Here are the steps I took (a scripted version of these edits is sketched at the end of this post):
- Go to `config.json` in the folder with the weights downloaded from Hugging Face.
- Delete the line `"attn_implementation": "flash_attention_2"`.
- Set `"use_flash_attn": false`.
This resolved the issue, and the model is now running successfully on my V100 GPU.
I hope this helps others who might face the same problem.
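For anyone who prefers to script the change, here is a minimal sketch of the two `config.json` edits above. The path is a placeholder, and depending on the checkpoint the same keys may also appear in nested sub-configs:

```python
# Minimal sketch: apply the config.json edits described above.
# "InternVL2-8B/config.json" is a placeholder path -- point it at your local weights folder.
import json

config_path = "InternVL2-8B/config.json"

with open(config_path) as f:
    config = json.load(f)

# Remove the FlashAttention implementation hint if present, then disable it explicitly.
config.pop("attn_implementation", None)
config["use_flash_attn"] = False

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```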
Thank you for your feedback
I met the same problem today. I am so grateful that such a solution exists! Thanks, bro.
The author has now updated the code, so you can decide whether to use flash attention by setting `use_flash_attn`:

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,  # set to False on pre-Ampere GPUs such as the V100
    trust_remote_code=True).eval().cuda()
Yes, when using V100 GPUs, you can manually disable flash attention by setting `use_flash_attn=False`.
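Putting the two replies together, a minimal loading sketch for a V100 might look like this; the switch to `torch.float16` is my assumption, since bfloat16 kernels are only natively supported on Ampere or newer GPUs:

```python
import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.float16,   # assumption: fp16 instead of bf16 on pre-Ampere GPUs
    low_cpu_mem_usage=True,
    use_flash_attn=False,        # disable FlashAttention on the V100
    trust_remote_code=True).eval().cuda()
```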
Sorry, it doesn't work with the newest transformers release or the latest Git repository. Could you tell me how to set up the environment?
But how can I fine-tune the model without flash-attn? Thanks!