Solution for 'FlashAttention only supports Ampere GPUs or newer' Error on V100 GPUs

Open vedernikovphoto opened this issue 1 year ago • 2 comments

Hi,

I encountered an issue while trying to run the InternVL2-8B model on an NVIDIA V100 GPU, where I received the error FlashAttention only supports Ampere GPUs or newer.

I found a solution in the issues section for the Internvl-chat-1.2-plus model and applied it successfully. Here are the steps I took:

  1. Open config.json in the folder containing the weights downloaded from Hugging Face.
  2. Delete the line "attn_implementation": "flash_attention_2".
  3. Set "use_flash_attn": false (a scripted version of both edits is sketched below).

This resolved the issue, and the model is now running successfully on my V100 GPU.

I hope this helps others who might face the same problem.

vedernikovphoto avatar Jul 26 '24 16:07 vedernikovphoto

Thank you for your feedback

czczup avatar Jul 30 '24 12:07 czczup

I ran into the same problem today, and I'm so grateful that such a solution exists! Thanks!

SHIMURA0 avatar Aug 06 '24 05:08 SHIMURA0

The author has since updated the code, so you can now choose whether to use flash attention by setting use_flash_attn:

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

XingWang1234 avatar Aug 25 '24 13:08 XingWang1234

Yes, when using V100 GPUs, you can manually disable flash attention by setting use_flash_attn=False.
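
For example, a minimal variation of the loading snippet above with flash attention turned off (same model path and dtype assumed):

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,  # disable flash attention on V100 GPUs
    trust_remote_code=True).eval().cuda()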

czczup avatar Aug 26 '24 06:08 czczup

Sorry, it doesn't work with the newest transformers or the latest git repository. Could you tell me how to set up the environment?

591582523h avatar Sep 19 '24 15:09 591582523h

But how can I fine-tune the model without flash-attn? Thanks!

popoyaya avatar Jan 02 '25 11:01 popoyaya