Solution for 'FlashAttention only supports Ampere GPUs or newer' Error on V100 GPUs

Open vedernikovphoto opened this issue 1 year ago • 2 comments

Hi,

I encountered an issue while trying to run the InternVL2-8B model on an NVIDIA V100 GPU, where I received the error FlashAttention only supports Ampere GPUs or newer.

I found a solution in the issues section for the Internvl-chat-1.2-plus model and applied it successfully. Here are the steps I took:

  1. Open config.json in the folder containing the weights downloaded from Hugging Face.
  2. Delete the line "attn_implementation": "flash_attention_2".
  3. Set "use_flash_attn": false (a scripted version of both edits is sketched below).

This resolved the issue, and the model is now running successfully on my V100 GPU.

I hope this helps others who might face the same problem.

vedernikovphoto avatar Jul 26 '24 16:07 vedernikovphoto

Thank you for your feedback

czczup avatar Jul 30 '24 12:07 czczup

I ran into the same problem today, and I'm so grateful that such a solution exists! Thanks!

SHIMURA0 avatar Aug 06 '24 05:08 SHIMURA0

The author has since updated the code, so you can now choose whether to use flash attention by setting use_flash_attn:

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

XingWang1234 avatar Aug 25 '24 13:08 XingWang1234

Yes, when using V100 GPUs, you can manually disable flash attention by setting use_flash_attn=False.
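
For example, a minimal variation of the loading snippet above with flash attention turned off (same model path and dtype assumed):

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,  # disable flash attention on V100 GPUs
    trust_remote_code=True).eval().cuda()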

czczup avatar Aug 26 '24 06:08 czczup

Sorry, it doesn't work with the newest transformers or the latest git repository. Could you tell me how to set up the environment?

591582523h avatar Sep 19 '24 15:09 591582523h

But how can I fine-tune the model without flash-attn? Thanks!

popoyaya avatar Jan 02 '25 11:01 popoyaya