How to load InternVL-Chat-1.2-Plus on V100
I have a machine with 8 V100 GPUs. Is there a way to load InternVL-Chat-Chinese-V1-2-Plus for inference? It seems FlashAttention cannot be installed correctly on V100.
Hi, you can set device_map='auto' to use multiple GPUs for inference.
May I ask whether you are currently running into out-of-memory issues on 8 V100 GPUs without Flash Attention?
path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True,
device_map='auto').eval()
The problem does not seem to be caused by using multiple cards; rather, it seems that inference cannot run normally without Flash Attention.
Thanks for your feedback. Could you please try to see if the model can be used normally with only 4 GPUs?
@czczup 4 GPUs are enough to load 1.2-plus, but I still get the FlashAttention error:
"RuntimeError: FlashAttention only supports Ampere GPUs or newer."
Sorry for the late reply. May I ask whether you installed Flash Attention on the V100 machine?
If so, could you please uninstall Flash Attention and try again? The code checks whether Flash Attention is installed to decide whether to call it.
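For context, here is a minimal sketch of that kind of guard (not the exact code in the InternVL repository): the flash-attn path is taken only if the package imports cleanly, and the model otherwise falls back to standard attention.

# Hypothetical illustration of an install-based guard for flash-attn.
try:
    from flash_attn import flash_attn_func  # noqa: F401
    has_flash_attn = True
except ImportError:
    has_flash_attn = False

# The attention layer would then branch on this flag and use a plain
# PyTorch attention implementation when has_flash_attn is False.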
Can we install Flash Attention on a V100 machine? How do we install it? The official Flash Attention does not support V100.
Currently, Flash Attention is likely not supported on V100. It might be possible to use the implementation from xFormers as an alternative, but we do not currently support this.
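If you want to experiment on your own, here is a minimal, self-contained sketch of xFormers memory-efficient attention. It only illustrates the kernel itself; it is not wired into InternVL's code, and it assumes xformers is installed.

import torch
import xformers.ops as xops

# Query/key/value in the (batch, seq_len, num_heads, head_dim) layout
# expected by memory_efficient_attention; fp16 works on V100.
B, M, H, K = 1, 128, 16, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)  # shape: (B, M, H, K)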
@czczup So currently we can't load InternVL-Chat-Chinese-V1-2-Plus on a V100 machine?
After uninstalling Flash Attention, there is an error: "ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2."
How can I disable FlashAttention2 to run the model on a V100 machine?
To solve the "FlashAttention only supports Ampere GPUs or newer" problem, change the content of config.json in InternVL-Chat-V1-2-Plus as follows (a scripted version is sketched after the list):
- delete "attn_implementation": "flash_attention_2"
- set "use_flash_attn": false