How to load InternVL-Chat-1.2-Plus on V100?

BIGBALLON opened this issue 1 year ago · 9 comments

I have a machine with 8 V100 GPUs. Is there a way to load InternVL-Chat-Chinese-V1-2-Plus for inference? It seems FlashAttention cannot be installed correctly on V100.

BIGBALLON · Mar 14 '24

Hi, you can set device_map='auto' to use multiple GPUs for inference.

May I ask whether you are currently running into out-of-memory issues on 8 V100 GPUs without Flash Attention?

path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto').eval()
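
As a quick sanity check (not from the thread; a minimal sketch assuming the load above succeeded), you can inspect how accelerate sharded the model and how much memory each card holds:

# Sketch: inspect the device placement chosen by device_map='auto'.
print(model.hf_device_map)  # maps module names to GPU indices

for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GiB allocated")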

czczup · Mar 16 '24

The problem does not seem to be caused by using multiple cards; rather, it seems impossible to run inference normally without Flash Attention.

BIGBALLON · Mar 16 '24

Thanks for your feedback. Could you please check whether the model can be used normally with only 4 GPUs?

czczup · Mar 17 '24

@czczup 4 GPUs are enough to load 1.2-plus, but I get the following FlashAttention error:

"RuntimeError: FlashAttention only supports Ampere GPUs or newer."

BIGBALLON · Mar 19 '24

Sorry for the late reply. I would like to ask whether you installed Flash Attention on the V100 machine.

If so, could you please uninstall Flash Attention and try again? The code checks whether Flash Attention is installed to decide whether to call it.
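
(For illustration only: the kind of guard described here typically looks like the sketch below; this is not the actual InternVL modeling code.)

# Hypothetical sketch of an import-based switch, similar in spirit to what the
# remote modeling code does; not the exact InternVL implementation.
try:
    import flash_attn  # noqa: F401
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

# A model would then pick its attention implementation from this flag,
# e.g. flash attention when available, a plain PyTorch fallback otherwise.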

czczup · Mar 25 '24

Can we install Flash Attention on a V100 machine? How would we install it? The official flash-attn package does not support V100.

bjzhb666 · Mar 27 '24

Currently, Flash Attention is not supported on V100. It might be possible to use the implementation from xformers as an alternative, but our code does not currently support this.
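
(For reference, and not something wired into InternVL: the xformers kernel mentioned above is exposed roughly as follows; its memory-efficient attention is generally reported to work on pre-Ampere GPUs such as V100.)

# Minimal sketch of xformers memory-efficient attention; inputs are
# (batch, seq_len, num_heads, head_dim). Illustration only.
import torch
import xformers.ops as xops

q = torch.randn(1, 128, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 16, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([1, 128, 16, 64])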

czczup · Mar 27 '24

@czczup So, currently, we can't load InternVL-Chat-Chinese-V1-2-Plus on a V100 machine.

BIGBALLON · Mar 28 '24

After uninstalling Flash Attention, there is an error: "ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2."

How can I disable FlashAttention2 so that the model runs on a V100 machine?

FJR-Nancy · Apr 30 '24

To fix the "FlashAttention only supports Ampere GPUs or newer" error, the following works:


Change config.json in InternVL-Chat-V1-2-Plus as follows (a script that applies the same edits is sketched after the list):

  1. delete "attn_implementation": "flash_attention_2"
  2. set "use_flash_attn": false

NiYueLiuFeng · May 08 '24