How to load InternVL-Chat-1.2-Plus on V100
I have a machine with 8 V100 GPUs. Is there a way to load InternVL-Chat-Chinese-V1-2-Plus for inference? It seems FlashAttention cannot be installed correctly on V100.
Hi, you can set device_map='auto' to use multiple GPUs for inference.
May I ask whether you are currently running into out-of-memory issues on 8 V100 GPUs without Flash Attention?
path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True,
device_map='auto').eval()
The problem does not seem to be caused by using multiple cards; rather, it seems that inference cannot run normally without Flash Attention.
Thanks for your feedback. Could you please try to see if the model can be used normally with only 4 GPUs?
@czczup 4 GPUs are enough to load 1.2-plus, but I still get the FlashAttention error:
"RuntimeError: FlashAttention only supports Ampere GPUs or newer."
Sorry for the late reply. May I ask whether you installed Flash Attention on the V100 machine?
If so, could you please uninstall Flash Attention and try again? The code checks whether Flash Attention is installed to decide whether to call it.
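For context, here is a minimal sketch of that kind of guard (not the exact code in the InternVL repository): the flash-attn path is taken only if the package imports cleanly, and the model otherwise falls back to standard attention.

# Hypothetical illustration of an install-based guard for flash-attn.
try:
    from flash_attn import flash_attn_func  # noqa: F401
    has_flash_attn = True
except ImportError:
    has_flash_attn = False

# The attention layer would then branch on this flag and use a plain
# PyTorch attention implementation when has_flash_attn is False.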
Can we install Flash Attention on a V100 machine? How do we install it? The official Flash Attention does not support V100.
Currently, Flash Attention is likely not supported on V100. It might be possible to use the implementation from xFormers as an alternative, but we do not currently support this.
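If you want to experiment on your own, here is a minimal, self-contained sketch of xFormers memory-efficient attention. It only illustrates the kernel itself; it is not wired into InternVL's code, and it assumes xformers is installed.

import torch
import xformers.ops as xops

# Query/key/value in the (batch, seq_len, num_heads, head_dim) layout
# expected by memory_efficient_attention; fp16 works on V100.
B, M, H, K = 1, 128, 16, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)  # shape: (B, M, H, K)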
@czczup So currently we can't load InternVL-Chat-Chinese-V1-2-Plus on a V100 machine?
After uninstalling Flash Attention, there is an error: "ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2."
How can I disable FlashAttention2 to run the model on a V100 machine?
To solve the "FlashAttention only supports Ampere GPUs or newer" problem, change the content of config.json in InternVL-Chat-V1-2-Plus as follows (a scripted version is sketched after the list):
- delete "attn_implementation": "flash_attention_2"
- set "use_flash_attn": false