Does not support flash attention 2.0 on transformers.AutoModelForCausalLM.from_pretrained
🚀 The feature, motivation and pitch
I am using OLMo 7B for RAG, aiming for efficient inference on limited GPU resources, but the model does not support Flash Attention 2.0. Here is the code:
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    device_map='auto',
    use_flash_attention_2="flash_attention_2",
    use_auth_token=hf_auth,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-96fef6444c74> in <cell line: 1>()
----> 1 model = transformers.AutoModelForCausalLM.from_pretrained(
      2     model_id,
      3     config=model_config,
      4     device_map='auto',
      5     use_flash_attention_2="flash_attention_2",

(3 frames collapsed)

/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in _check_and_enable_flash_attn_2(cls, config, torch_dtype, device_map, check_device_map, hard_check_only)
   1465         """
   1466         if not cls._supports_flash_attn_2:
-> 1467             raise ValueError(
   1468                 f"{cls.__name__} does not support Flash Attention 2.0 yet. Please request to add support where"
   1469                 f" the model is hosted, on its model hub page: https://huggingface.co/{config._name_or_path}/discussions/new"

ValueError: OLMoForCausalLM does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/allenai/OLMo-7B/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new
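For reference, here is a minimal sketch (not part of my original setup) of the call the deprecation warning recommends, using attn_implementation instead of the deprecated use_flash_attention_2 kwarg, with a fallback to the model's default attention when Flash Attention 2 is rejected. The model id is taken from the error message, and the auth/quantization arguments are omitted for brevity; until OLMoForCausalLM gains Flash Attention 2 support, the first attempt is expected to raise the same ValueError.

import transformers

model_id = "allenai/OLMo-7B"  # model id taken from the error message above

try:
    # Non-deprecated way to request Flash Attention 2, per the warning above.
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        attn_implementation="flash_attention_2",
    )
except ValueError:
    # Currently raised: "OLMoForCausalLM does not support Flash Attention 2.0 yet."
    # Fall back to the model's default attention implementation.
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
    )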
Alternatives
No response
Additional context
No response