Does not support flash attention 2.0 on transformers.AutoModelForCausalLM.from_pretrained
🚀 The feature, motivation and pitch
I am using OLMo 7B for RAG, aiming for efficient inference on limited GPU resources, but the model does not support Flash Attention 2.0. Here is the code:
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    device_map='auto',
    use_flash_attention_2="flash_attention_2",
    use_auth_token=hf_auth,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-96fef6444c74> in <cell line: 1>()
----> 1 model = transformers.AutoModelForCausalLM.from_pretrained(
      2     model_id,
      3     config=model_config,
      4     device_map='auto',
      5     use_flash_attention_2="flash_attention_2",

(3 frames collapsed)

/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in _check_and_enable_flash_attn_2(cls, config, torch_dtype, device_map, check_device_map, hard_check_only)
   1465         """
   1466         if not cls._supports_flash_attn_2:
-> 1467             raise ValueError(
   1468                 f"{cls.__name__} does not support Flash Attention 2.0 yet. Please request to add support where"
   1469                 f" the model is hosted, on its model hub page: https://huggingface.co/{config._name_or_path}/discussions/new"

ValueError: OLMoForCausalLM does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/allenai/OLMo-7B/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new
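For reference, here is a minimal sketch (not part of my original setup) of the call the deprecation warning recommends, using attn_implementation instead of the deprecated use_flash_attention_2 kwarg, with a fallback to the model's default attention when Flash Attention 2 is rejected. The model id is taken from the error message, and the auth/quantization arguments are omitted for brevity; until OLMoForCausalLM gains Flash Attention 2 support, the first attempt is expected to raise the same ValueError.

import transformers

model_id = "allenai/OLMo-7B"  # model id taken from the error message above

try:
    # Non-deprecated way to request Flash Attention 2, per the warning above.
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        attn_implementation="flash_attention_2",
    )
except ValueError:
    # Currently raised: "OLMoForCausalLM does not support Flash Attention 2.0 yet."
    # Fall back to the model's default attention implementation.
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
    )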
Alternatives
No response
Additional context
No response