optimum
OLMo 7B by AI2 currently does not support BetterTransformer for optimized inference
Feature request
I am developing an open-source RAG pipeline that uses OLMo 7B. However, I've run into GPU memory limitations, which prompted me to apply Quantization, Flash Attention, and BetterTransformer for optimized inference. Unfortunately, I've hit an obstacle: BetterTransformer does not currently support the OLMo architecture. A sketch of the setup I am attempting is below.
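For reference, a minimal sketch of what I am trying to do (the quantization settings are illustrative, and OLMo ships custom modeling code, hence `trust_remote_code=True`); the final `BetterTransformer.transform` call is where the missing support shows up:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from optimum.bettertransformer import BetterTransformer

# 4-bit quantization to work around the GPU memory limits (settings illustrative)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,  # OLMo uses custom modeling code
)

# This is where it fails: the OLMo architecture is not in BetterTransformer's
# list of supported models, so transform() raises an error.
model = BetterTransformer.transform(model)
```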
Motivation
The proposal aims to improve the performance of the RAG pipeline given the GPU limitations. Optimization techniques such as Quantization, Flash Attention, and BetterTransformer are crucial here, but the lack of BetterTransformer support for OLMo is a barrier to achieving optimal efficiency. This feature request seeks to remove that barrier and enable smoother execution of the pipeline.
Your contribution
Yes, I will.
Hi @KaifAhmad1, thank you for the report.
OLMo appears to be a model that is not natively supported in Transformers; it relies on custom modeling code instead: https://huggingface.co/allenai/OLMo-7B/tree/main.
For text generation, BetterTransformer only amounts to using SDPA (scaled dot-product attention), and SDPA support is being added natively in Transformers (see https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention). I don't think we will extend BetterTransformer further (especially for custom models).
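As a rough sketch (the checkpoint below is just an example of an architecture with native SDPA support in Transformers), the native route looks like this, with no BetterTransformer call involved:

```python
from transformers import AutoModelForCausalLM

# For architectures supported natively in Transformers, SDPA can be requested
# directly at load time; no BetterTransformer.transform() call is needed.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example checkpoint with native SDPA support
    torch_dtype="auto",
    attn_implementation="sdpa",
)
```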
I suggest you reach out to AllenAI so that they use SDPA in their modeling code (though they appear to do so already: https://github.com/allenai/OLMo/blob/922db6aa17ec05de9f4d8e6e9799f80384021dc4/olmo/model.py#L518C9-L540).
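For context, the primitive that both BetterTransformer and the linked OLMo code rely on is `torch.nn.functional.scaled_dot_product_attention`; a minimal standalone call (shapes and dtypes are illustrative) looks like:

```python
import torch
import torch.nn.functional as F

# PyTorch's fused scaled dot-product attention; it dispatches to Flash /
# memory-efficient kernels when they are available on the hardware.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```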