optimum
OLMo 7B by AI2 currently does not support BetterTransformer for optimized inference
Feature request
I am developing an open-source RAG pipeline that uses OLMo 7B. However, I've run into GPU memory limitations, which prompted me to apply Quantization, Flash Attention, and BetterTransformer for optimized inference. Unfortunately, I've hit an obstacle: BetterTransformer does not currently support the OLMo architecture. A sketch of the setup I am attempting is below.
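For reference, a minimal sketch of what I am trying to do (the quantization settings are illustrative, and OLMo ships custom modeling code, hence `trust_remote_code=True`); the final `BetterTransformer.transform` call is where the missing support shows up:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from optimum.bettertransformer import BetterTransformer

# 4-bit quantization to work around the GPU memory limits (settings illustrative)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,  # OLMo uses custom modeling code
)

# This is where it fails: the OLMo architecture is not in BetterTransformer's
# list of supported models, so transform() raises an error.
model = BetterTransformer.transform(model)
```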
Motivation
The proposal aims to improve the performance of the RAG pipeline given the GPU limitations. Optimization techniques such as Quantization, Flash Attention, and BetterTransformer are crucial here, but the lack of BetterTransformer support for OLMo is a barrier to achieving optimal efficiency. This feature request seeks to remove that barrier and enable smoother execution of the pipeline.
Your contribution
Yes, I will.
Hi @KaifAhmad1, thank you for the report.
OLMo appears to be a model that is not natively supported in Transformers; it relies on custom modeling code instead: https://huggingface.co/allenai/OLMo-7B/tree/main.
For text generation, BetterTransformer only amounts to using SDPA (scaled dot-product attention), and SDPA support is being added natively in Transformers (see https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention). I don't think we will extend BetterTransformer further (especially for custom models).
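As a rough sketch (the checkpoint below is just an example of an architecture with native SDPA support in Transformers), the native route looks like this, with no BetterTransformer call involved:

```python
from transformers import AutoModelForCausalLM

# For architectures supported natively in Transformers, SDPA can be requested
# directly at load time; no BetterTransformer.transform() call is needed.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example checkpoint with native SDPA support
    torch_dtype="auto",
    attn_implementation="sdpa",
)
```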
I suggest you reach out to AllenAI so that they use SDPA in their modeling code (though they appear to do so already: https://github.com/allenai/OLMo/blob/922db6aa17ec05de9f4d8e6e9799f80384021dc4/olmo/model.py#L518C9-L540).
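For context, the primitive that both BetterTransformer and the linked OLMo code rely on is `torch.nn.functional.scaled_dot_product_attention`; a minimal standalone call (shapes and dtypes are illustrative) looks like:

```python
import torch
import torch.nn.functional as F

# PyTorch's fused scaled dot-product attention; it dispatches to Flash /
# memory-efficient kernels when they are available on the hardware.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```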