bettertransformer throws RuntimeError with transformers>4.49
System Info
Trying to use the latest optimum v1.25.3 with Infinity embedding server 0.0.76 and transformers v4.51.3 inside an Ubuntu-based Docker image with torch 2.7.0.
Who can help?
Per #2262 support for transformers>=4.51 was implemented. I hoped that would resolve the RuntimeError I get when starting the Infinity embedding server. Specifically I see:

```
INFO 2025-05-22 03:07:15,886 datasets INFO: PyTorch version 2.7.0 available.    config.py:54
Traceback (most recent call last):
  File "/usr/local/bin/infinity_emb", line 5, in <module>
    from infinity_emb.cli import cli
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/__init__.py", line 27, in <module>
    from infinity_emb.engine import AsyncEmbeddingEngine, AsyncEngineArray  # noqa: E402
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/engine.py", line 11, in <module>
    from infinity_emb.inference import (
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/inference/__init__.py", line 4, in <module>
    from infinity_emb.inference.batch_handler import BatchHandler
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/inference/batch_handler.py", line 39, in <module>
    from infinity_emb.transformer.utils import get_lengths_with_tokenize
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/utils.py", line 9, in <module>
    from infinity_emb.transformer.classifier.torch import SentenceClassifier
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/classifier/torch.py", line 8, in <module>
    from infinity_emb.transformer.acceleration import (
  File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/__init__.py", line 20, in <module>
    raise RuntimeError(
RuntimeError: BetterTransformer requires transformers<4.49 but found 4.51.3. optimum.bettertransformer is deprecated and will be removed in optimum v2.0.
```

However, even with the latest optimum version the issue persists.
I think this is the problem: https://github.com/huggingface/optimum/blob/e15053d33e60f42bb87389a869c3a9d823ea972f/optimum/bettertransformer/__init__.py#L19. Can this be updated to cover newer versions of transformers?
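For reference, judging from the error message above, the import-time guard presumably looks roughly like this (a sketch reconstructed from the traceback; the exact helper names in optimum may differ):

```python
# Sketch of the version guard in optimum/bettertransformer/__init__.py,
# reconstructed from the error message (not the exact optimum source).
from packaging import version

import transformers

if version.parse(transformers.__version__) >= version.parse("4.49"):
    raise RuntimeError(
        f"BetterTransformer requires transformers<4.49 but found "
        f"{transformers.__version__}. optimum.bettertransformer is deprecated "
        "and will be removed in optimum v2.0."
    )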
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
Install Infinity and force the newest version of optimum: `pip install "optimum>=1.25.3" "infinity-emb[server,torch,optimum,einops,cache]"`. Then try to run Infinity with `infinity_emb v2 --model-id BAAI/bge-small-en-v1.5`.
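As a sanity check, the error can also be reproduced without Infinity at all, since the guard fires at import time (assuming transformers 4.51.3 and optimum 1.25.3 are installed):

```python
# Minimal reproduction: the RuntimeError is raised as soon as the
# subpackage is imported, before any model is touched.
import optimum.bettertransformer  # RuntimeError: BetterTransformer requires transformers<4.49 ...
```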
Expected behavior
I expect Infinity to start without a runtime error when using optimum 1.25.3 and transformers 4.51.3.
True, so far optimum is pinned to v1.24.0 - will take a look.
BetterTransformer has been deprecated for a couple of versions now; transformers already implements SDPA and other attention implementations like Flash Attention v1/v2/v3.
I still think this is a bug. If something is deprecated, it should emit a `DeprecationWarning`, not raise a `RuntimeError`.
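A minimal sketch of what that could look like instead (same version check as above, but non-fatal):

```python
# Sketch: warn on import instead of raising, so downstream packages that
# merely import optimum.bettertransformer keep working.
import warnings

from packaging import version

import transformers

if version.parse(transformers.__version__) >= version.parse("4.49"):
    warnings.warn(
        "optimum.bettertransformer is deprecated, untested with "
        f"transformers {transformers.__version__}, and will be removed "
        "in optimum v2.0.",
        DeprecationWarning,
        stacklevel=2,
    )
```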
Dirty one-liner fix (with sharkdp/fd):
```sh
sed -i 's/raise RuntimeError/print/g' $(fd -HI __init__.py | grep 'bettertransformer/_')
```
we are removing it (bettertransformer) in the next version 🤗
please use transformers' attention implementations: https://huggingface.co/docs/transformers/main/en/llm_optims#attention and torch.compile (with a static cache if the model is a decoder): https://huggingface.co/docs/transformers/main/en/llm_optims#static-kv-cache-and-torchcompile for the best possible performance (exceeding BetterTransformer, which no one maintains!).
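For an embedding model like the one in this issue, a minimal sketch of that replacement path could look like this (model ID taken from the reproduction above; CLS pooling is an assumption based on how BGE models are typically used):

```python
# Sketch: transformers' built-in SDPA attention plus torch.compile, instead
# of the deprecated BetterTransformer path (encoder model, so no static KV
# cache is needed).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-small-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, attn_implementation="sdpa").eval()
model = torch.compile(model)  # optional, pays off after the first (warmup) call

inputs = tokenizer(["hello world"], return_tensors="pt")
with torch.inference_mode():
    # CLS-token pooling (assumption; matches common usage of BGE models).
    embeddings = model(**inputs).last_hidden_state[:, 0]
print(embeddings.shape)
```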
This might come across as nitpicky... but why is the documentation telling me to use a deprecated feature?
"For comparison, letโs run the same function, but enable Flash Attention instead. To do so, we convert the model to BetterTransformer and by doing so enabling PyTorchโs which in turn is able to use Flash Attention.
model.to_bettertransformer()"
Source: https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization