bettertransformer throws RuntimeError with transformers>4.49
System Info
Trying to use the latest optimum v1.25.3 with Infinity embedding server 0.0.76 and transformers v4.51.3 inside an Ubuntu-based Docker image with torch 2.7.0.
Who can help?
Per #2262 support for transformers>=4.51 was implemented. I hoped that would resolve the RuntimeError I get when starting the Infinity embedding server. Specifically I see:

```
INFO 2025-05-22 03:07:15,886 datasets INFO: PyTorch version 2.7.0 available.    config.py:54
Traceback (most recent call last):
  File "/usr/local/bin/infinity_emb", line 5, in <module>
    from infinity_emb.cli import cli
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/__init__.py", line 27, in <module>
    from infinity_emb.engine import AsyncEmbeddingEngine, AsyncEngineArray  # noqa: E402
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/engine.py", line 11, in <module>
    from infinity_emb.inference import (
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/inference/__init__.py", line 4, in <module>
    from infinity_emb.inference.batch_handler import BatchHandler
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/inference/batch_handler.py", line 39, in <module>
    from infinity_emb.transformer.utils import get_lengths_with_tokenize
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/utils.py", line 9, in <module>
    from infinity_emb.transformer.classifier.torch import SentenceClassifier
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/classifier/torch.py", line 8, in <module>
    from infinity_emb.transformer.acceleration import (
  File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/__init__.py", line 20, in <module>
    raise RuntimeError(
RuntimeError: BetterTransformer requires transformers<4.49 but found 4.51.3. optimum.bettertransformer is deprecated and will be removed in optimum v2.0.
```

However, even with the latest optimum version the issue persists.
I think this is the problem: https://github.com/huggingface/optimum/blob/e15053d33e60f42bb87389a869c3a9d823ea972f/optimum/bettertransformer/__init__.py#L19. Can this be updated to cover newer versions of transformers?
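For reference, judging from the error message above, the import-time guard presumably looks roughly like this (a sketch reconstructed from the traceback; the exact helper names in optimum may differ):

```python
# Sketch of the version guard in optimum/bettertransformer/__init__.py,
# reconstructed from the error message (not the exact optimum source).
from packaging import version

import transformers

if version.parse(transformers.__version__) >= version.parse("4.49"):
    raise RuntimeError(
        f"BetterTransformer requires transformers<4.49 but found "
        f"{transformers.__version__}. optimum.bettertransformer is deprecated "
        "and will be removed in optimum v2.0."
    )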
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
Install Infinity and force the newest version of optimum: `pip install "optimum>=1.25.3" "infinity-emb[server,torch,optimum,einops,cache]"`. Then try to run Infinity with `infinity_emb v2 --model-id BAAI/bge-small-en-v1.5`.
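As a sanity check, the error can also be reproduced without Infinity at all, since the guard fires at import time (assuming transformers 4.51.3 and optimum 1.25.3 are installed):

```python
# Minimal reproduction: the RuntimeError is raised as soon as the
# subpackage is imported, before any model is touched.
import optimum.bettertransformer  # RuntimeError: BetterTransformer requires transformers<4.49 ...
```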
Expected behavior
I expect Infinity to start without a runtime error when using optimum 1.25.3 and transformers 4.51.3.
True, so far optimum is pinned to v1.24.0 - will take a look.
BetterTransformer has been deprecated for a couple of versions now; transformers already implements SDPA and other attention implementations like Flash Attention v1/v2/v3.
I still think this is a bug. If something is deprecated, it should emit a `DeprecationWarning`, not raise a `RuntimeError`.
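A minimal sketch of what that could look like instead (same version check as above, but non-fatal):

```python
# Sketch: warn on import instead of raising, so downstream packages that
# merely import optimum.bettertransformer keep working.
import warnings

from packaging import version

import transformers

if version.parse(transformers.__version__) >= version.parse("4.49"):
    warnings.warn(
        "optimum.bettertransformer is deprecated, untested with "
        f"transformers {transformers.__version__}, and will be removed "
        "in optimum v2.0.",
        DeprecationWarning,
        stacklevel=2,
    )
```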
Dirty one-liner fix (with sharkdp/fd):
```sh
sed -i 's/raise RuntimeError/print/g' $(fd -HI __init__.py | grep 'bettertransformer/_')
```
we are removing it (bettertransformer) in the next version 🤗
please use transformers' attention implementations: https://huggingface.co/docs/transformers/main/en/llm_optims#attention and torch.compile (with a static cache if the model is a decoder): https://huggingface.co/docs/transformers/main/en/llm_optims#static-kv-cache-and-torchcompile for the best possible performance (exceeding BetterTransformer, which no one maintains!).
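For an embedding model like the one in this issue, a minimal sketch of that replacement path could look like this (model ID taken from the reproduction above; CLS pooling is an assumption based on how BGE models are typically used):

```python
# Sketch: transformers' built-in SDPA attention plus torch.compile, instead
# of the deprecated BetterTransformer path (encoder model, so no static KV
# cache is needed).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-small-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, attn_implementation="sdpa").eval()
model = torch.compile(model)  # optional, pays off after the first (warmup) call

inputs = tokenizer(["hello world"], return_tensors="pt")
with torch.inference_mode():
    # CLS-token pooling (assumption; matches common usage of BGE models).
    embeddings = model(**inputs).last_hidden_state[:, 0]
print(embeddings.shape)
```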
This might come across as nitpicky... but why is the documentation telling me to use a deprecated feature?
"For comparison, letโs run the same function, but enable Flash Attention instead. To do so, we convert the model to BetterTransformer and by doing so enabling PyTorchโs which in turn is able to use Flash Attention.
model.to_bettertransformer()"
Source: https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization