
bettertransformer throws RuntimeError with transformers>4.49

Open mtrmarko opened this issue 7 months ago • 3 comments

System Info

Trying to use latest optimum v1.25.3 with Infinity embedding server 0.0.76 and transformers v4.51.3 inside an Ubuntu based Docker image with torch 2.7.0.

Who can help?

Per #2262 support for transformers>=4.51 was implemented. I hoped that would resolve the RuntimeError I get when starting the Infinity embedding server. Specifically I see:

INFO 2025-05-22 03:07:15,886 datasets INFO: PyTorch version 2.7.0 available. (config.py:54)
Traceback (most recent call last):
  File "/usr/local/bin/infinity_emb", line 5, in <module>
    from infinity_emb.cli import cli
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/__init__.py", line 27, in <module>
    from infinity_emb.engine import AsyncEmbeddingEngine, AsyncEngineArray  # noqa: E402
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/engine.py", line 11, in <module>
    from infinity_emb.inference import (
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/inference/__init__.py", line 4, in <module>
    from infinity_emb.inference.batch_handler import BatchHandler
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/inference/batch_handler.py", line 39, in <module>
    from infinity_emb.transformer.utils import get_lengths_with_tokenize
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/utils.py", line 9, in <module>
    from infinity_emb.transformer.classifier.torch import SentenceClassifier
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/classifier/torch.py", line 8, in <module>
    from infinity_emb.transformer.acceleration import (
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/acceleration.py", line 11, in <module>
    from optimum.bettertransformer import (  # type: ignore[import-untyped]
  File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/__init__.py", line 20, in <module>
    raise RuntimeError(
RuntimeError: BetterTransformer requires transformers<4.49 but found 4.51.3. optimum.bettertransformer is deprecated and will be removed in optimum v2.0.

However, even with the latest optimum version the issue persists.

I think this is the problem: https://github.com/huggingface/optimum/blob/e15053d33e60f42bb87389a869c3a9d823ea972f/optimum/bettertransformer/__init__.py#L19. Can this check be updated to cover newer versions of transformers?
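For reference, a hypothetical sketch of what such a guard might look like, reconstructed only from the error message above and not copied from the actual optimum source:

    # Hypothetical reconstruction of the version guard (not actual optimum code).
    from packaging import version
    import transformers

    if version.parse(transformers.__version__) >= version.parse("4.49"):
        raise RuntimeError(
            f"BetterTransformer requires transformers<4.49 but found "
            f"{transformers.__version__}. optimum.bettertransformer is deprecated "
            "and will be removed in optimum v2.0."
        )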

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Install Infinity and force the newest version of optimum: pip install "optimum>=1.25.3" "infinity-emb[server,torch,optimum,einops,cache]" (quoting prevents the shell from interpreting >= and the brackets). Then try to run Infinity with infinity_emb v2 --model-id BAAI/bge-small-en-v1.5

Expected behavior

I expect Infinity to start without a runtime error when using optimum 1.25.3 and transformers 4.51.3.

mtrmarko avatar May 22 '25 04:05 mtrmarko

True, so far v1.24.0 of optimum is pinned - will take a look.

michaelfeil avatar May 22 '25 15:05 michaelfeil

BetterTransformer has been deprecated for a couple of versions now; transformers already implements SDPA and other attention implementations such as Flash Attention v1/v2/v3.
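For example, the backend can be selected directly when loading a model. A minimal sketch, reusing the model id from the reproduction above ("flash_attention_2" additionally requires the flash-attn package):

    # Select an attention backend natively in transformers; no BetterTransformer needed.
    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        "BAAI/bge-small-en-v1.5",
        attn_implementation="sdpa",  # or "eager", or "flash_attention_2"
    )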

IlyasMoutawwakil avatar May 25 '25 20:05 IlyasMoutawwakil

I still think this is a bug. If something is deprecated it should emit a DeprecationWarning, not raise a RuntimeError.
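Something along these lines: a minimal sketch of the warn-instead-of-raise behaviour being suggested.

    # Suggested alternative: emit a warning on import instead of hard-failing.
    import warnings

    warnings.warn(
        "optimum.bettertransformer is deprecated and will be removed in optimum v2.0; "
        "use transformers' native attention implementations instead.",
        DeprecationWarning,
        stacklevel=2,
    )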

goatsweater avatar May 28 '25 18:05 goatsweater

Dirty one-liner fix (with sharkdp/fd):

sed -i 's/raise RuntimeError/print/g' $(fd -HI __init__.py | grep 'bettertransformer/_')

fanyang89 avatar Jun 25 '25 10:06 fanyang89

we are removing it (bettertransformer) in the next version 🤗

IlyasMoutawwakil avatar Jun 25 '25 10:06 IlyasMoutawwakil

please use transformers' attention implementation: https://huggingface.co/docs/transformers/main/en/llm_optims#attention and torch.compile (with a static cache for decoder models): https://huggingface.co/docs/transformers/main/en/llm_optims#static-kv-cache-and-torchcompile for the best possible performance (exceeding BetterTransformer, which no one maintains! 💀).
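Putting both suggestions together: a minimal sketch following the linked docs, with a placeholder decoder model id.

    # SDPA attention + torch.compile with a static KV cache, per the linked docs.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder model works
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="sdpa")

    model.generation_config.cache_implementation = "static"  # static KV cache
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

    inputs = tokenizer("Hello", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=16)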

IlyasMoutawwakil avatar Jun 25 '25 10:06 IlyasMoutawwakil

This might come across as nitpicky... but why is the documentation telling me to use a deprecated feature?

"For comparison, letโ€™s run the same function, but enable Flash Attention instead. To do so, we convert the model to BetterTransformer and by doing so enabling PyTorchโ€™s which in turn is able to use Flash Attention. model.to_bettertransformer()"

Source: https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization

Klaws-- avatar Sep 10 '25 17:09 Klaws--