### System Info
```sh
docker run -it --gpus all -v E:\infinity:/models -p 8081:8081 michaelf34/infinity:latest v2 --model-id "/models/jinaai/jina-reranker-v2-base-multilingual" --port 8081
```

```
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2025-05-21 02:32:46,009 infinity_emb INFO: Creating 1engines: infinity_server.py:84
engines=['jinaai/jina-reranker-v2-base-multilingual']
INFO 2025-05-21 02:32:46,024 infinity_emb INFO: Anonymized telemetry can be disabled via environment telemetry.py:30
variable DO_NOT_TRACK=1.
INFO 2025-05-21 02:32:46,054 infinity_emb INFO: select_model.py:64
model=/models/jinaai/jina-reranker-v2-base-multilingual selected, using engine=torch and
device=None
/app/.venv/lib/python3.10/site-packages/flash_attn/ops/triton/layer_norm.py:985: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(
/app/.venv/lib/python3.10/site-packages/flash_attn/ops/triton/layer_norm.py:1044: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, dout, *args):
ERROR: Traceback (most recent call last):
File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 693, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/app/infinity_emb/infinity_server.py", line 88, in lifespan
app.engine_array = AsyncEngineArray.from_args(engine_args_list) # type: ignore
File "/app/infinity_emb/engine.py", line 306, in from_args
return cls(engines=tuple(engines))
File "/app/infinity_emb/engine.py", line 71, in from_args
engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
File "/app/infinity_emb/engine.py", line 56, in init
self._model_replicas, self._min_inference_t, self._max_inference_t = select_model(
File "/app/infinity_emb/inference/select_model.py", line 88, in select_model
min(loaded_engine.warmup(batch_size=1, n_tokens=1)[1] for _ in range(10)),
File "/app/infinity_emb/inference/select_model.py", line 88, in
min(loaded_engine.warmup(batch_size=1, n_tokens=1)[1] for _ in range(10)),
File "/app/infinity_emb/transformer/abstract.py", line 220, in warmup
return run_warmup(self, inp)
File "/app/infinity_emb/transformer/abstract.py", line 233, in run_warmup
embed = model.encode_core(feat)
File "/app/infinity_emb/transformer/crossencoder/torch.py", line 106, in encode_core
out_features = self.model(**features, return_dict=True)["logits"]
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/app/.cache/huggingface/modules/transformers_modules/jina-reranker-v2-base-multilingual/modeling_xlm_roberta.py", line 854, in forward
outputs = self.roberta(
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/app/.cache/huggingface/modules/transformers_modules/jina-reranker-v2-base-multilingual/modeling_xlm_roberta.py", line 664, in forward
sequence_output = self.encoder(
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/app/.cache/huggingface/modules/transformers_modules/jina-reranker-v2-base-multilingual/modeling_xlm_roberta.py", line 231, in forward
hidden_states = layer(hidden_states, mixer_kwargs=mixer_kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/app/.cache/huggingface/modules/transformers_modules/jina-reranker-v2-base-multilingual/block.py", line 260, in forward
mixer_out = self.mixer(
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/app/.cache/huggingface/modules/transformers_modules/jina-reranker-v2-base-multilingual/mha.py", line 605, in forward
context = self.inner_attn(qkv, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/app/.cache/huggingface/modules/transformers_modules/jina-reranker-v2-base-multilingual/mha.py", line 84, in forward
return flash_attn_varlen_qkvpacked_func(
File "/app/.venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 1267, in flash_attn_varlen_qkvpacked_func
return FlashAttnVarlenQKVPackedFunc.apply(
File "/app/.venv/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/app/.venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 553, in forward
out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
File "/app/.venv/lib/python3.10/site-packages/torch/_ops.py", line 1123, in call
return self._op(*args, **(kwargs or {}))
File "/app/.venv/lib/python3.10/site-packages/torch/_library/autograd.py", line 113, in autograd_impl
result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
File "/app/.venv/lib/python3.10/site-packages/torch/_library/autograd.py", line 40, in forward_no_grad
result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/_ops.py", line 728, in redispatch
return self._handle.redispatch_boxed(keyset, *args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 305, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
return fn(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 337, in wrapped_fn
return fn(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 170, in _flash_attn_varlen_forward
out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
```
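For context: FlashAttention 2 only runs on GPUs with CUDA compute capability 8.0 (Ampere) or newer. A minimal check of the local card, in plain PyTorch (nothing Infinity-specific), confirms whether this machine falls below that bar:

```python
import torch

# FlashAttention 2 requires CUDA compute capability >= 8.0 (Ampere or newer).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    print("FlashAttention-compatible:", (major, minor) >= (8, 0))
else:
    print("No CUDA device visible to PyTorch")
```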
### Information
- [x] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
### Tasks
- [ ] An officially supported CLI command
- [ ] My own modifications
### Reproduction

```sh
docker run -it --gpus all -v E:\infinity:/models -p 8081:8081 michaelf34/infinity:latest v2 --no-bettertransformer --model-id "/models/jinaai/jina-reranker-v2-base-multilingual" --port 8081
```
Adding `--no-bettertransformer` does not resolve the error. That matches the traceback: the model's custom `modeling_xlm_roberta.py` calls `flash_attn_varlen_qkvpacked_func` directly, so the BetterTransformer toggle never touches that code path.
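If the GPU really is pre-Ampere, one possible workaround (a sketch, assuming the v2 CLI's `--device` option behaves as documented and that the CPU path never reaches the flash-attn kernel) is to force CPU execution:

```sh
# Hypothetical workaround: run on CPU so the flash-attn CUDA kernel is never invoked.
# Slower, but sidesteps the Ampere requirement entirely.
docker run -it -v E:\infinity:/models -p 8081:8081 michaelf34/infinity:latest v2 --model-id "/models/jinaai/jina-reranker-v2-base-multilingual" --device cpu --port 8081
```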