Python engine fails with BGE-VL-base and the optimum engine
System Info
INFO 2025-03-13 13:02:38,529 datasets INFO: PyTorch version 2.4.1 available. config.py:54
Usage: infinity_emb v2 [OPTIONS]
Infinity API ♾️ cli v2. MIT License. Copyright (c) 2023-now Michael Feil
Multiple Model CLI Playbook:
- cli options can be overloaded i.e. v2 --model-id model/id1 --model-id model/id2 --batch-size 8 --batch-size 4
- or adapt the defaults by setting ENV Variables separated by ;: INFINITY_MODEL_ID="model/id1;model/id2;" && INFINITY_BATCH_SIZE="8;4;"
- single items are broadcasted to --model-id length, making v2 --model-id model/id1 --model-id model/id2 --batch-size 8 give both models batch-size 8.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --model-id TEXT Huggingface model repo id. Subset of possible models: │
│ https://huggingface.co/models?other=text-embeddings-infere… │
│ [env var: INFINITY_MODEL_ID] │
│ [default: michaelfeil/bge-small-en-v1.5] │
│ --served-model-name TEXT the nickname for the API, under which the model_id can be │
│ selected │
│ [env var: INFINITY_SERVED_MODEL_NAME] │
│ --batch-size INTEGER maximum batch size for inference │
│ [env var: INFINITY_BATCH_SIZE] │
│ [default: 32] │
│ --revision TEXT huggingface model repo revision. │
│ [env var: INFINITY_REVISION] │
│ --trust-remote-code --no-trust-remote-code if potential remote modeling code from huggingface repo is │
│ trusted. │
│ [env var: INFINITY_TRUST_REMOTE_CODE] │
│ [default: trust-remote-code] │
│ --engine [torch|ctranslate2|optimum|neuron|debugengine] Which backend to use. torch uses Pytorch GPU/CPU, optimum │
│ uses ONNX on GPU/CPU/NVIDIA-TensorRT, CTranslate2 uses │
│ torch+ctranslate2 on CPU/GPU. │
│ [env var: INFINITY_ENGINE] │
│ [default: torch] │
│ --model-warmup --no-model-warmup if model should be warmed up after startup, and before │
│ ready. │
│ [env var: INFINITY_MODEL_WARMUP] │
│ [default: model-warmup] │
│ --vector-disk-cache --no-vector-disk-cache If hash(request)/results should be cached to SQLite for │
│ latency improvement. │
│ [env var: INFINITY_VECTOR_DISK_CACHE] │
│ [default: vector-disk-cache] │
│ --device [cpu|cuda|mps|tensorrt|auto] device to use for computing the model forward pass. │
│ [env var: INFINITY_DEVICE] │
│ [default: auto] │
│ --device-id TEXT device id defines the model placement. e.g. 0,1 will │
│ place the model on MPS/CUDA/GPU 0 and 1 each │
│ [env var: INFINITY_DEVICE_ID] │
│ --lengths-via-tokenize --no-lengths-via-tokenize if True, returned tokens is based on actual tokenizer │
│ count. If false, uses len(input) as proxy. │
│ [env var: INFINITY_LENGTHS_VIA_TOKENIZE] │
│ [default: lengths-via-tokenize] │
│ --dtype [float32|float16|bfloat16|int8|fp8|auto] dtype for the model weights. [env var: INFINITY_DTYPE] │
│ [default: auto] │
│ --embedding-dtype [float32|int8|uint8|binary|ubinary] dtype post-forward pass. If != float32, using │
│ Post-Forward Static quantization. │
│ [env var: INFINITY_EMBEDDING_DTYPE] │
│ [default: float32] │
│ --pooling-method [mean|cls|auto] overwrite the pooling method if inferred incorrectly. │
│ [env var: INFINITY_POOLING_METHOD] │
│ [default: auto] │
│ --compile --no-compile Enable usage of torch.compile(dynamic=True) if engine │
│ relies on it. │
│ [env var: INFINITY_COMPILE] │
│ [default: compile] │
│ --bettertransformer --no-bettertransformer Enables varlen flash-attention-2 via the │
│ BetterTransformer implementation. If available for this │
│ model. │
│ [env var: INFINITY_BETTERTRANSFORMER] │
│ [default: bettertransformer] │
│ --preload-only --no-preload-only If true, only downloads models and verifies setup, then │
│ exit. Recommended for pre-caching the download in a │
│ Dockerfile. │
│ [env var: INFINITY_PRELOAD_ONLY] │
│ [default: no-preload-only] │
│ --host TEXT host for the FastAPI uvicorn server │
│ [env var: INFINITY_HOST] │
│ [default: 0.0.0.0] │
│ --port INTEGER port for the FastAPI uvicorn server │
│ [env var: INFINITY_PORT] │
│ [default: 7997] │
│ --url-prefix TEXT prefix for all routes of the FastAPI uvicorn server. Useful │
│ if you run behind a proxy / cascaded API. │
│ [env var: INFINITY_URL_PREFIX] │
│ --redirect-slash TEXT where to redirect / requests to. │
│ [env var: INFINITY_REDIRECT_SLASH] │
│ [default: /docs] │
│ --log-level [critical|error|warning|info|debug|trace] console log level. [env var: INFINITY_LOG_LEVEL] │
│ [default: info] │
│ --permissive-cors --no-permissive-cors whether to allow permissive cors. │
│ [env var: INFINITY_PERMISSIVE_CORS] │
│ [default: no-permissive-cors] │
│ --api-key TEXT api_key used for authentication headers. │
│ [env var: INFINITY_API_KEY] │
│ --proxy-root-path TEXT Proxy prefix for the application. See: │
│ https://fastapi.tiangolo.com/advanced/behind-a-proxy/ │
│ [env var: INFINITY_PROXY_ROOT_PATH] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Information
- [ ] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [ ] An officially supported CLI command
- [ ] My own modifications
Reproduction
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]

engine_args = EngineArgs(
    model_name_or_path="/mnt/bn/bs-llm-data/mlx/users/xiongweixuan.xwx/playground/model/BGE-VL-base",
    engine="optimum",
    device="cuda",
)
# fails here with NotImplementedError (see traceback below)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_image, _ = await engine.image_embed(images=images)
    await engine.astop()
    return embeddings, embeddings_image

if __name__ == "__main__":
    embeddings, embeddings_image = asyncio.run(embed(array[engine_args.model_name_or_path]))
    print("embeddings:", embeddings)
    print("embeddings_image:", embeddings_image)
Running this fails with the following error:
NotImplementedError                       Traceback (most recent call last)
Cell In[1], line 11
      5 images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
      6 engine_args = EngineArgs(
      7     model_name_or_path="/mnt/bn/bs-llm-data/mlx/users/xiongweixuan.xwx/playground/model/BGE-VL-base",
      8     engine="optimum",
      9     device="cuda"
     10 )
---> 11 array = AsyncEngineArray.from_args([engine_args])
     13 async def embed(engine: AsyncEmbeddingEngine):
     14     await engine.astart()

File ~/.local/lib/python3.11/site-packages/infinity_emb/engine.py:306, in AsyncEngineArray.from_args(cls, engine_args_array)
    299 """create an engine from EngineArgs
    300
    301 Args:
    302     engine_args_array (list[EngineArgs]): EngineArgs object
    303 """
    304 engines = map(AsyncEmbeddingEngine.from_args, engine_args_array)
--> 306 return cls(engines=tuple(engines))

File ~/.local/lib/python3.11/site-packages/infinity_emb/engine.py:71, in AsyncEmbeddingEngine.from_args(cls, engine_args)
     65 """create an engine from EngineArgs
     66
     67 Args:
     68     engine_args (EngineArgs): EngineArgs object
     69 """
     70 logger.debug("Creating AsyncEmbeddingEngine from %s", engine_args)
---> 71 engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
     73 return engine

File ~/.local/lib/python3.11/site-packages/infinity_emb/engine.py:56, in AsyncEmbeddingEngine.__init__(self, model_name_or_path, _show_deprecation_warning, **kwargs)
     54 self.running = False
     55 self._running_sepamore: Optional[Semaphore] = None
---> 56 self._model_replicas, self._min_inference_t, self._max_inference_t = select_model(
     57     self._engine_args
     58 )

File ~/.local/lib/python3.11/site-packages/infinity_emb/inference/select_model.py:71, in select_model(engine_args)
     64 logger.info(
     65     f"model={engine_args.model_name_or_path} selected, "
     66     f"using engine={engine_args.engine.value}"
     67     f" and device={engine_args.device.resolve()}"
     68 )
     69 # engine_args.update_loading_strategy()
---> 71 unloaded_engine = get_engine_type_from_config(engine_args)
     73 engine_replicas = []
     74 min_inference_t = 4e-3

File ~/.local/lib/python3.11/site-packages/infinity_emb/inference/select_model.py:52, in get_engine_type_from_config(engine_args)
     50     return PredictEngine.from_inference_engine(engine_args.engine)
     51 if config.get("vision_config"):
---> 52     return ImageEmbedEngine.from_inference_engine(engine_args.engine)
     53 if config.get("audio_config") and "clap" in config.get("model_type", "").lower():
     54     return AudioEmbedEngine.from_inference_engine(engine_args.engine)

File ~/.local/lib/python3.11/site-packages/infinity_emb/transformer/utils.py:75, in ImageEmbedEngine.from_inference_engine(engine)
     73     return ImageEmbedEngine.torch
     74 else:
---> 75     raise NotImplementedError(f"ImageEmbedEngine for {engine} not implemented")
NotImplementedError: ImageEmbedEngine for InferenceEngine.optimum not implemented
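From the traceback, ImageEmbedEngine.from_inference_engine only has a torch branch, so I assume the intended fallback for vision models is engine="torch". A minimal sketch of that workaround (identical to the reproduction above except for the engine argument; the local model path is just my setup):

from infinity_emb import AsyncEngineArray, EngineArgs

# same model as above, but engine="torch" instead of "optimum"
engine_args = EngineArgs(
    model_name_or_path="/mnt/bn/bs-llm-data/mlx/users/xiongweixuan.xwx/playground/model/BGE-VL-base",
    engine="torch",  # torch is the only branch ImageEmbedEngine.from_inference_engine accepts
    device="cuda",
)
# this is the call that raised NotImplementedError with engine="optimum"
array = AsyncEngineArray.from_args([engine_args])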
How can I use the Python engine with the ONNX (optimum) or TensorRT backends for this model?
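For comparison, the optimum engine itself appears to be selectable as long as the model config has no vision_config, i.e. for plain text embedding models. A minimal sketch of what I would expect to work (the model id here is only an example, taken from the CLI defaults, not the model I actually need):

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

# a text-only embedding model does not route through ImageEmbedEngine,
# so engine="optimum" (ONNX) should be accepted here
engine_args = EngineArgs(
    model_name_or_path="michaelfeil/bge-small-en-v1.5",  # example text-only model
    engine="optimum",
    device="cuda",
)
array = AsyncEngineArray.from_args([engine_args])

async def embed_text(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=["This is awesome.", "I am bored."])
    await engine.astop()
    return embeddings

if __name__ == "__main__":
    print(asyncio.run(embed_text(array[engine_args.model_name_or_path])))

So the question is specifically about models with a vision_config, like BGE-VL-base.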