mixedbread-ai/mxbai-embed-large-v1 model deployment problem
System Info
- Infinity latest CPU image (`michaelf34/infinity:latest-cpu`).
Information
- [x] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [ ] An officially supported CLI command
- [ ] My own modifications
Reproduction
I used the Infinity CPU image to deploy the mixedbread-ai/mxbai-embed-large-v1 model. Although this model already supports ONNX, I can't deploy it with the optimum engine. From the logs, everything looks fine up until the point where the server (FastAPI application) should launch, but it never does. My deployment logs are attached below.
docker run -it \
> -v $volume:/app/.cache \
> -p $port:$port \
> michaelf34/infinity:latest-cpu \
> v2 \
> --engine optimum \
> --model-id $model1 \
> --port $port
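For reproducibility, the placeholders in the command above can be filled in like this (the concrete volume path and port are assumptions; substitute your own values):

```shell
# Hypothetical concrete values for the $volume, $port and $model1 placeholders
volume="$PWD/infinity-cache"
port=7997
model1="mixedbread-ai/mxbai-embed-large-v1"

# Print the assembled command first so it can be checked before running
echo docker run -it \
  -v "$volume:/app/.cache" \
  -p "$port:$port" \
  michaelf34/infinity:latest-cpu \
  v2 --engine optimum --model-id "$model1" --port "$port"
```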
Unable to find image 'michaelf34/infinity:latest-cpu' locally
latest-cpu: Pulling from michaelf34/infinity
6414378b6477: Already exists
e9e8b1fb810f: Pull complete
ef3f4bb329e2: Pull complete
9ec2b46f1868: Pull complete
25bdf87e7cbe: Pull complete
Digest: sha256:791e6b8a4eab6ed1bdea40358f6ce43cde90824405101d62b41aaf516fb46f54
Status: Downloaded newer image for michaelf34/infinity:latest-cpu
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2025-03-04 02:29:03,350 infinity_emb INFO: Creating 1engines: engines=['mixedbread-ai/mxbai-embed-large-v1'] infinity_server.py:84
INFO 2025-03-04 02:29:03,355 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable `DO_NOT_TRACK=1`. telemetry.py:30
INFO 2025-03-04 02:29:03,368 infinity_emb INFO: model=`mixedbread-ai/mxbai-embed-large-v1` selected, using engine=`optimum` and device=`None` select_model.py:64
INFO 2025-03-04 02:29:03,564 infinity_emb INFO: Found 3 onnx files: [PosixPath('onnx/model.onnx'), PosixPath('onnx/model_fp16.onnx'), utils_optimum.py:244
PosixPath('onnx/model_quantized.onnx')]
INFO 2025-03-04 02:29:03,566 infinity_emb INFO: Using onnx/model_quantized.onnx as the model utils_optimum.py:248
INFO 2025-03-04 02:29:03,572 infinity_emb INFO: files_optimized: [] utils_optimum.py:146
The ONNX file onnx/model_quantized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
INFO 2025-03-04 02:29:08,144 infinity_emb INFO: Optimizing model utils_optimum.py:168
/app/.venv/lib/python3.11/site-packages/optimum/onnxruntime/configuration.py:779: FutureWarning: optimize_with_onnxruntime_only will be deprecated soon, use enable_transformers_specific_optimizations instead, enable_transformers_specific_optimizations is set to True.
warnings.warn(
/app/.venv/lib/python3.11/site-packages/optimum/onnxruntime/configuration.py:779: FutureWarning: disable_embed_layer_norm will be deprecated soon, use disable_embed_layer_norm_fusion instead, disable_embed_layer_norm_fusion is set to True.
warnings.warn(
2025-03-04 02:29:12.502006450 [W:onnxruntime:, inference_session.cc:2048 Initialize] Serializing optimized model with Graph Optimization level greater than ORT_ENABLE_EXTENDED and the NchwcTransformer enabled. The generated model may contain hardware specific optimizations, and should only be used in the same environment the model was optimized in.
WARNING 2025-03-04 02:29:13,677 onnxruntime.transformers.optimizer WARNING: Model producer not matched: Expected "pytorch", Got "onnx.quantize".Please specify correct optimizer.py:250
--model_type parameter.
WARNING 2025-03-04 02:29:14,630 onnx_model WARNING: shape inference failed which might impact useless cast node detection. onnx_model.py:670
WARNING 2025-03-04 02:29:15,370 fusion_skiplayernorm WARNING: symbolic shape inference disabled or failed. fusion_skiplayernorm.py:35
WARNING 2025-03-04 02:29:16,059 fusion_skiplayernorm WARNING: symbolic shape inference disabled or failed. fusion_skiplayernorm.py:35
INFO 2025-03-04 02:29:17,359 onnx_model_bert INFO: opset version: 11 onnx_model_bert.py:405
INFO 2025-03-04 02:29:17,417 onnx_model INFO: Sort graphs in topological order onnx_model.py:1222
INFO 2025-03-04 02:29:18,874 onnx_model INFO: Model saved to onnx_model.py:1230
/app/.cache/huggingface/hub/infinity_onnx/OpenVINOExecutionProvider/mixedbread-ai/mxbai-embed-large-v1/model_quantized_optimized.onnx
The ONNX file model_quantized_optimized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
2025-03-04 02:29:24.002689369 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-04 02:29:24.002864975 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
INFO 2025-03-04 02:29:26,671 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=3 select_model.py:97
0.93 ms tokenization
292.85 ms inference
0.25 ms post-processing
294.03 ms total
embeddings/sec: 108.83
The same issue occurs with mixedbread-ai/mxbai-rerank-base-v1; however, mixedbread-ai/mxbai-rerank-xsmall-v1 seems to work fine.
This looks like running out of CPU memory to me.
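A rough back-of-envelope check of that hypothesis, assuming mxbai-embed-large-v1 is a BERT-large-sized model with roughly 335M parameters (the parameter count is taken from the model card, not from the logs above):

```python
# Rough memory estimate for a ~335M-parameter embedding model (assumed size).
# During graph optimization, onnxruntime may hold the original graph, the
# optimized graph, and its arena allocations in memory at once, so peak
# usage can be a multiple of the raw weight size.
params = 335_000_000
fp32_gb = params * 4 / 1e9   # full-precision weights: 4 bytes/param
int8_mb = params * 1 / 1e6   # int8-quantized weights: 1 byte/param

print(f"fp32 weights: ~{fp32_gb:.1f} GB")  # ~1.3 GB
print(f"int8 weights: ~{int8_mb:.0f} MB")  # ~335 MB
```

If the container runs under a memory limit below the peak, the process can be killed during startup without a traceback, which would match the logs stopping right after the timing benchmark.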