mixedbread-ai/mxbai-embed-large-v1 model deployment problem
System Info
- Infinity latest CPU image (`michaelf34/infinity:latest-cpu`).
Information
- [x] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [ ] An officially supported CLI command
- [ ] My own modifications
Reproduction
I used the Infinity CPU image to deploy the mixedbread-ai/mxbai-embed-large-v1 model. Although this model already supports ONNX, I can't deploy it with the optimum engine. From the logs, everything looks fine up until the point where the server (FastAPI application) should launch, but it never does. My deployment logs are attached below.
docker run -it \
> -v $volume:/app/.cache \
> -p $port:$port \
> michaelf34/infinity:latest-cpu \
> v2 \
> --engine optimum \
> --model-id $model1 \
> --port $port
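For reproducibility, the placeholders in the command above can be filled in like this (the concrete volume path and port are assumptions; substitute your own values):

```shell
# Hypothetical concrete values for the $volume, $port and $model1 placeholders
volume="$PWD/infinity-cache"
port=7997
model1="mixedbread-ai/mxbai-embed-large-v1"

# Print the assembled command first so it can be checked before running
echo docker run -it \
  -v "$volume:/app/.cache" \
  -p "$port:$port" \
  michaelf34/infinity:latest-cpu \
  v2 --engine optimum --model-id "$model1" --port "$port"
```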
Unable to find image 'michaelf34/infinity:latest-cpu' locally
latest-cpu: Pulling from michaelf34/infinity
6414378b6477: Already exists
e9e8b1fb810f: Pull complete
ef3f4bb329e2: Pull complete
9ec2b46f1868: Pull complete
25bdf87e7cbe: Pull complete
Digest: sha256:791e6b8a4eab6ed1bdea40358f6ce43cde90824405101d62b41aaf516fb46f54
Status: Downloaded newer image for michaelf34/infinity:latest-cpu
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2025-03-04 02:29:03,350 infinity_emb INFO: Creating 1engines: engines=['mixedbread-ai/mxbai-embed-large-v1'] infinity_server.py:84
INFO 2025-03-04 02:29:03,355 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable `DO_NOT_TRACK=1`. telemetry.py:30
INFO 2025-03-04 02:29:03,368 infinity_emb INFO: model=`mixedbread-ai/mxbai-embed-large-v1` selected, using engine=`optimum` and device=`None` select_model.py:64
INFO 2025-03-04 02:29:03,564 infinity_emb INFO: Found 3 onnx files: [PosixPath('onnx/model.onnx'), PosixPath('onnx/model_fp16.onnx'), utils_optimum.py:244
PosixPath('onnx/model_quantized.onnx')]
INFO 2025-03-04 02:29:03,566 infinity_emb INFO: Using onnx/model_quantized.onnx as the model utils_optimum.py:248
INFO 2025-03-04 02:29:03,572 infinity_emb INFO: files_optimized: [] utils_optimum.py:146
The ONNX file onnx/model_quantized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
INFO 2025-03-04 02:29:08,144 infinity_emb INFO: Optimizing model utils_optimum.py:168
/app/.venv/lib/python3.11/site-packages/optimum/onnxruntime/configuration.py:779: FutureWarning: optimize_with_onnxruntime_only will be deprecated soon, use enable_transformers_specific_optimizations instead, enable_transformers_specific_optimizations is set to True.
warnings.warn(
/app/.venv/lib/python3.11/site-packages/optimum/onnxruntime/configuration.py:779: FutureWarning: disable_embed_layer_norm will be deprecated soon, use disable_embed_layer_norm_fusion instead, disable_embed_layer_norm_fusion is set to True.
warnings.warn(
2025-03-04 02:29:12.502006450 [W:onnxruntime:, inference_session.cc:2048 Initialize] Serializing optimized model with Graph Optimization level greater than ORT_ENABLE_EXTENDED and the NchwcTransformer enabled. The generated model may contain hardware specific optimizations, and should only be used in the same environment the model was optimized in.
WARNING 2025-03-04 02:29:13,677 onnxruntime.transformers.optimizer WARNING: Model producer not matched: Expected "pytorch", Got "onnx.quantize".Please specify correct optimizer.py:250
--model_type parameter.
WARNING 2025-03-04 02:29:14,630 onnx_model WARNING: shape inference failed which might impact useless cast node detection. onnx_model.py:670
WARNING 2025-03-04 02:29:15,370 fusion_skiplayernorm WARNING: symbolic shape inference disabled or failed. fusion_skiplayernorm.py:35
WARNING 2025-03-04 02:29:16,059 fusion_skiplayernorm WARNING: symbolic shape inference disabled or failed. fusion_skiplayernorm.py:35
INFO 2025-03-04 02:29:17,359 onnx_model_bert INFO: opset version: 11 onnx_model_bert.py:405
INFO 2025-03-04 02:29:17,417 onnx_model INFO: Sort graphs in topological order onnx_model.py:1222
INFO 2025-03-04 02:29:18,874 onnx_model INFO: Model saved to onnx_model.py:1230
/app/.cache/huggingface/hub/infinity_onnx/OpenVINOExecutionProvider/mixedbread-ai/mxbai-embed-large-v1/model_quantized_optimized.onnx
The ONNX file model_quantized_optimized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
2025-03-04 02:29:24.002689369 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-04 02:29:24.002864975 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
INFO 2025-03-04 02:29:26,671 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=3 select_model.py:97
0.93 ms tokenization
292.85 ms inference
0.25 ms post-processing
294.03 ms total
embeddings/sec: 108.83
The same issue occurs with mixedbread-ai/mxbai-rerank-base-v1; however, mixedbread-ai/mxbai-rerank-xsmall-v1 seems to work fine.
This looks like running out of CPU memory to me.
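A rough back-of-envelope check of that hypothesis, assuming mxbai-embed-large-v1 is a BERT-large-sized model with roughly 335M parameters (the parameter count is taken from the model card, not from the logs above):

```python
# Rough memory estimate for a ~335M-parameter embedding model (assumed size).
# During graph optimization, onnxruntime may hold the original graph, the
# optimized graph, and its arena allocations in memory at once, so peak
# usage can be a multiple of the raw weight size.
params = 335_000_000
fp32_gb = params * 4 / 1e9   # full-precision weights: 4 bytes/param
int8_mb = params * 1 / 1e6   # int8-quantized weights: 1 byte/param

print(f"fp32 weights: ~{fp32_gb:.1f} GB")  # ~1.3 GB
print(f"int8 weights: ~{int8_mb:.0f} MB")  # ~335 MB
```

If the container runs under a memory limit below the peak, the process can be killed during startup without a traceback, which would match the logs stopping right after the timing benchmark.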