text-embeddings-inference
Error occurs when using ONNX model with text-embeddings-inference turing image
System Info
Environment: offline and air-gapped
OS version: rhel8.19
Model: bge-m3
Hardware: NVIDIA T4 GPU
Deployment: Kubernetes (kserve)
Current version: turing-1.6
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [x] My own modifications
Reproduction
- When I serve the BGE-M3 model using kserve with the turing-1.6 image and pytorch_model.bin, it works normally (image: turing-1.6, weights: pytorch_model.bin).
YAML:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gembed
  namespace: kserve
spec:
  predictor:
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - args:
          - '--model-id'
          - /data
        env:
          - name: HUGGINGFACE_HUB_CACHE
            value: /data
        image: ghcr.io/huggingface/text-embeddings-inference:turing-latest
        imagePullPolicy: IfNotPresent
        name: kserve-container
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            cpu: '1'
            memory: 4Gi
            nvidia.com/gpu: '1'
          requests:
            cpu: '1'
            memory: 1Gi
            nvidia.com/gpu: '1'
        volumeMounts:
          - name: gembed-onnx-volume
            mountPath: /data
    maxReplicas: 1
    minReplicas: 1
    volumes:
      - name: gembed-onnx-volume
        persistentVolumeClaim:
          claimName: gembed-onnx-pv-claim
```
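For reference, this is roughly how the service is applied and watched (a sketch only; the manifest filename is a placeholder, and the pod label selector is the standard one KServe sets on predictor pods):

```bash
# Apply the InferenceService manifest shown above (filename is assumed).
kubectl apply -f inferenceservice.yaml -n kserve

# Watch the predictor pod for the "gembed" InferenceService come up.
kubectl get pods -n kserve -l serving.kserve.io/inferenceservice=gembed -w
```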
- However, when I switch to the ONNX model, I get the following error:
k logs -f gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -n kserve
2025-03-27T11:58:40.694775Z INFO text_embeddings_router: router/src/main.rs:185: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T11:58:40.698252Z WARN text_embeddings_router: router/src/lib.rs:403: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T11:58:41.365892Z WARN text_embeddings_router: router/src/lib.rs:188: Could not find a Sentence Transformers config
2025-03-27T11:58:41.365911Z INFO text_embeddings_router: router/src/lib.rs:192: Maximum number of tokens per request: 8192
2025-03-27T11:58:41.366116Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T11:58:41.864311Z INFO text_embeddings_router: router/src/lib.rs:234: Starting model backend
2025-03-27T11:58:41.865066Z ERROR text_embeddings_backend: backends/src/lib.rs:388: Could not start Candle backend: Could not start backend: No such file or directory (os error 2)
Error: Could not create backend
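Since the failure is a plain "No such file or directory", a quick check is to list what the container actually sees in the mounted model directory (pod name taken from the log above; the expected file layout in the comments is my assumption, not confirmed):

```bash
# List the model files visible to the container at /data.
kubectl exec -n kserve gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -- ls -R /data

# Assumption: the ONNX export lives under /data/onnx/model.onnx, while the
# Candle backend used by the GPU images (see "Could not start Candle backend"
# above) looks for model.safetensors or pytorch_model.bin at the top level,
# which could explain the missing-file error when only ONNX weights are present.
```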
- When I change the image to cpu-1.6 and test it, it works normally:
2025-03-27T14:11:33.231208Z INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "gembed-predictor-00001-deployment-56ccb599cf-gzjp8", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T14:11:33.237362Z WARN text_embeddings_router: router/src/lib.rs:392: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T14:11:33.897769Z WARN text_embeddings_router: router/src/lib.rs:184: Could not find a Sentence Transformers config
2025-03-27T14:11:33.897784Z INFO text_embeddings_router: router/src/lib.rs:188: Maximum number of tokens per request: 8192
2025-03-27T14:11:33.898748Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T14:11:34.405665Z INFO text_embeddings_router: router/src/lib.rs:230: Starting model backend
2025-03-27T14:11:40.755400Z WARN text_embeddings_router: router/src/lib.rs:258: Backend does not support a batch size > 8
2025-03-27T14:11:40.755416Z WARN text_embeddings_router: router/src/lib.rs:259: forcing max_batch_requests=8
2025-03-27T14:11:40.755519Z WARN text_embeddings_router: router/src/lib.rs:310: Invalid hostname, defaulting to 0.0.0.0
2025-03-27T14:11:40.757444Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1812: Starting HTTP server: 0.0.0.0:8080
2025-03-27T14:11:40.757456Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1813: Ready
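To confirm the cpu-1.6 deployment really serves embeddings, the /embed endpoint can be queried (port-forward and request shown as a sketch, pod name from the CPU run above):

```bash
# Forward the predictor's HTTP port to localhost.
kubectl port-forward -n kserve gembed-predictor-00001-deployment-56ccb599cf-gzjp8 8080:8080 &

# Request an embedding from the text-embeddings-inference /embed endpoint.
curl 127.0.0.1:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?"}'
```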
- When I test with the turing-latest image, I get the same error as with turing-1.6
Expected behavior
I'm not sure whether the issue is with the Turing image or with my configuration. I would expect the ONNX model to load on the turing-1.6 image the same way it does on the cpu-1.6 image.