text-embeddings-inference
Error occurs when using ONNX model with text-embeddings-inference turing image
System Info
Environment: offline and air-gapped
OS version: rhel8.19
Model: bge-m3
Hardware: NVIDIA T4 GPU
Deployment: Kubernetes (kserve)
Current version: turing-1.6
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [x] My own modifications
Reproduction
- When I serve the BGE-M3 model using kserve with the turing-1.6 image and pytorch_model.bin, it works normally (image: turing-1.6, weights: pytorch_model.bin).
YAML:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gembed
  namespace: kserve
spec:
  predictor:
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - args:
          - '--model-id'
          - /data
        env:
          - name: HUGGINGFACE_HUB_CACHE
            value: /data
        image: ghcr.io/huggingface/text-embeddings-inference:turing-latest
        imagePullPolicy: IfNotPresent
        name: kserve-container
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            cpu: '1'
            memory: 4Gi
            nvidia.com/gpu: '1'
          requests:
            cpu: '1'
            memory: 1Gi
            nvidia.com/gpu: '1'
        volumeMounts:
          - name: gembed-onnx-volume
            mountPath: /data
    maxReplicas: 1
    minReplicas: 1
    volumes:
      - name: gembed-onnx-volume
        persistentVolumeClaim:
          claimName: gembed-onnx-pv-claim
```
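For reference, this is roughly how the service is applied and watched (a sketch only; the manifest filename is a placeholder, and the pod label selector is the standard one KServe sets on predictor pods):

```bash
# Apply the InferenceService manifest shown above (filename is assumed).
kubectl apply -f inferenceservice.yaml -n kserve

# Watch the predictor pod for the "gembed" InferenceService come up.
kubectl get pods -n kserve -l serving.kserve.io/inferenceservice=gembed -w
```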
- However, when I switch to the ONNX model, I get the following error:
k logs -f gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -n kserve
2025-03-27T11:58:40.694775Z INFO text_embeddings_router: router/src/main.rs:185: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T11:58:40.698252Z WARN text_embeddings_router: router/src/lib.rs:403: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T11:58:41.365892Z WARN text_embeddings_router: router/src/lib.rs:188: Could not find a Sentence Transformers config
2025-03-27T11:58:41.365911Z INFO text_embeddings_router: router/src/lib.rs:192: Maximum number of tokens per request: 8192
2025-03-27T11:58:41.366116Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T11:58:41.864311Z INFO text_embeddings_router: router/src/lib.rs:234: Starting model backend
2025-03-27T11:58:41.865066Z ERROR text_embeddings_backend: backends/src/lib.rs:388: Could not start Candle backend: Could not start backend: No such file or directory (os error 2)
Error: Could not create backend
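Since the failure is a plain "No such file or directory", a quick check is to list what the container actually sees in the mounted model directory (pod name taken from the log above; the expected file layout in the comments is my assumption, not confirmed):

```bash
# List the model files visible to the container at /data.
kubectl exec -n kserve gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -- ls -R /data

# Assumption: the ONNX export lives under /data/onnx/model.onnx, while the
# Candle backend used by the GPU images (see "Could not start Candle backend"
# above) looks for model.safetensors or pytorch_model.bin at the top level,
# which could explain the missing-file error when only ONNX weights are present.
```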
- When I change the image to cpu-1.6 and test it, it works normally:
2025-03-27T14:11:33.231208Z INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "gembed-predictor-00001-deployment-56ccb599cf-gzjp8", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T14:11:33.237362Z WARN text_embeddings_router: router/src/lib.rs:392: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T14:11:33.897769Z WARN text_embeddings_router: router/src/lib.rs:184: Could not find a Sentence Transformers config
2025-03-27T14:11:33.897784Z INFO text_embeddings_router: router/src/lib.rs:188: Maximum number of tokens per request: 8192
2025-03-27T14:11:33.898748Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T14:11:34.405665Z INFO text_embeddings_router: router/src/lib.rs:230: Starting model backend
2025-03-27T14:11:40.755400Z WARN text_embeddings_router: router/src/lib.rs:258: Backend does not support a batch size > 8
2025-03-27T14:11:40.755416Z WARN text_embeddings_router: router/src/lib.rs:259: forcing max_batch_requests=8
2025-03-27T14:11:40.755519Z WARN text_embeddings_router: router/src/lib.rs:310: Invalid hostname, defaulting to 0.0.0.0
2025-03-27T14:11:40.757444Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1812: Starting HTTP server: 0.0.0.0:8080
2025-03-27T14:11:40.757456Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1813: Ready
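To confirm the cpu-1.6 deployment really serves embeddings, the /embed endpoint can be queried (port-forward and request shown as a sketch, pod name from the CPU run above):

```bash
# Forward the predictor's HTTP port to localhost.
kubectl port-forward -n kserve gembed-predictor-00001-deployment-56ccb599cf-gzjp8 8080:8080 &

# Request an embedding from the text-embeddings-inference /embed endpoint.
curl 127.0.0.1:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?"}'
```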
- When I test with the turing-latest image, I get the same error as with turing-1.6
Expected behavior
I'm not sure whether the issue is with the Turing image or with my configuration. I would expect the ONNX model to load on the turing-1.6 image the same way it does on the cpu-1.6 image.