[Bug] Multilingual-e5-base Embeddings Issue with llama-embeddings Backend on CUDA 12 Docker (Windows 11)
Problem Description:
I am attempting to deploy the multilingual-e5-base embedding model for local inference on Windows 11 using LocalAI via Docker Compose with NVIDIA GPU acceleration (GTX 1660 SUPER, CUDA 12).
Despite configuring the model via a YAML file and manually placing a compatible GGUF file, I encounter inconsistent behavior depending on how the model is referenced in the API call.
- When calling the embeddings API using the model name specified in the YAML (`multilingual-e5-base`), the request fails with a `backend not found` error, specifically referencing `llama-embeddings`.
- When calling the embeddings API directly using the GGUF filename (`multilingual-e5-base-Q8_0.gguf`), the model loads successfully via the `llama-cpp` backend and utilizes the GPU, but the returned embedding vector is consistently empty (`[]`), with logs indicating `embedding disabled`.
This suggests an issue with the integration or routing of the `llama-embeddings` backend within the CUDA 12 Docker image builds, or potentially a parameter-passing issue when using the underlying `llama-cpp` library directly.
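Not part of the original report, but one way to check whether the `llama-embeddings` gRPC binary is even present in the image is to list the backend assets inside the running container; the path below is taken from the error message and the service name `api` from the compose file shown in the steps:

```bash
# List the gRPC backend binaries LocalAI extracted at startup.
# Path taken from the "backend not found" error; service name "api" from docker-compose.yaml.
docker compose exec api ls -l /tmp/localai/backend_data/backend-assets/grpc
# If llama-embeddings is missing here while llama-cpp is present, the image simply
# does not ship that backend and the YAML's "backend: llama-embeddings" cannot resolve.
```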
Steps to Reproduce:
1. Environment Setup:
   - Operating System: Windows 11
   - Docker Desktop installed and running.
   - NVIDIA GPU: GeForce GTX 1660 SUPER
   - NVIDIA Driver: Compatible with CUDA 12 (logs showed CUDA Version: 12.7).
   - LocalAI deployed using Docker Compose.

2. `docker-compose.yaml` Configuration:
   - Used a standard `docker-compose.yaml` obtained from the LocalAI GitHub repository.
   - Modified `image:` to use CUDA 12 compatible tags (tested `master-cublas-cuda12` and `master-aio-gpu-nvidia-cuda-12`). The logs provided below are from `master-aio-gpu-nvidia-cuda-12`.
   - Added a `deploy:` section for the NVIDIA GPU.
   - Ensured `volumes:` maps `./models` to `/models:cached`.
   - Ensured `environment:` includes `MODELS_PATH=/models` and `DEBUG=true`.
   - Crucially, removed or commented out the default `command:` line.
   - Removed or commented out `DOWNLOAD_MODELS=true`.

   ```yaml
   # Relevant parts of docker-compose.yaml
   services:
     api:
       image: quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12 # Or master-cublas-cuda12
       deploy:
         resources:
           reservations:
             devices:
               - driver: nvidia
                 count: 1
                 capabilities: [gpu]
       ports:
         - 8080:8080
       environment:
         - MODELS_PATH=/models
         - DEBUG=true
         # - DOWNLOAD_MODELS=true # Removed
       volumes:
         - ./models:/models:cached
       # command: # Removed or commented out
       #   - some-model
   ```
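   Not part of the original steps, but a quick sanity check (assuming the compose service is named `api` as above) is to confirm the container actually sees the GPU before debugging the model configuration:

   ```bash
   # Confirm the GTX 1660 SUPER and the CUDA 12 runtime are visible inside the container.
   docker compose exec api nvidia-smi
   ```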
3. Model File and Configuration Setup:
   - Manually downloaded the `multilingual-e5-base-Q8_0.gguf` file from https://huggingface.co/yixuan-chia/multilingual-e5-base-gguf.
   - Created the `./models/` directory in the LocalAI project root.
   - Placed the downloaded `multilingual-e5-base-Q8_0.gguf` file in the `./models/` directory.
   - Created the `multilingual-e5-base.yaml` file in the `./models/` directory with the following content:

   ```yaml
   # ./models/multilingual-e5-base.yaml
   name: multilingual-e5-base
   backend: llama-embeddings # Specify backend
   embeddings: true # Mark as embeddings model
   parameters:
     model: multilingual-e5-base-Q8_0.gguf # File name relative to MODELS_PATH
     n_gpu_layers: -1 # Attempt to offload all layers to GPU
     embedding: true # Explicitly set embedding parameter
     f16: true
   ```
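   Not in the original report, but once the container is up (step 4), the standard OpenAI-compatible model listing can confirm that LocalAI picked up this YAML definition:

   ```bash
   # The YAML-defined name should appear alongside the raw GGUF filename.
   curl http://localhost:8080/v1/models
   ```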
4. Deploy LocalAI:
   - Open PowerShell in the directory containing `docker-compose.yaml`.
   - Run `docker-compose down`.
   - Run `docker-compose pull <selected_image_tag>`.
   - Run `docker-compose up -d`.
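   As an optional convenience (not in the original steps; assumes a bash shell such as WSL or Git Bash), the readiness endpoint mentioned in the next step can be polled before sending any requests:

   ```bash
   # Wait until LocalAI reports ready, then follow its logs.
   until curl -sf http://localhost:8080/readyz > /dev/null; do sleep 2; done
   docker compose logs -f api
   ```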
5. Attempt Embeddings API Calls: Wait for LocalAI to start (check logs or `/readyz`).
   - Attempt 1 (Using YAML name):

     ```powershell
     curl -X POST http://localhost:8080/v1/embeddings `
       -H "Content-Type: application/json" `
       -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' ` # Use YAML name
       -v
     ```
   - Attempt 2 (Using GGUF filename):

     ```powershell
     curl -X POST http://localhost:8080/v1/embeddings `
       -H "Content-Type: application/json" `
       -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' ` # Use GGUF filename
       -v
     ```
     (Note: Adding `"embeddings": true` to the JSON body in Attempt 2 yielded the same result.)
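As a convenience (not in the original report; assumes a bash shell with `jq` available, e.g. WSL or Git Bash), both calls can be scripted so that the empty vector from Attempt 2 and the error from Attempt 1 are visible at a glance:

```bash
# Print the returned embedding dimension (and any error message) for each model reference.
for m in multilingual-e5-base multilingual-e5-base-Q8_0.gguf; do
  curl -s http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d "{\"input\": \"test sentence\", \"model\": \"$m\"}" \
    | jq '{model: .model, dims: (.data[0].embedding | length), error: .error.message}'
done
```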
Expected Behavior:
- Both Attempt 1 and Attempt 2 should return a `200 OK` response with a JSON body containing a `data` array, where each element has a non-empty `embedding` list (the vector).
- Logs should indicate successful loading and use of the model, preferably utilizing the GPU.
Observed Behavior:
- Attempt 1 (Using YAML name): Returns `500 Internal Server Error` with the message `"failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings"`. (See Curl Output 1 below.)
- Attempt 2 (Using GGUF filename): Returns a `200 OK` status, but the `embedding` list in the JSON response is empty (`[]`). (See Curl Output 2 below.) Docker logs show the model is loaded but embedding is disabled.
Environment Information:
- OS: Windows 11
- Docker Desktop Version: (Please specify your version, e.g., 4.29.0)
- GPU: NVIDIA GeForce GTX 1660 SUPER
- NVIDIA Driver Version: (Please specify your driver version)
- CUDA Version (as reported by `nvidia-smi` in logs): 12.7
- LocalAI Docker Image Tags Tested: `quay.io/go-skynet/local-ai:master-cublas-cuda12`, `quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12`, potentially others from `sha-*-cuda12`. All tested tags exhibit the "backend not found" error when using the YAML name.
- LocalAI Version (as reported in logs): `4076ea0` (from the master branch)
Relevant Logs:
- Curl Output 1 (Attempt 1 - calling with YAML name):

  ```text
  (base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
  >> -H "Content-Type: application/json" `
  >> -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' ` # <-- Use YAML name
  {"error":{"code":500,"message":"failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings","type":""}}
  ... (rest of curl -v output showing 500 Internal Server Error) ...
  ```
- Curl Output 2 (Attempt 2 - calling with GGUF filename):

  ```text
  (base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
  >> -H "Content-Type: application/json" `
  >> -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' ` # <-- Use GGUF filename
  {"created":1746090262,"object":"list","id":"a4e28026-95c6-46d5-ad7b-3a3ce87a14e5","model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
  ... (rest of curl -v output showing 200 OK) ...
  ```
  (The output is the same when adding `"embeddings": true` to the request body.)
- Docker Logs (Excerpt showing "backend not found" for YAML name call):

  ```text
  ... (startup logs) ...
  8:59AM INF Preloading models from /models  # LocalAI finds the YAML and GGUF
  Model name: multilingual-e5-base
  8:59AM DBG Model: multilingual-e5-base (config: {... parameters:{model:multilingual-e5-base-Q8_0.gguf ... Backend:llama-embeddings Embeddings:true ...}})  # Correct config loaded
  ... (user sends curl request with model: "multilingual-e5-base") ...
  8:59AM INF BackendLoader starting backend=llama-embeddings modelID=multilingual-e5-base o.model=multilingual-e5-base-Q8_0.gguf  # Attempting to load via backend name
  8:59AM DBG Loading model in memory from file: /models/multilingual-e5-base-Q8_0.gguf  # Attempting to load file
  8:59AM DBG Loading Model multilingual-e5-base with gRPC (file: /models/multilingual-e5-base-Q8_0.gguf) (backend: llama-embeddings): {...}
  8:59AM ERR Server error error="failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings" ip=172.19.0.1 latency=2m22.975112253s method=POST status=500 url=/v1/embeddings  # Backend executable not found
  ...
  ```
- Docker Logs (Excerpt showing model loaded but embedding disabled for GGUF filename call):

  ```text
  ... (user sends curl request with model: "multilingual-e5-base-Q8_0.gguf") ...
  9:04AM DBG Model file loaded: multilingual-e5-base-Q8_0.gguf architecture=bert bosTokenID=0 eosTokenID=2 modelName=  # File identified
  ...
  9:04AM INF Trying to load the model 'multilingual-e5-base-Q8_0.gguf' with the backend '[llama-cpp llama-cpp-fallback ...]'  # Tries multiple backends, including llama-cpp
  9:04AM INF [llama-cpp] Attempting to load
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1660 SUPER) - 5134 MiB free  # GPU detected and used
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_loader: loaded meta data with 35 key-value pairs ... from /models/multilingual-e5-base-Q8_0.gguf (version GGUF V3 (latest))  # GGUF loaded successfully
  ...
  9:04AM INF [llama-cpp] Loads OK  # Model loaded successfully by llama-cpp
  ...
  9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stdout {"timestamp":...,"level":"WARNING","function":"send_embedding","line":1368,"message":"embedding disabled","params.embedding":false}  # Embedding is explicitly disabled
  ...
  9:04AM DBG Response: {"created":...,"object":"list","id":...,"model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{...}}  # Empty embedding returned
  ...
  ```
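Not part of the original logs, but when re-running the reproduction it can help to filter the debug output down to just the routing and embedding lines (assumes the compose service is named `api`):

```bash
# Surface only the log lines relevant to this report.
docker compose logs api 2>&1 | grep -E "backend not found|embedding disabled|BackendLoader|Loads OK"
```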
Additional Context:
- The `text-embedding-ada-002` model, which also uses the `llama-cpp` backend (based on its YAML configuration in LocalAI's AIO image), successfully loads and returns embedding vectors using the same LocalAI Docker image and the `/v1/embeddings` endpoint. This confirms that the core `llama-cpp` library and the general embeddings functionality are working correctly within the container and with the GPU (a possible workaround based on this is sketched after this list).
- This issue seems specific to how the `multilingual-e5-base` model (perhaps due to its architecture being "bert", as shown in the logs, or differences in its GGUF structure) interacts with LocalAI's `llama-embeddings` backend abstraction, or to how parameters (like `embeddings: true`) are passed to `llama-cpp` in different loading scenarios.
- I have tried different CUDA 12 master branch tags (`master-cublas-cuda12`, `master-aio-gpu-nvidia-cuda-12`) and they all exhibit the same "backend not found" error when calling by YAML name.
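Not something from the original report, but since `text-embedding-ada-002` works through `llama-cpp`, one untested idea is to point the model YAML at the `llama-cpp` backend instead of `llama-embeddings`, keeping `embeddings: true`. The sketch below writes that alternative config from a bash shell; adjust paths and shell as needed:

```bash
# Hypothetical workaround: reuse the llama-cpp backend that the working
# text-embedding-ada-002 config relies on, then restart the api service.
cat > ./models/multilingual-e5-base.yaml <<'EOF'
name: multilingual-e5-base
backend: llama-cpp
embeddings: true
parameters:
  model: multilingual-e5-base-Q8_0.gguf
EOF
docker compose restart api
```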
Hopefully this detailed information helps the LocalAI developers diagnose the specific issue in the build or model-loading logic for `llama-embeddings` with this type of model/GGUF.
I'm seeing the same behavior when using `nomic-ai/nomic-embed-text-v1.5`.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.