
Could not start backend: cannot find tensor embeddings.word_embeddings.weight

momomobinx opened this issue 7 months ago · 10 comments

System Info

docker

docker run \
        -d \
        --name reranker \
        --gpus '"device=0"' \
        --env CUDA_VISIBLE_DEVICES=0 \
        -p 7863:80 \
        -v /data/ai/models:/data \
        ghcr.io/huggingface/text-embeddings-inference:86-1.5 \
        --model-id "/data/bge-reranker-base" \
        --dtype "float16" \
        --max-concurrent-requests 2048 \
        --max-batch-tokens 32768000 \
        --max-batch-requests 128 \
        --max-client-batch-size 4096 \
        --auto-truncate \
        --tokenization-workers 64 \
        --payload-limit 16000000
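
For reference, the same deployment can be expressed as a compose file. This is a sketch derived from the flags above; the service name and file layout are illustrative, not part of the original report.

```yaml
# Sketch only: docker-compose equivalent of the `docker run` command above.
services:
  reranker:
    image: ghcr.io/huggingface/text-embeddings-inference:86-1.5
    ports:
      - "7863:80"
    volumes:
      - /data/ai/models:/data
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command:
      - --model-id=/data/bge-reranker-base
      - --dtype=float16
      - --max-concurrent-requests=2048
      - --max-batch-tokens=32768000
      - --max-batch-requests=128
      - --max-client-batch-size=4096
      - --auto-truncate
      - --tokenization-workers=64
      - --payload-limit=16000000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
```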

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:5E:00.0 Off |                  N/A |
| 42%   22C    P8             17W /  350W |   24237MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

docker run \
        -d \
        --name reranker \
        --gpus '"device=0"' \
        --env CUDA_VISIBLE_DEVICES=0 \
        -p 7863:80 \
        -v /data/ai/models:/data \
        ghcr.io/huggingface/text-embeddings-inference:86-1.5 \
        --model-id "/data/bge-reranker-base" \
        --dtype "float16" \
        --max-concurrent-requests 2048 \
        --max-batch-tokens 32768000 \
        --max-batch-requests 128 \
        --max-client-batch-size 4096 \
        --auto-truncate \
        --tokenization-workers 64 \
        --payload-limit 16000000

Expected behavior

The server had been running normally. After I hit a context-too-long error, I could no longer restart the model successfully.
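
One way to narrow down the "cannot find tensor embeddings.word_embeddings.weight" error is to check whether that tensor is actually present in the local weights file, e.g. if the download was truncated or the file was corrupted. The safetensors format starts with an 8-byte little-endian header length followed by a JSON header listing every tensor, so the names can be read with the standard library alone. A minimal sketch; the demo file written here is a stand-in, and against the real model you would point `path` at the model's `model.safetensors` under `/data/ai/models`.

```python
import json
import struct

def tensor_names(path):
    """Return the tensor names stored in a .safetensors file by reading its JSON header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte LE header size
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]

# Build a tiny stand-in file so this sketch is runnable as-is; with the real
# model, skip this and pass the actual model.safetensors path instead.
path = "demo.safetensors"
header = json.dumps({
    "embeddings.word_embeddings.weight": {
        "dtype": "F32", "shape": [2], "data_offsets": [0, 8]},
}).encode()
with open(path, "wb") as f:
    f.write(struct.pack("<Q", len(header)) + header + b"\x00" * 8)

# The tensor the TEI backend reports as missing:
print("embeddings.word_embeddings.weight" in tensor_names(path))
```

If the tensor is absent from the real file, re-downloading the model into `/data/ai/models` is the likely fix; if it is present, the problem lies elsewhere (e.g. the backend selected for the model).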

momomobinx avatar Mar 26 '25 07:03 momomobinx