AIO - memory issue - embedding
LocalAI version: container image AIO CUDA 12 (latest)
Environment, CPU architecture, OS, and Version: Ubuntu 22.04 VM (latest), NVIDIA GeForce RTX 2060
Describe the bug: CUDA out-of-memory errors occur when switching between models and testing multiple prompts. The error appears for embeddings, while image generation works fine.
curl http://linuxmain.local:8445/embeddings \
-X POST -H "Content-Type: application/json" \
-d '{
"input": "Your text string goes here",
"model": "text-embedding-ada-002"
}'
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=OutOfMemoryError('CUDA out of memory. Tried to allocate 46.00 MiB. GPU 0 has a total capacty of 5.62 GiB of which 55.50 MiB is free. Process 46 has 0 bytes memory in use. Process 52 has 0 bytes memory in use. Process 122 has 0 bytes memory in use. Process 158 has 0 bytes memory in use. Process 223 has 0 bytes memory in use. Process 303 has 0 bytes memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'), type(err)=\u003cclass 'torch.cuda.OutOfMemoryError'\u003e","type":""}}
To Reproduce: run all the curl tests published in the documentation.
Expected behavior: no error; old models are evicted when memory pressure is too high.
Logs
localai-docker-api-1 | curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
localai-docker-api-1 | "input": "Your text string goes here",
localai-docker-api-1 | "model": "text-embedding-ada-002"
localai-docker-api-1 | }'}
localai-docker-api-1 | 8:36AM INF Loading model 'all-MiniLM-L6-v2' with backend sentencetransformers
localai-docker-api-1 | 8:36AM DBG Loading model in memory from file: /build/models/all-MiniLM-L6-v2
localai-docker-api-1 | 8:36AM DBG Loading Model all-MiniLM-L6-v2 with gRPC (file: /build/models/all-MiniLM-L6-v2) (backend: sentencetransformers): {backendString:sentencetransformers model:all-MiniLM-L6-v2 threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc001c7ee00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
localai-docker-api-1 | 8:36AM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
localai-docker-api-1 |
localai-docker-api-1 | 8:36AM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
localai-docker-api-1 | 8:36AM DBG GRPC Service for all-MiniLM-L6-v2 will be running at: '127.0.0.1:44963'
localai-docker-api-1 | 8:36AM DBG GRPC Service state dir: /tmp/go-processmanager596413675
localai-docker-api-1 | 8:36AM DBG GRPC Service Started
localai-docker-api-1 | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr /opt/conda/envs/transformers/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
localai-docker-api-1 | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr warnings.warn(
localai-docker-api-1 | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr Server started. Listening on: 127.0.0.1:44963
localai-docker-api-1 | 8:36AM DBG GRPC Service Ready
localai-docker-api-1 | 8:36AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:all-MiniLM-L6-v2 ContextSize:512 Seed:2025187195 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/all-MiniLM-L6-v2 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
localai-docker-api-1 | [172.30.0.1]:41244 500 - POST /embeddings
Additional context
Hello shuther,
Thank you for reporting this issue. I apologize for the inconvenience you're experiencing with the LocalAI container image. Based on the information you've provided, it seems this is a memory allocation issue on your VM.
Could you please provide the output of the following command to help us gather more details about your GPU and system capabilities?
nvidia-smi
This command will give us information about the GPU usage and memory details.
Additionally, please try to run the command you provided but with the following modification:
CUDA_LAUNCH_BLOCKING=0
This will disable CUDA launch blocking, which might help to prevent the GPU memory error. You should run the command like this:
CUDA_LAUNCH_BLOCKING=0 curl http://linuxmain.local:8445/embeddings \
-X POST -H "Content-Type: application/json" \
-d '{"input": "Your text string goes here", "model": "text-embedding-ada-002}"'
I understand that this issue is crucial for your project, and I assure you that we'll do our best to investigate this problem and provide a solution as soon as possible.
Best regards,
The Github Bot (Experiment of @mudler)
It works fine if I stop the Docker container and start it again; we likely need a better eviction approach when switching models?
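For completeness, the restart workaround is roughly this, assuming the stack is managed with docker compose (the project/service names are inferred from the localai-docker-api-1 container name in the logs):
# workaround sketch: fully restart the stack to release GPU memory
docker compose down
docker compose up -d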
Extra logs: nvidia-smi when I launch the Docker container (initial load):
Thu Apr 25 11:14:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:13:00.0 On | N/A |
| 38% 38C P8 16W / 160W | 258MiB / 6144MiB | 22% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2158 G /usr/lib/xorg/Xorg 131MiB |
| 0 N/A N/A 2686 G /usr/bin/gnome-shell 67MiB |
| 0 N/A N/A 3376 G /usr/bin/nextcloud 3MiB |
| 0 N/A N/A 24782 G ...30092458,1701102826035513081,262144 50MiB |
+---------------------------------------------------------------------------------------+
I also spotted this error:
localai-docker-api-1 | 9:15AM INF Trying to load the model '5c7cd056ecf9a4bb5b527410b97f48cb' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/vall-e-x/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/diffusers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/exllama/run.sh
localai-docker-api-1 | 9:15AM INF [llama-cpp] Attempting to load
localai-docker-api-1 | 9:15AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-cpp
localai-docker-api-1 | 9:15AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
localai-docker-api-1 | 9:15AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: llama-cpp): {backendString:llama-cpp model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0000bae00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
localai-docker-api-1 | 9:15AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
localai-docker-api-1 | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:44089'
localai-docker-api-1 | 9:15AM INF [llama-cpp] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/llama-cpp: permission denied
localai-docker-api-1 | 9:15AM INF [llama-ggml] Attempting to load
localai-docker-api-1 | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:44789'
localai-docker-api-1 | 9:15AM INF [rwkv] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/rwkv: permission denied
...
localai-docker-api-1 | 9:15AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/whisper
localai-docker-api-1 | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:42503'
localai-docker-api-1 | 9:15AM INF [whisper] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/whisper: permission denied
localai-docker-api-1 | 9:15AM INF [stablediffusion] Attempting to load
...
localai-docker-api-1 | 9:15AM INF [/build/backend/python/vall-e-x/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/vall-e-x/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS
Now with LOCALAI_SINGLE_ACTIVE_BACKEND=true the embeddings endpoint works. I would recommend changing the docker compose YAML file to load the .env by default (and updating the documentation, since this seems to be a crucial parameter). Still, eviction in case of a memory error should be attempted?
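A minimal sketch of what I mean, assuming the api service gains an env_file entry so the .env is actually loaded into the container (this wiring is my suggestion, not the current default):
# .env (sketch) - keep only one backend resident on the GPU at a time
LOCALAI_SINGLE_ACTIVE_BACKEND=true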
nvidia-smi
Thu Apr 25 11:19:50 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:13:00.0 On | N/A |
| 38% 39C P8 13W / 160W | 4422MiB / 6144MiB | 20% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2158 G /usr/lib/xorg/Xorg 131MiB |
| 0 N/A N/A 2686 G /usr/bin/gnome-shell 67MiB |
| 0 N/A N/A 3376 G /usr/bin/nextcloud 3MiB |
| 0 N/A N/A 24782 G ...30092458,1701102826035513081,262144 50MiB |
| 0 N/A N/A 1647486 C python 0MiB |
| 0 N/A N/A 1647698 C python 0MiB |
+---------------------------------------------------------------------------------------+
I believe the eviction process is being assessed at the moment, maybe related to #2047 and #2102.