AIO - memory issue - embedding
LocalAI version: container image AIO CUDA 12 (latest)
Environment, CPU architecture, OS, and Version: Ubuntu 22.04 VM (latest), NVIDIA GeForce RTX 2060
Describe the bug: CUDA out-of-memory errors occur when switching between models and testing multiple prompts. The error appears for embeddings, while image generation works fine.
curl http://linuxmain.local:8445/embeddings \
-X POST -H "Content-Type: application/json" \
-d '{
"input": "Your text string goes here",
"model": "text-embedding-ada-002"
}'
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=OutOfMemoryError('CUDA out of memory. Tried to allocate 46.00 MiB. GPU 0 has a total capacty of 5.62 GiB of which 55.50 MiB is free. Process 46 has 0 bytes memory in use. Process 52 has 0 bytes memory in use. Process 122 has 0 bytes memory in use. Process 158 has 0 bytes memory in use. Process 223 has 0 bytes memory in use. Process 303 has 0 bytes memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'), type(err)=\u003cclass 'torch.cuda.OutOfMemoryError'\u003e","type":""}}
To Reproduce: run all the curl tests published in the documentation.
Expected behavior: no error; old models are evicted when memory pressure is too high.
Logs
localai-docker-api-1 | curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
localai-docker-api-1 | "input": "Your text string goes here",
localai-docker-api-1 | "model": "text-embedding-ada-002"
localai-docker-api-1 | }'}
localai-docker-api-1 | 8:36AM INF Loading model 'all-MiniLM-L6-v2' with backend sentencetransformers
localai-docker-api-1 | 8:36AM DBG Loading model in memory from file: /build/models/all-MiniLM-L6-v2
localai-docker-api-1 | 8:36AM DBG Loading Model all-MiniLM-L6-v2 with gRPC (file: /build/models/all-MiniLM-L6-v2) (backend: sentencetransformers): {backendString:sentencetransformers model:all-MiniLM-L6-v2 threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc001c7ee00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
localai-docker-api-1 | 8:36AM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
localai-docker-api-1 |
localai-docker-api-1 | 8:36AM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
localai-docker-api-1 | 8:36AM DBG GRPC Service for all-MiniLM-L6-v2 will be running at: '127.0.0.1:44963'
localai-docker-api-1 | 8:36AM DBG GRPC Service state dir: /tmp/go-processmanager596413675
localai-docker-api-1 | 8:36AM DBG GRPC Service Started
localai-docker-api-1 | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr /opt/conda/envs/transformers/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
localai-docker-api-1 | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr warnings.warn(
localai-docker-api-1 | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr Server started. Listening on: 127.0.0.1:44963
localai-docker-api-1 | 8:36AM DBG GRPC Service Ready
localai-docker-api-1 | 8:36AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:all-MiniLM-L6-v2 ContextSize:512 Seed:2025187195 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/all-MiniLM-L6-v2 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
localai-docker-api-1 | [172.30.0.1]:41244 500 - POST /embeddings
Additional context
Hello shuther,
Thank you for reporting this issue. I apologize for the inconvenience you're experiencing with the LocalAI container image. Based on the information you've provided, it seems this is a memory allocation issue on your VM.
Could you please provide the output of the following command to help us gather more details about your GPU and system capabilities?
nvidia-smi
This command will give us information about the GPU usage and memory details.
Additionally, please try to run the command you provided but with the following modification:
CUDA_LAUNCH_BLOCKING=0
This will disable CUDA launch blocking, which might help to prevent the GPU memory error. You should run the command like this:
CUDA_LAUNCH_BLOCKING=0 curl http://linuxmain.local:8445/embeddings \
-X POST -H "Content-Type: application/json" \
-d '{"input": "Your text string goes here", "model": "text-embedding-ada-002}"'
I understand that this issue is crucial for your project, and I assure you that we'll do our best to investigate this problem and provide a solution as soon as possible.
Best regards,
The Github Bot (Experiment of @mudler)
It works fine if I stop the Docker container and start it again; we likely need a better eviction approach when switching models?
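For completeness, the restart workaround is roughly this, assuming the stack is managed with docker compose (the project/service names are inferred from the localai-docker-api-1 container name in the logs):
# workaround sketch: fully restart the stack to release GPU memory
docker compose down
docker compose up -d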
Extra logs: nvidia-smi when I launch the Docker container (initial load):
Thu Apr 25 11:14:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:13:00.0 On | N/A |
| 38% 38C P8 16W / 160W | 258MiB / 6144MiB | 22% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2158 G /usr/lib/xorg/Xorg 131MiB |
| 0 N/A N/A 2686 G /usr/bin/gnome-shell 67MiB |
| 0 N/A N/A 3376 G /usr/bin/nextcloud 3MiB |
| 0 N/A N/A 24782 G ...30092458,1701102826035513081,262144 50MiB |
+---------------------------------------------------------------------------------------+
I also spotted this error:
localai-docker-api-1 | 9:15AM INF Trying to load the model '5c7cd056ecf9a4bb5b527410b97f48cb' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/vall-e-x/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/diffusers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/exllama/run.sh
localai-docker-api-1 | 9:15AM INF [llama-cpp] Attempting to load
localai-docker-api-1 | 9:15AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-cpp
localai-docker-api-1 | 9:15AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
localai-docker-api-1 | 9:15AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: llama-cpp): {backendString:llama-cpp model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0000bae00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
localai-docker-api-1 | 9:15AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
localai-docker-api-1 | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:44089'
localai-docker-api-1 | 9:15AM INF [llama-cpp] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/llama-cpp: permission denied
localai-docker-api-1 | 9:15AM INF [llama-ggml] Attempting to load
localai-docker-api-1 | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:44789'
localai-docker-api-1 | 9:15AM INF [rwkv] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/rwkv: permission denied
...
localai-docker-api-1 | 9:15AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/whisper
localai-docker-api-1 | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:42503'
localai-docker-api-1 | 9:15AM INF [whisper] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/whisper: permission denied
localai-docker-api-1 | 9:15AM INF [stablediffusion] Attempting to load
...
localai-docker-api-1 | 9:15AM INF [/build/backend/python/vall-e-x/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/vall-e-x/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS
Now with LOCALAI_SINGLE_ACTIVE_BACKEND=true the embeddings endpoint works. I would recommend changing the docker compose YAML file to load the .env by default (and updating the documentation, since this seems to be a crucial parameter). Still, eviction in case of a memory error should be attempted?
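A minimal sketch of what I mean, assuming the api service gains an env_file entry so the .env is actually loaded into the container (this wiring is my suggestion, not the current default):
# .env (sketch) - keep only one backend resident on the GPU at a time
LOCALAI_SINGLE_ACTIVE_BACKEND=true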
nvidia-smi
Thu Apr 25 11:19:50 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:13:00.0 On | N/A |
| 38% 39C P8 13W / 160W | 4422MiB / 6144MiB | 20% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2158 G /usr/lib/xorg/Xorg 131MiB |
| 0 N/A N/A 2686 G /usr/bin/gnome-shell 67MiB |
| 0 N/A N/A 3376 G /usr/bin/nextcloud 3MiB |
| 0 N/A N/A 24782 G ...30092458,1701102826035513081,262144 50MiB |
| 0 N/A N/A 1647486 C python 0MiB |
| 0 N/A N/A 1647698 C python 0MiB |
+---------------------------------------------------------------------------------------+
I believe the eviction process is being assessed at the moment, maybe related to #2047 and #2102.