
VRAM is not freed when stopping models

Open jroeber opened this issue 2 months ago • 10 comments

LocalAI version: localai/localai:v3.9.0-gpu-nvidia-cuda-13

Environment, CPU architecture, OS, and Version:

Linux 6.17.0-8-generic #8-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 14 21:44:46 UTC 2025 x86_64 GNU/Linux

AMD CPU Nvidia RTX 3060 GPU Ubuntu 25.10 Kubernetes (RKE2 v1.35.0+rke2r1), using Nvidia k8s-device-plugin with timeSlicing GPU sharing method

Describe the bug

When LocalAI stops a model (either manually or via LRU eviction), the child process is not killed, so the VRAM remains allocated.

To Reproduce

  1. Start up a model using any backend (llama or stablediffusion)
  2. Click the stop model button
  3. Observe that the VRAM is still shown as allocated in the LocalAI GUI. Also observe that nvidia-smi on the host shows a child process still running and holding the VRAM.

Expected behavior

When LocalAI stops a model, the child process should stop, and the VRAM should be freed.

Logs

After stopping a model, the GPU use in the GUI stays the same, but no models are listed below it:

[Screenshot: GUI showing GPU memory still in use, with no models listed]

nvidia-smi shows a child process of LocalAI still running:

Sat Jan 10 17:37:10 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   56C    P8             15W /  170W |    6061MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          445091      C   ...ds/cuda13-llama-cpp/lib/ld.so       6052MiB |
+-----------------------------------------------------------------------------------------+

In this case, PID 445091 is a child of the LocalAI parent process.

Additional context

I can "fix" this by restarting the pod, but for some reason LocalAI is not killing its child processes in my setup when a model is unloaded. This ties up VRAM and keeps me from starting other models.

jroeber avatar Jan 10 '26 17:01 jroeber

@jroeber can you list the process states in your pod when this happens? Is the backend process in a zombie state?

mudler avatar Jan 10 '26 17:01 mudler

Sure, here's the process tree:

root      482339  0.0  0.0 1243288 17128 ?       Sl   17:48   0:00 /var/lib/rancher/rke2/data/v1.35.0-rke2r1-a512a7604726/bin/containerd-shim-runc-v2 -namespace k8s.io -id fb9b1efd0737e3af41266c55c4ab83bd984413653d886ac900a9c71a90e901f3 -address /run/k3s/containerd/containerd.sock
6422527   482371  0.0  0.0    980   548 ?        Ss   17:48   0:00  \_ /pause
6356992   482445  1.4  0.1 3972864 74092 ?       Ssl  17:48   0:07  \_ ./local-ai --debug
6356992   487005  0.7  1.1 47652848 747184 ?     Sl   17:51   0:02      \_ /backends/cuda13-llama-cpp/lib/ld.so /backends/cuda13-llama-cpp/llama-cpp-avx512 --addr 127.0.0.1:36477

I am using hostUsers: false (user-namespaced pods) for this pod.

jroeber avatar Jan 10 '26 17:01 jroeber

@jroeber ok that doesn't look like a zombie process. Can you please share:

  • logs with --debug when this is happening
  • output of ps -o pid,ppid,state,wchan,cmd -p <PID>

There are a couple of improvements we can make nevertheless in https://github.com/mudler/go-processmanager/pull/3, but I don't think they should change anything for llama.cpp processes.

mudler avatar Jan 10 '26 18:01 mudler

For the model process:

$ ps -o pid,ppid,state,wchan,cmd -p 487005
    PID    PPID S WCHAN  CMD
 487005  482445 S -      /backends/cuda13-llama-cpp/lib/ld.so /backends/cuda13-llama-cpp/llama-cpp-avx512 --addr 127.0.0.1:36477

For the local-ai process:

$ ps -o pid,ppid,state,wchan,cmd -p 482445
    PID    PPID S WCHAN  CMD
 482445  482339 S -      ./local-ai --debug

Logs from after killing the model:

Jan 10 18:07:09 DEBUG Sending chunk chunk="{\"created\":1768068423,\"object\":\"chat.completion.chunk\",\"id\":\"1140b35a-d5e7-4035-8957-cb0cf59b188d\",\"model\":\"qwen3-8b\",\"choices\":[{\"index\":0,\"finish_reason\":null,\"delta\":{\"content\":\"\"}}],\"usage\":{\"prompt_tokens\":10,\"completion_tokens\":126,\"total_tokens\":136}}" caller={caller.file="/build/core/http/endpoints/openai/chat.go"  caller.L=380 } 
Jan 10 18:07:09 DEBUG GRPC stderr id="qwen3-8b-127.0.0.1:36853" line="       eval time =    2002.04 ms /   126 tokens (   15.89 ms per token,    62.94 tokens per second)" caller={caller.file="/build/pkg/model/process.go"  caller.L=146 } 
Jan 10 18:07:09 DEBUG GRPC stderr id="qwen3-8b-127.0.0.1:36853" line="      total time =    2016.06 ms /   136 tokens" caller={caller.file="/build/pkg/model/process.go"  caller.L=146 } 
Jan 10 18:07:09 DEBUG GRPC stderr id="qwen3-8b-127.0.0.1:36853" line="slot      release: id  0 | task 0 | stop processing: n_tokens = 135, truncated = 0" caller={caller.file="/build/pkg/model/process.go"  caller.L=146 } 
Jan 10 18:07:09 DEBUG GRPC stderr id="qwen3-8b-127.0.0.1:36853" line="srv  update_slots: all slots are idle" caller={caller.file="/build/pkg/model/process.go"  caller.L=146 } 
Jan 10 18:07:09 DEBUG No choices in the response, skipping caller={caller.file="/build/core/http/endpoints/openai/chat.go"  caller.L=367 } 
Jan 10 18:07:09 DEBUG Stream ended caller={caller.file="/build/core/http/endpoints/openai/chat.go"  caller.L=448 } 
Jan 10 18:07:09 INFO  HTTP request method="POST" path="/v1/chat/completions" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 DEBUG No system backends found caller={caller.file="/build/core/gallery/backends.go"  caller.L=335 } 
Jan 10 18:07:11 INFO  Using forced capability from environment variable capability="nvidia" env="LOCALAI_FORCE_META_BACKEND_CAPABILITY" caller={caller.file="/build/pkg/system/capabilities.go"  caller.L=64 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/highlightjs.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/theme.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/typography.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/general.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/animations.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/components.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/fontawesome/css/solid.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/fontawesome/css/brands.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/highlightjs.js" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/fontawesome/css/fontawesome.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/tw-elements.css" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/playfair-display-bold.ttf" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/space-grotesk-regular.ttf" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/flowbite.min.js" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/tailwindcss.js" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/logo_horizontal.png" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/logo.png" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/tw-elements.js" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/alpine.js" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/marked.js" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/purify.js" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/fontawesome/webfonts/fa-solid-900.woff2" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/space-grotesk-medium.ttf" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/space-grotesk-semibold.ttf" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/fontawesome/webfonts/fa-brands-400.woff2" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/api/resources" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:11 INFO  HTTP request method="GET" path="/static/assets/jetbrains-mono-regular.ttf" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:15 DEBUG Deleting process model="qwen3-8b" caller={caller.file="/build/pkg/model/process.go"  caller.L=45 } 
Jan 10 18:07:15 ERROR (deleteProcess) error while deleting process error=permission denied model="qwen3-8b" caller={caller.file="/build/pkg/model/process.go"  caller.L=56 } 
Jan 10 18:07:15 INFO  HTTP request method="POST" path="/backend/shutdown" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:21 INFO  HTTP request method="GET" path="/api/resources" status=200 caller={caller.file="/build/core/http/app.go"  caller.L=111 } 
Jan 10 18:07:24 DEBUG No system backends found caller={caller.file="/build/core/gallery/backends.go"  caller.L=335 } 

This is from starting up qwen3-8b, giving a short prompt, then manually stopping the model in the GUI. I didn't notice the permission denied error before. What would keep the process from being stopped?

jroeber avatar Jan 10 '26 18:01 jroeber

Jan 10 18:07:15 ERROR (deleteProcess) error while deleting process error=permission denied model="qwen3-8b" caller={caller.file="/build/pkg/model/process.go" caller.L=56 }

Interesting, it looks from the logs like we don't have enough permissions. My wild guess is that it comes from https://github.com/mudler/go-processmanager/blob/8b802d3ecf828b2bfed26d32fc00dc2dc3e4e23d/process.go#L176.

Do you have permissions to send process signals? Can you check whether lowering the permission bar (e.g. by setting privileged) makes it work? Just to nail down whether it's a permission issue. I'll add more detailed error messages to trace this more clearly.
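One safe way to probe signal permissions from inside the pod (a generic sketch, not specific to LocalAI — substitute the backend's PID for the test child) is kill -0, which performs only the kernel permission check without delivering a signal:

```shell
# Spawn a throwaway child to probe against; in the real case you would
# use the backend PID from `ps` instead.
sleep 60 &
pid=$!

# "kill -0" sends no signal; it only checks whether the caller is
# permitted to signal the target, so it is safe against a live backend.
if kill -0 "$pid" 2>/dev/null; then
  echo "can signal $pid"
else
  echo "permission denied for $pid"
fi

kill "$pid"   # clean up the test child
```

If this prints "permission denied" for the backend PID, the EPERM from deleteProcess is reproducible independently of LocalAI.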

mudler avatar Jan 10 '26 18:01 mudler

Okay - running it as a privileged container allows me to stop the model manually without errors.

I then removed the security context entirely, which caused the pod to run in non-privileged mode but in the host user namespace (the default for most Kubernetes setups), and the permission error came back.

jroeber avatar Jan 10 '26 18:01 jroeber

For reference, here is my deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: localai
spec:
  selector:
    matchLabels:
      app: localai
  template:
    metadata:
      labels:
        app: localai
    spec:
      # hostUsers: false # this was true at the start of the thread, I set to false for debugging
      runtimeClassName: nvidia
      containers:
      - name: localai
        image: localai/localai:v3.9.0-gpu-nvidia-cuda-13
        command: ["./entrypoint.sh", "--debug"]
        #securityContext:
            #privileged: true
            #allowPrivilegeEscalation: true # when these two are set and the others commented, it works

            # allowPrivilegeEscalation: false
            # capabilities:
            #   drop:
            #     - ALL
            # seccompProfile:
            #   type: RuntimeDefault # these are normally in place, but commented out for debugging
        env:
          - name: LOCALAI_FORCE_META_BACKEND_CAPABILITY
            value: nvidia
          - name: LOCALAI_MAX_ACTIVE_BACKENDS
            value: "1"
        resources:
          requests:
            memory: "16Gi"
            cpu: "4"
          limits:
            memory: "32Gi"
            cpu: "16"
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8080
        # ...plus a bunch of volume mount stuff for configMap and persistence

jroeber avatar Jan 10 '26 18:01 jroeber

we might need to add a specific stanza here, probably dependent on the container runtime, to add CAP_KILL (https://unofficial-kubernetes.readthedocs.io/en/latest/concepts/policy/container-capabilities/):

    securityContext:
      capabilities:
        add:
        - KILL

To note, I've been running this without issues on k3s.
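To verify whether an added capability actually reached the container process, a minimal sketch (assuming a Linux /proc filesystem; the capsh decode step is optional) is:

```shell
# Print the effective capability bitmask of the current process.
# CAP_KILL is capability number 5, i.e. bit 0x20 in CapEff.
grep CapEff /proc/self/status

# If capsh is installed, decode the hex mask into capability names.
if command -v capsh >/dev/null 2>&1; then
  capsh --decode="$(awk '/CapEff/ {print $2}' /proc/self/status)"
fi
```

Running this inside the pod shows whether the `add: [KILL]` stanza survived the runtime's capability filtering.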

mudler avatar Jan 10 '26 18:01 mudler

Explicitly adding the KILL capability had no effect. (I also tried CAP_KILL.)

I do have RKE2's CIS mode enabled, but I currently have the localai namespace excluded from that policy in rke2-pss.yaml.

I'll do some more digging to see if there's anything else peculiar about my setup.

jroeber avatar Jan 10 '26 18:01 jroeber

Okay, I've done some experimentation and have narrowed it down to an issue with running a pod with hostUsers: false.

This pod spec, which is compatible with the built-in Restricted Pod Security Standard, works (i.e. allows LocalAI to kill model processes without encountering permission issues):

spec:
  hostUsers: true
  runtimeClassName: nvidia
  containers:
  - name: localai
    image: localai/localai:v3.9.0-gpu-nvidia-cuda-13
    command: ["./entrypoint.sh", "--debug"]
    securityContext:
      capabilities:
        drop:
          - ALL
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
      runAsUser: 1007
      runAsGroup: 1007
      runAsNonRoot: true

This pod spec, which uses privileged mode (therefore incompatible with Restricted) within a user namespace, works:

spec:
  hostUsers: false
  runtimeClassName: nvidia
  containers:
  - name: localai
    image: localai/localai:v3.9.0-gpu-nvidia-cuda-13
    command: ["./entrypoint.sh", "--debug"]
    securityContext:
      privileged: true

However, I have been unable to find any combination of capabilities to add to a non-privileged, user-namespaced pod (ALL, KILL, SYS_ADMIN, etc.) that gives it the ability to kill the model processes. For some reason, it must be a privileged pod. I have also tried securityContext.seccompProfile.type = Unconfined with no luck.
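For anyone comparing the two modes, the user-namespace difference is visible in the UID mapping: with hostUsers: false the container's root is remapped to an unprivileged host UID, so anything that evaluates credentials from outside the namespace sees a different identity. A generic way to inspect the mapping (not LocalAI-specific):

```shell
# Each line is: <in-namespace UID> <host UID> <range>.
# On the host, or in a pod with hostUsers: true, this shows the identity
# mapping "0 0 4294967295"; in a user-namespaced pod the second column
# is a high, unprivileged host UID chosen by the kubelet.
cat /proc/self/uid_map
```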

So, while I would like to run this in a user namespace as a non-privileged pod and don't see why it wouldn't work, I can run in the host namespace as a non-root user with no capabilities instead, which is just as good for me. Maybe this will help someone else trying to do something similar later on, though.

My guess is it's a Kubernetes issue rather than a LocalAI issue, so feel free to close if you agree.

jroeber avatar Jan 11 '26 02:01 jroeber