VRAM is not freed when stopping models
LocalAI version: localai/localai:v3.9.0-gpu-nvidia-cuda-13
Environment, CPU architecture, OS, and Version:
Linux 6.17.0-8-generic #8-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 14 21:44:46 UTC 2025 x86_64 GNU/Linux
AMD CPU
Nvidia RTX 3060 GPU
Ubuntu 25.10
Kubernetes (RKE2 v1.35.0+rke2r1), using Nvidia k8s-device-plugin with timeSlicing GPU sharing method
Describe the bug
When LocalAI stops a model (either manually or via LRU eviction), the child backend process is not killed, so the VRAM remains allocated.
To Reproduce
- Start up a model using any backend (llama or stablediffusion)
- Click the stop model button
- Observe that the VRAM is still shown as allocated in the LocalAI GUI. Also observe that `nvidia-smi` on the host shows a child process still running and holding the VRAM.
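One way to spot the leftover child from inside the container (a sketch; here `$$` and a backgrounded `sleep` stand in for the `local-ai` PID and its backend child, which you could find with something like `pgrep -f local-ai`):

```shell
# Illustration using the current shell as the parent; in the pod,
# substitute the local-ai PID for $$.
sleep 5 &                            # stand-in for a backend child
ps -o pid,ppid,state,cmd --ppid $$   # list children still alive under the parent
```

After stopping a model, the backend binary should no longer appear in this listing; in my case it does.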
Expected behavior
When LocalAI stops a model, the child process should stop, and the VRAM should be freed.
Logs
After stopping a model, the GPU use in the GUI stays the same, but no models are listed below it:
nvidia-smi shows a child process of LocalAI still running:
Sat Jan 10 17:37:10 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A |
| 0% 56C P8 15W / 170W | 6061MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 445091 C ...ds/cuda13-llama-cpp/lib/ld.so 6052MiB |
+-----------------------------------------------------------------------------------------+
In this case, PID 445091 is a child of the LocalAI parent process.
Additional context
I can "fix" this by restarting the pod, but for some reason LocalAI is not killing its child processes in my setup when a model is unloaded. This ties up VRAM and keeps me from starting other models.
@jroeber can you list the process states in your pod when this happens? Is the backend process in a zombie state?
Sure, here's the process tree:
root 482339 0.0 0.0 1243288 17128 ? Sl 17:48 0:00 /var/lib/rancher/rke2/data/v1.35.0-rke2r1-a512a7604726/bin/containerd-shim-runc-v2 -namespace k8s.io -id fb9b1efd0737e3af41266c55c4ab83bd984413653d886ac900a9c71a90e901f3 -address /run/k3s/containerd/containerd.sock
6422527 482371 0.0 0.0 980 548 ? Ss 17:48 0:00 \_ /pause
6356992 482445 1.4 0.1 3972864 74092 ? Ssl 17:48 0:07 \_ ./local-ai --debug
6356992 487005 0.7 1.1 47652848 747184 ? Sl 17:51 0:02 \_ /backends/cuda13-llama-cpp/lib/ld.so /backends/cuda13-llama-cpp/llama-cpp-avx512 --addr 127.0.0.1:36477
I am using hostUsers: false (user-namespaced pods) for this pod.
@jroeber ok that doesn't look like a zombie process. Can you please share:
- logs with `--debug` when this is happening
- output of `ps -o pid,ppid,state,wchan,cmd -p <PID>`
There are a couple of improvements we can make nevertheless in https://github.com/mudler/go-processmanager/pull/3, but I don't think these should change anything for llama.cpp processes
For the model process:
$ ps -o pid,ppid,state,wchan,cmd -p 487005
PID PPID S WCHAN CMD
487005 482445 S - /backends/cuda13-llama-cpp/lib/ld.so /backends/cuda13-llama-cpp/llama-cpp-avx512 --addr 127.0.0.1:36477
For the local-ai process:
$ ps -o pid,ppid,state,wchan,cmd -p 482445
PID PPID S WCHAN CMD
482445 482339 S - ./local-ai --debug
Logs from after killing the model:
Jan 10 18:07:09 DEBUG Sending chunk chunk="{\"created\":1768068423,\"object\":\"chat.completion.chunk\",\"id\":\"1140b35a-d5e7-4035-8957-cb0cf59b188d\",\"model\":\"qwen3-8b\",\"choices\":[{\"index\":0,\"finish_reason\":null,\"delta\":{\"content\":\"\"}}],\"usage\":{\"prompt_tokens\":10,\"completion_tokens\":126,\"total_tokens\":136}}" caller={caller.file="/build/core/http/endpoints/openai/chat.go" caller.L=380 }
Jan 10 18:07:09 DEBUG GRPC stderr id="qwen3-8b-127.0.0.1:36853" line=" eval time = 2002.04 ms / 126 tokens ( 15.89 ms per token, 62.94 tokens per second)" caller={caller.file="/build/pkg/model/process.go" caller.L=146 }
Jan 10 18:07:09 DEBUG GRPC stderr id="qwen3-8b-127.0.0.1:36853" line=" total time = 2016.06 ms / 136 tokens" caller={caller.file="/build/pkg/model/process.go" caller.L=146 }
Jan 10 18:07:09 DEBUG GRPC stderr id="qwen3-8b-127.0.0.1:36853" line="slot release: id 0 | task 0 | stop processing: n_tokens = 135, truncated = 0" caller={caller.file="/build/pkg/model/process.go" caller.L=146 }
Jan 10 18:07:09 DEBUG GRPC stderr id="qwen3-8b-127.0.0.1:36853" line="srv update_slots: all slots are idle" caller={caller.file="/build/pkg/model/process.go" caller.L=146 }
Jan 10 18:07:09 DEBUG No choices in the response, skipping caller={caller.file="/build/core/http/endpoints/openai/chat.go" caller.L=367 }
Jan 10 18:07:09 DEBUG Stream ended caller={caller.file="/build/core/http/endpoints/openai/chat.go" caller.L=448 }
Jan 10 18:07:09 INFO HTTP request method="POST" path="/v1/chat/completions" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 DEBUG No system backends found caller={caller.file="/build/core/gallery/backends.go" caller.L=335 }
Jan 10 18:07:11 INFO Using forced capability from environment variable capability="nvidia" env="LOCALAI_FORCE_META_BACKEND_CAPABILITY" caller={caller.file="/build/pkg/system/capabilities.go" caller.L=64 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/highlightjs.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/theme.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/typography.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/general.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/animations.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/components.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/fontawesome/css/solid.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/fontawesome/css/brands.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/highlightjs.js" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/fontawesome/css/fontawesome.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/tw-elements.css" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/playfair-display-bold.ttf" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/space-grotesk-regular.ttf" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/flowbite.min.js" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/tailwindcss.js" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/logo_horizontal.png" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/logo.png" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/tw-elements.js" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/alpine.js" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/marked.js" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/purify.js" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/fontawesome/webfonts/fa-solid-900.woff2" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/space-grotesk-medium.ttf" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/space-grotesk-semibold.ttf" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/fontawesome/webfonts/fa-brands-400.woff2" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/api/resources" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:11 INFO HTTP request method="GET" path="/static/assets/jetbrains-mono-regular.ttf" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:15 DEBUG Deleting process model="qwen3-8b" caller={caller.file="/build/pkg/model/process.go" caller.L=45 }
Jan 10 18:07:15 ERROR (deleteProcess) error while deleting process error=permission denied model="qwen3-8b" caller={caller.file="/build/pkg/model/process.go" caller.L=56 }
Jan 10 18:07:15 INFO HTTP request method="POST" path="/backend/shutdown" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:21 INFO HTTP request method="GET" path="/api/resources" status=200 caller={caller.file="/build/core/http/app.go" caller.L=111 }
Jan 10 18:07:24 DEBUG No system backends found caller={caller.file="/build/core/gallery/backends.go" caller.L=335 }
This is from starting up qwen3-8b, giving it a short prompt, then manually stopping the model in the GUI. I hadn't noticed the `permission denied` error before. What would keep the process from being stopped?
Jan 10 18:07:15 ERROR (deleteProcess) error while deleting process error=permission denied model="qwen3-8b" caller={caller.file="/build/pkg/model/process.go" caller.L=56 }
interesting, it looks from the logs like we don't have enough permissions. My wild guess is that it comes from https://github.com/mudler/go-processmanager/blob/8b802d3ecf828b2bfed26d32fc00dc2dc3e4e23d/process.go#L176 .
Do you have permission to send process signals? Can you check whether lowering the permissions bar (e.g. by setting privileged) makes it work? Just to nail down whether it's a permission issue. I'll add more detailed error messages to trace this clearly
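For reference, what the process manager does ultimately boils down to a plain kill(2) on the backend PID; a shell sketch of the same operation (with a `sleep` as a stand-in child — in a sandboxed container, the kill itself is where EPERM would surface):

```shell
# Spawn a stand-in for a backend child and terminate it the way a
# process manager would.
sleep 30 &
child=$!

# Sending the signal requires permission to signal the target; this is
# where "permission denied" (EPERM) would come back in the container.
if kill -TERM "$child" 2>/dev/null; then
    echo "signalled $child"
else
    echo "permission denied signalling $child"
fi
wait "$child" 2>/dev/null || true   # reap the child so no zombie lingers
```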
Okay - running it as a privileged container allows me to stop the model manually without errors.
I then removed the security context entirely, which caused the pod to run in non-privileged mode but in the host Linux namespace (standard for most Kubernetes setups), and the permission error came back.
For reference, here is my deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: localai
spec:
selector:
matchLabels:
app: localai
template:
metadata:
labels:
app: localai
spec:
# hostUsers: false # this was true at the start of the thread, I set to false for debugging
runtimeClassName: nvidia
containers:
- name: localai
image: localai/localai:v3.9.0-gpu-nvidia-cuda-13
command: ["./entrypoint.sh", "--debug"]
#securityContext:
#privileged: true
#allowPrivilegeEscalation: true # when these two are set and the others commented, it works
# allowPrivilegeEscalation: false
# capabilities:
# drop:
# - ALL
# seccompProfile:
# type: RuntimeDefault # these are normally in place, but commented out for debugging
env:
- name: LOCALAI_FORCE_META_BACKEND_CAPABILITY
value: nvidia
- name: LOCALAI_MAX_ACTIVE_BACKENDS
value: "1"
resources:
requests:
memory: "16Gi"
cpu: "4"
limits:
memory: "32Gi"
cpu: "16"
nvidia.com/gpu: "1"
ports:
- containerPort: 8080
# ...plus a bunch of volume mount stuff for configMap and persistence
we might need to add a specific stanza here, probably dependent on the container runtime, to add CAP_KILL (https://unofficial-kubernetes.readthedocs.io/en/latest/concepts/policy/container-capabilities/):
securityContext:
capabilities:
add:
- KILL
To note, I've been running this without issues on k3s.
Explicitly adding the KILL capability had no effect. (I also tried CAP_KILL.)
I do have RKE2's CIS mode enabled, but I currently have the localai namespace excluded from that policy in `rke2-pss.yaml`.
I'll do some more digging to see if there's anything else peculiar about my setup.
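One quick diagnostic I can run inside the container to see what actually landed in the capability sets (note that, per kill(2), signalling a child with the same UID shouldn't require CAP_KILL at all, which makes the EPERM extra puzzling):

```shell
# Print the inheritable/permitted/effective capability sets of the
# current process. With "drop: ALL" CapEff should be all zeroes; with
# "add: KILL" the CAP_KILL bit (number 5, so mask 0x20) should be set.
grep -E 'Cap(Inh|Prm|Eff)' /proc/self/status
```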
Okay, I've done some experimentation and have narrowed it down to an issue with running a pod with hostUsers: false.
This pod spec, which is compatible with the built-in Restricted Pod Security Standard, works (i.e. allows LocalAI to kill model processes without encountering permission issues):
spec:
hostUsers: true
runtimeClassName: nvidia
containers:
- name: localai
image: localai/localai:v3.9.0-gpu-nvidia-cuda-13
command: ["./entrypoint.sh", "--debug"]
securityContext:
capabilities:
drop:
- ALL
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
runAsUser: 1007
runAsGroup: 1007
runAsNonRoot: true
This pod spec, which uses privileged mode (therefore incompatible with Restricted) within a user namespace, works:
spec:
hostUsers: false
runtimeClassName: nvidia
containers:
- name: localai
image: localai/localai:v3.9.0-gpu-nvidia-cuda-13
command: ["./entrypoint.sh", "--debug"]
securityContext:
privileged: true
However, I have been unable to find any combination of capabilities to add to a non-privileged, user-namespaced pod (ALL, KILL, SYS_ADMIN, etc.) that gives it the ability to kill the model processes. For some reason, it must be a privileged pod. I have also tried `securityContext.seccompProfile.type: Unconfined` with no luck.
So, while I would like to run this in a user namespace as a non-privileged pod and don't see why it wouldn't work, I can run in the host namespace as a non-root user with no capabilities instead, which is just as good for me. Maybe this will help someone else trying to do something similar later on, though.
My guess is it's a Kubernetes issue rather than a LocalAI issue, so feel free to close if you agree.