
GPU VRAM detection

Open mudler opened this issue 2 months ago • 6 comments

After https://github.com/mudler/LocalAI/pull/7891 and https://github.com/mudler/LocalAI/pull/7907, which obviously have very good upsides such as smaller images and a simplified UX (one image to rule them all), there are also some drawbacks.

One of them is that we can no longer rely on the GPU vendor binaries (such as rocm-smi, vulkaninfo, etc.) being present in the container image. This leaves the user with three options I can think of at the moment:

  • Pre-execute a LocalAI script on start to install the required tools, or build a container image of LocalAI with the required tools included
  • Mount the binaries in the image manually from the host
  • Run LocalAI outside the container image in the host (with the tools installed)
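Whichever option is chosen, LocalAI first has to figure out which vendor query tools are actually reachable at runtime. A minimal sketch of that probe, in Python rather than LocalAI's Go (the tool-to-vendor mapping and the `xpu-smi` entry for Intel are assumptions, not LocalAI's actual detection logic):

```python
import shutil

# Hypothetical mapping of GPU vendor to the CLI tool used to query VRAM.
# rocm-smi and vulkaninfo are the binaries mentioned above; the others
# are assumptions for illustration.
VENDOR_TOOLS = {
    "nvidia": "nvidia-smi",
    "amd": "rocm-smi",
    "intel": "xpu-smi",
    "vulkan": "vulkaninfo",
}

def available_vram_tools() -> dict[str, str]:
    """Return {vendor: absolute path} for each query tool found on PATH."""
    return {
        vendor: path
        for vendor, tool in VENDOR_TOOLS.items()
        if (path := shutil.which(tool)) is not None
    }
```

Inside a slimmed-down container this would typically return an empty dict, which is exactly the failure mode described here: the monitoring code has nothing to shell out to unless the binaries are installed, mounted in, or available on the host.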

This issue is mainly a discussion point on how to tackle this. There are trade-offs here between container image size, UX, and VRAM monitoring features. We can of course still build separate container images for each GPU vendor; that would minimize the impact, but it would be nice to reduce the strain on the CI.

mudler avatar Jan 07 '26 15:01 mudler

For the moment restored the workflow to keep building images here: https://github.com/mudler/LocalAI/pull/7910 https://github.com/mudler/LocalAI/pull/7911.

However, the master images also work with Nvidia GPUs without any issues (I've tested with the DGX Spark).

The master images won't display GPU usage with ROCm, Vulkan, and Intel.

mudler avatar Jan 07 '26 15:01 mudler

What about creating a backend that exports VRAM info instead of doing inference? Then we could suggest the user install it next to the RAM usage display if they have a GPU.
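The core of such a backend would just be parsing a vendor tool's output into something the LocalAI UI can display. A sketch of the Nvidia case, assuming the backend shells out to `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` (which emits one `used, total` line per GPU in MiB); `parse_nvidia_smi` is a hypothetical name, not part of LocalAI:

```python
def parse_nvidia_smi(output: str) -> list[dict[str, int]]:
    """Parse `memory.used, memory.total` CSV lines (MiB) into per-GPU dicts."""
    gpus = []
    for line in output.strip().splitlines():
        used, total = (int(v) for v in line.split(","))
        gpus.append({"vram_used_mib": used, "vram_total_mib": total})
    return gpus

# Example with canned output for a two-GPU machine:
sample = "1024, 24576\n512, 8192\n"
print(parse_nvidia_smi(sample))
# → [{'vram_used_mib': 1024, 'vram_total_mib': 24576},
#    {'vram_used_mib': 512, 'vram_total_mib': 8192}]
```

The equivalent ROCm/Intel/Vulkan parsers would differ only in the tool invoked and the output format, which is what makes a dedicated, per-vendor VRAM backend an attractive way to keep the base image slim.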

richiejp avatar Jan 07 '26 15:01 richiejp

UX + features > container image size. Non-container use cases are a no-go for many people (myself included): polluting the OS with workload-specific binaries and configuration is something many people would like to avoid.

Also, from the perspective of someone who mixes various GPU generations and often vendors: having one image with both ROCm + Vulkan would be a great improvement over separate images for ROCm and Vulkan. From that perspective, we might as well throw in CUDA and have one unified image with everything, even if it weighs 30GB. Most models weigh ten times that (per model!), so anyone looking to run AI has to prepare storage anyway.

Expro avatar Jan 07 '26 16:01 Expro

we might as well throw in CUDA and have one unified image with everything, even if it weighs 30GB.

It's a good line of thinking, but there are a lot of hidden issues with it. By default, buildkit is really bad at creating images this size because of the layer merging phase, unless something drastic has been done to improve it. In fact it struggles even with a few GB, which only becomes visible once you have a lot of image builds, so it would actually be nice for our CI/CD pipeline to go in the other direction and strip unnecessary stuff from the images.

Also, there are use cases running only small models on edge devices, where images this size are going to cause serious issues with both bandwidth and storage space.

richiejp avatar Jan 08 '26 10:01 richiejp

I get the comment about CI/CD issues, but regarding edge devices: I seriously doubt that is a use case LocalAI would ever be used for. If anything, those users would most likely land on llama.cpp and a single model, not a whole system managing multiple backends and models.

Expro avatar Jan 08 '26 16:01 Expro

To be clear, we do use it with NVIDIA Jetson with multiple backends. These devices are strong enough to run LocalAI's core and take the hit from using an HTTP (or websocket) API without noticing, but huge image sizes are a problem.

For the actual use cases, think on-site safety and security monitoring systems and robotics. Typically these require multiple models to handle different time scales (e.g. an agentic VLM as the slowest level, then realtime voice, or object detection to catch events as they happen and trigger slower-running processes).

richiejp avatar Jan 08 '26 16:01 richiejp