server : (experimental) vision support via libmtmd
Continuation of #12849
This is my first attempt at bringing libmtmd to server.cpp.
For the list of supported models, see: https://github.com/ggml-org/llama.cpp/blob/master/examples/llava/README.md
Implementation
(TODO: update this)
TODOs
- [x] automatically deactivate certain features if vision is enabled; we will work on these features later
- [x] implement hash function for image (to keep track of the cache)
- [ ] fix `detokenize(server_inp_chunk)`
- [ ] add more error handling
- [ ] maybe support remote `image_url` in addition to `base64`
Demo
The server can be run with this command:
```sh
llama-server -hf ggml-org/gemma-3-4b-it-GGUF
```
Client code (ONLY base64 input is supported at the moment):
```python
import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "describe what you see in details" },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")
print("\n\n")
```
With the image:
This will output:
Awesome work. However, I noticed that the model usually ignores the text prompt when it is the first message in the conversation!
@qnixsynapse can you capture the raw HTTP request? If the JSON payload is big, you can share it via a gist.
@ngxson Will this be okay? https://gist.github.com/qnixsynapse/a4c61368d05180d3cb6c00f1baedf92c
At minimum I'm asking for this: https://wiki.wireshark.org/hyper_text_transfer_protocol, not the raw IP packet.
I don't have Wireshark installed, unfortunately. But you can still inspect it, for example:
```
POST /v1/chat/completions HTTP/1.1
Host: localhost:8080
Authorization: Bearer -key
Content-Type: application/json
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.11 aiohttp/3.11.11
Content-Length: 615117

{"stream": true, "model": "Gemma", "messages": [{"role": "user", "content": [{"type": "text", "text": "Fact check the content in this image please."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,<base64 png data from line 88>"}}]}], "stream_options": {"include_usage": true}, "temperature": 1.0, "top_p": 0.9}

HTTP/1.1 200 OK
Keep-Alive: timeout=5, max=100
Content-Type: text/event-stream
Server: llama.cpp
Transfer-Encoding: chunked
Access-Control-Allow-Origin:
```
@qnixsynapse I had a problem with my logic which made it discard the text batch that comes before the image batch.
It should be fixed now, could you give it a try?
Btw @ggerganov I'm noting here for visibility: while working on this PR, I realized that there are 2 refactorings which can be done in their own dedicated PRs:
- The first one is quite simple: currently `server_task` is passed by copy in some places, we need to add some `std::move`
- The second one is a bit more tricky. Currently, we track everything using a `std::vector<llama_token>`. However, for multimodal, I introduced the notion of "input chunks" along with `libmtmd`. The server needs to be adapted to work with chunks of tokens / embeddings instead of a simple list of tokens.

In the current PR, I'm kinda hacking this by having `server_inp_chunk` wrap around one single text token (so most of the text-related logic is unchanged). But obviously this brings some complications when dealing with both text + image chunks. Do you have any better ideas to handle this?
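To make what I mean more concrete, here is a rough sketch of the current hack (type and field names are illustrative, not the actual definitions in this PR; `llama_token` comes from `llama.h`):

```cpp
#include <memory>
#include <vector>

// Sketch only: each chunk wraps either exactly one text token (so the existing
// per-token logic keeps working) or one opaque image chunk coming from libmtmd.
struct server_inp_chunk {
    llama_token           tok_text;  // meaningful when tok_image is null
    std::shared_ptr<void> tok_image; // placeholder for whatever libmtmd returns for an image

    bool is_image() const { return tok_image != nullptr; }
};

// The prompt then becomes std::vector<server_inp_chunk> instead of
// std::vector<llama_token>: text chunks map 1:1 to tokens, while a single
// image chunk can occupy many KV cache positions, which is where the extra
// bookkeeping (and the complication mentioned above) comes from.
```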
And I also have a question regarding the logic around `batch_view`. IIRC, this is because sometimes the batch is too large for `llama_decode` to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatches, so I'm wondering if this logic is now obsolete.
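For readers following along, the logic I'm referring to looks roughly like this (paraphrased, not the exact code; it assumes the surrounding server state such as `ctx` and `batch`): the batch is decoded in windows of `n_batch` tokens, and when `llama_decode` fails the window is halved and retried.

```cpp
int32_t n_batch = llama_n_batch(ctx);

for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
    const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

    // a "view" into the big batch: same arrays, offset by i
    llama_batch batch_view = {
        n_tokens,
        batch.token    + i,
        nullptr,                 // no embeddings in this path
        batch.pos      + i,
        batch.n_seq_id + i,
        batch.seq_id   + i,
        batch.logits   + i,
    };

    const int ret = llama_decode(ctx, batch_view);
    if (ret != 0) {
        if (n_batch == 1 || ret < 0) {
            // cannot shrink further, or a hard error: give up
            break;
        }
        // could not find a free KV slot: retry the same window with half the size
        n_batch /= 2;
        i -= n_batch;
        continue;
    }
}
```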
Edit: optionally one more refactoring: we should split llama-server into different compilation units, currently it may take up to 20s to compile.
@ngxson ~~Can you please refresh this branch with master?~~
Nvm. Ended up using your fork .. ~~working great!!!~~ 👍
On further testing, it seems that the llama_batch size is sometimes exceeded on successive requests.
```
common/common.cpp:1161: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed
```
> And I also have a question regarding the logic around `batch_view`. IIRC, this is because sometimes the batch is too large for `llama_decode` to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatches, so I'm wondering if this logic is now obsolete.
This was useful mainly before the defragmentation support was added. The reason is that with time the KV cache can become highly fragmented, and even if it has capacity for n_tokens it won't be able to find a contiguous slot, so attempting to split the batch into smaller chunks was a way to work around this. With defragmentation enabled by default this is now rarely necessary. So yes, this should be simplified in a separate PR.
I'll think about the input chunk question today and let you know if I have any thoughts.
Seems like the batch decoding dies when you send a variety of longer requests.
```
common/common.cpp:1159: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed
```
The easiest way to trigger it is to just wiggle the sequence length around, like with the example code:
```python
import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

for mult in [100, 200]:  # (beinsezii) make sure it has to rebuild some cache the 2nd time
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        stream=True,
        messages=[
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "describe what you see in details\n" * mult },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
    )

    for chunk in response:
        print(chunk.choices[0].delta.content, end="")
    print("\n\n")
```
Image hash (SHA1) is implemented in https://github.com/ggml-org/llama.cpp/pull/12898/commits/f5420e1d90bf7228c12bb5f8cd85808c4cb00ba8 , which should allow reusing KV cache for image tokens.
It would be nice if anyone could test this (or even better, write a python script to hammer it).
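For context, the hash is only used as a stable id for prefix matching; conceptually it is something like the sketch below (names are hypothetical, the real matching code in server.cpp is different):

```cpp
#include <string>
#include <vector>

// Hypothetical illustration: with a stable per-image id (the hash), the cached
// prompt and the incoming prompt can be compared chunk by chunk, and the KV
// cache entries for the common prefix are kept instead of re-encoding the image.
struct chunk_id {
    bool        is_image;
    std::string id; // image hash for image chunks, token id/text for text chunks

    bool operator==(const chunk_id & other) const {
        return is_image == other.is_image && id == other.id;
    }
};

static size_t common_prefix_len(const std::vector<chunk_id> & cached,
                                const std::vector<chunk_id> & incoming) {
    size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) {
        n++;
    }
    return n; // everything before this index can stay in the KV cache
}
```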
> Image hash (SHA1) is implemented in f5420e1, which should allow reusing KV cache for image tokens.
Is this actually implemented, or is it just a framework? Even using your own example with bliss.png it encodes every time as of f5420e1d90bf7228c12bb5f8cd85808c4cb00ba8
> Is this actually implemented, or is it just a framework? Even using your own example with bliss.png it encodes every time as of f5420e1
What is your test code or request? I reran the python test code in the PR description more than once, and it does not re-encode the image.
> What is your test code or request? I reran the python test code in the PR description more than once, and it does not re-encode the image.
Using your demo code with bliss.png I get the image encoded in 400ms every single time I run the script, even with 100% identical requests. I'll do some poking to see why mine does that. I ran cmake again just to confirm I'm on the latest commit.
Update: Running your demo code 3 times using the following server command
```sh
bin/llama-server -hf google/gemma-3-27b-it-qat-q4_0-gguf -c 8192 -ngl 99 --api-key "sk-test" -hft $(cat ~/.cache/huggingface/token)
```
I get this output
stdout.txt
I suppose technically there is a difference from 465 to 416 ms but I feel like that's just warmup.
Made a pure CPU build (because that's the only way to get mmproj on CPU?) and successive runs go from 9.6 sec to 8.9 sec in reported image encode time. A delta of 7% is not quite what I would expect from checksummed caching.
@Beinsezii from your log, it seems like the image is invalidated each time; only the 12 tokens (I suppose text tokens) are preserved:
```
slot update_slots: id 0 | task 0 | kv cache rm [268, end)
```
Probably the hash is not calculated correctly. Can you print the hash near this line (in server.cpp)?
```cpp
bmp.id = std::string((char *)result, 20);
printf("hash: %s\n", bmp.id.c_str()); // <== ADD THIS
```
printf("hash: %s\n", bmp.id.c_str());
~~@ngxson it's either garbage or the checksum needs to be ASCII encoded first~~
Upon looking further I think it is just not hex encoded.
```
srv update_slots: all slots are idle
hash: �MEV
��e�R�6���
srv params_from_: Chat format: Content-only
. . .
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
hash: ����5�@+8�'bLX6F9�
srv params_from_: Chat format: Content-only
```
@ngxson so I asked QwQ and it gave me
```cpp
SHA1_CTX sha1_ctx;
SHA1Init(&sha1_ctx); // New line
SHA1Update(&sha1_ctx, (unsigned char const *)file.data(), file.size());
```
Which did indeed fix it on my end when tested against multiple images. Given that it worked for you otherwise, I'm assuming there's a race condition for the SHA1 init?
Possibly not important, but QwQ was also sketched out by the fact that putting raw SHA results into a string could lead to an early null byte.
@Beinsezii Hmm ok, thanks for spotting that. It's not a race condition, but without `SHA1Init` I think the initial vector is initialized to a random value on the heap/stack.
> @Beinsezii Hmm ok, thanks for spotting that. It's not a race condition, but without `SHA1Init` I think the initial vector is initialized to a random value on the heap/stack.
That makes sense, but now I'm even more confused as to why it was consistent on your end lol. I'm not even sure zeroed pages would work, because surely enough stuff happens over multiple requests that it would use reclaimed memory. Maybe your discrete RNG lava lamps are unplugged.
A hash can be an arbitrary byte sequence, right? It's not necessarily a valid string. You probably want to print it out byte by byte, using something like:
printf("hash = ");
for (int i = 0, n = sizeof result; i < n; ++i) {
printf("%02hhx", result[i]);
}
printf("\n");
> A hash can be an arbitrary byte sequence, right? It's not necessarily a valid string.
Yes, but storing it as a hex string is easier for debugging, so it must be converted to a hex string to prevent potential problems with null bytes. This conversion is currently missing in the code.
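A minimal helper for that conversion could look like this (just a sketch, not code that exists in the PR yet):

```cpp
#include <cstdio>
#include <string>

// Convert a raw digest (e.g. the 20-byte SHA-1 result) into a lowercase hex
// string, so the id can never contain an embedded null byte and prints cleanly.
static std::string bytes_to_hex(const unsigned char * data, size_t len) {
    std::string out;
    out.reserve(len * 2);
    char buf[3];
    for (size_t i = 0; i < len; ++i) {
        snprintf(buf, sizeof(buf), "%02x", data[i]);
        out += buf;
    }
    return out;
}

// usage (illustrative): bmp.id = bytes_to_hex(result, 20);
```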
Significant changes in the last commits:
- bump to latest `master`, we're now supporting Pixtral 12B
- using FNV hash, computed over the image bitmap (NOT the raw file data)
- support large image batches, so models like granite-vision or minicpm-v won't crash
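For reference, the FNV hash over the bitmap is conceptually this simple (a sketch using the standard 64-bit FNV-1a constants; the exact variant and its integration in the PR may differ):

```cpp
#include <cstddef>
#include <cstdint>

// FNV-1a over the decoded bitmap bytes, so the hash depends on the pixel data
// rather than on how the file happened to be compressed.
static uint64_t fnv1a_64(const unsigned char * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;          // FNV prime
    }
    return hash;
}
```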
> bump to latest `master`, we're now supporting Pixtral 12B
Curious if Small 3.1 uses the same vision mechanism or if that will need more work as well.
Update: seems like Pixtral is broken. It thinks bliss.png is a "blue and green grid" and other images it just interprets as corrupted or noise.
> Update: seems like Pixtral is broken
Which backend are you using? Does it give the same result when running via llama-mtmd-cli?
> Which backend are you using? Does it give the same result when running via `llama-mtmd-cli`?
ROCm, and it seems to be temperature dependent?
At 0.1 temp it will reply:
> It seems we're starting with an image of a serene landscape featuring a clear blue sky transitioning into lush green fields below.
whereas at temp 1.0 it is:
> it seems that the image you've shared contains a pattern of repeating colors and shapes that might be difficult to describe precisely without more context.
Meanwhile on CPU it always recognizes it as a landscape even at temp 2.0. ROCm at 2.0 claims there isn't an image at all lol. I imagine something is wrong because I don't think temp should swing the results that hard for such a simple prompt.
Haven't tried Vulkan yet. Identical behavior with mtmd-cli. Shall I open an issue?
Slight update: even with pure text the model just seems really bad on ROCm at a moderate or high temp. ~~I wonder if this is just fp16 vs fp32 compute?~~ Alright, even with CUDA_F16 off and f32 K/V cache, the whole model is completely unusable on ROCm with even a mild temp lol.
I'm getting wildly incorrect outputs with Pixtral. I'm using both the server API and llama-mtmd-cli: the server seems to completely ignore that I've sent an image, while the CLI outputs garbage, mentioning either a mosaic of colors or just producing complete nonsense. This image in particular made it go nuts, counting up from 2013 until generation stopped.
I'm using a 7900xtx, compiled with ROCm. Running it on CPU and GPU produced different, but still incorrect, results.
@HAV0X1014 if you're trying CPU, try a clean CPU-only build without HIP compiled at all. For some reason, compiling with HIP but using `--ngl 0` can still break some models. GLM 4 is the same way.
For the problem with pixtral, please follow: https://github.com/ggml-org/llama.cpp/pull/13065#issuecomment-2826580374
Is there a way to pass images via non-chat completion yet? I see in the server readme that at one point /completion could substitute images like:
```nu
http post http://127.0.0.1:8080/completion --content-type application/json {
    prompt: 'What is in this image?[img-12]',
    "image_data": [{"data": (open /tmp/bliss.png | encode base64), "id": 12}]
}
```
but I don't believe that's functional anymore.
@Beinsezii I didn't spend time adding /completions support because this PR already took me a lot of time.