server : (experimental) vision support via libmtmd
Continuation of #12849
This is my first attempt at bringing libmtmd to server.cpp.
For the list of supported models, see: https://github.com/ggml-org/llama.cpp/blob/master/examples/llava/README.md
Implementation
(TODO: update this)
TODOs
- [x] automatically deactivate certain features if vision is enabled; we will work on these features later
- [x] implement hash function for image (to keep track of the cache)
- [ ] fix `detokenize(server_inp_chunk)`
- [ ] add more error handling
- [ ] maybe support remote `image_url` in addition to `base64`
Demo
The server can be run with this command:
```sh
llama-server -hf ggml-org/gemma-3-4b-it-GGUF
```
Client code (ONLY base64 input is supported at the moment):
```python
import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "describe what you see in details" },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")
print("\n\n")
```
With the image:
This will output:
Awesome work. However, I noticed that the model usually ignores the text prompt when it is the first message in the conversation!
@qnixsynapse can you capture the raw HTTP request? If the JSON payload is big, you can share it via a gist.
@ngxson Will this be okay? https://gist.github.com/qnixsynapse/a4c61368d05180d3cb6c00f1baedf92c
At minimum I'm asking for this: https://wiki.wireshark.org/hyper_text_transfer_protocol, not the raw IP packet.
I don't have Wireshark installed, unfortunately. But you can still inspect it, for example:
```
POST /v1/chat/completions HTTP/1.1
Host: localhost:8080
Authorization: Bearer -key
Content-Type: application/json
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.11 aiohttp/3.11.11
Content-Length: 615117

{"stream": true, "model": "Gemma", "messages": [{"role": "user", "content": [{"type": "text", "text": "Fact check the content in this image please."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,<base64 png data from line 88>"}}]}], "stream_options": {"include_usage": true}, "temperature": 1.0, "top_p": 0.9}

HTTP/1.1 200 OK
Keep-Alive: timeout=5, max=100
Content-Type: text/event-stream
Server: llama.cpp
Transfer-Encoding: chunked
Access-Control-Allow-Origin:
```
@qnixsynapse I had a problem with my logic which made it discard the text batch that comes before the image batch.
It should be fixed now, could you give it a try?
Btw @ggerganov I'm noting here for visibility: while working on this PR, I realized that there are 2 refactorings which can be done in their own dedicated PRs:
- The first one is quite simple: currently `server_task` is passed by copy in some places, we need to add some `std::move`
- The second one is a bit more tricky. Currently, we track everything using a `std::vector<llama_token>`. However, for multimodal, I introduced the notion of "input chunks" along with `libmtmd`. The server needs to be adapted to work with chunks of tokens / embeddings instead of a simple list of tokens.

In the current PR, I'm kinda hacking this by having `server_inp_chunk` wrap around one single text token (so most of the text-related logic is unchanged). But obviously this brings some complications when dealing with both text + image chunks. Do you have any better ideas to handle this?
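To make what I mean more concrete, here is a rough sketch of the current hack (type and field names are illustrative, not the actual definitions in this PR; `llama_token` comes from `llama.h`):

```cpp
#include <memory>
#include <vector>

// Sketch only: each chunk wraps either exactly one text token (so the existing
// per-token logic keeps working) or one opaque image chunk coming from libmtmd.
struct server_inp_chunk {
    llama_token           tok_text;  // meaningful when tok_image is null
    std::shared_ptr<void> tok_image; // placeholder for whatever libmtmd returns for an image

    bool is_image() const { return tok_image != nullptr; }
};

// The prompt then becomes std::vector<server_inp_chunk> instead of
// std::vector<llama_token>: text chunks map 1:1 to tokens, while a single
// image chunk can occupy many KV cache positions, which is where the extra
// bookkeeping (and the complication mentioned above) comes from.
```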
And I also have a question regarding the logic around `batch_view`. IIRC, this is because sometimes the batch is too large for `llama_decode` to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatches, so I'm wondering if this logic is now obsolete.
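For readers following along, the logic I'm referring to looks roughly like this (paraphrased, not the exact code; it assumes the surrounding server state such as `ctx` and `batch`): the batch is decoded in windows of `n_batch` tokens, and when `llama_decode` fails the window is halved and retried.

```cpp
int32_t n_batch = llama_n_batch(ctx);

for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
    const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

    // a "view" into the big batch: same arrays, offset by i
    llama_batch batch_view = {
        n_tokens,
        batch.token    + i,
        nullptr,                 // no embeddings in this path
        batch.pos      + i,
        batch.n_seq_id + i,
        batch.seq_id   + i,
        batch.logits   + i,
    };

    const int ret = llama_decode(ctx, batch_view);
    if (ret != 0) {
        if (n_batch == 1 || ret < 0) {
            // cannot shrink further, or a hard error: give up
            break;
        }
        // could not find a free KV slot: retry the same window with half the size
        n_batch /= 2;
        i -= n_batch;
        continue;
    }
}
```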
Edit: optionally one more refactoring: we should split llama-server into different compilation units, currently it may take up to 20s to compile.
@ngxson ~~Can you please refresh this branch with master?~~
Nvm. Ended up using your fork .. ~~working great!!!~~ 👍
On further testing, it seems that the llama_batch size is sometimes exceeded on successive requests.
```
common/common.cpp:1161: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed
```
> And I also have a question regarding the logic around `batch_view`. IIRC, this is because sometimes the batch is too large for `llama_decode` to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatches, so I'm wondering if this logic is now obsolete.
This was useful mainly before the defragmentation support was added. The reason is that with time the KV cache can become highly fragmented, and even if it has capacity for n_tokens it won't be able to find a contiguous slot, so attempting to split the batch into smaller chunks was a way to work around this. With defragmentation enabled by default this is now rarely necessary. So yes, this should be simplified in a separate PR.
I'll think about the input chunk question today and let you know if I have any thoughts.
Seems like the batch decoding dies when you send a variety of longer requests.
```
common/common.cpp:1159: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed
```
The easiest way to trigger it is to just wiggle the sequence length around, like with the example code:
```python
import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

for mult in [100, 200]:  # (beinsezii) make sure it has to rebuild some cache the 2nd time
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        stream=True,
        messages=[
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "describe what you see in details\n" * mult },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
    )

    for chunk in response:
        print(chunk.choices[0].delta.content, end="")
    print("\n\n")
```
Image hash (SHA1) is implemented in https://github.com/ggml-org/llama.cpp/pull/12898/commits/f5420e1d90bf7228c12bb5f8cd85808c4cb00ba8 , which should allow reusing KV cache for image tokens.
It would be nice if anyone could test this (or even better, write a python script to hammer it).
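For context, the hash is only used as a stable id for prefix matching; conceptually it is something like the sketch below (names are hypothetical, the real matching code in server.cpp is different):

```cpp
#include <string>
#include <vector>

// Hypothetical illustration: with a stable per-image id (the hash), the cached
// prompt and the incoming prompt can be compared chunk by chunk, and the KV
// cache entries for the common prefix are kept instead of re-encoding the image.
struct chunk_id {
    bool        is_image;
    std::string id; // image hash for image chunks, token id/text for text chunks

    bool operator==(const chunk_id & other) const {
        return is_image == other.is_image && id == other.id;
    }
};

static size_t common_prefix_len(const std::vector<chunk_id> & cached,
                                const std::vector<chunk_id> & incoming) {
    size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) {
        n++;
    }
    return n; // everything before this index can stay in the KV cache
}
```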
> Image hash (SHA1) is implemented in f5420e1, which should allow reusing KV cache for image tokens.
Is this actually implemented, or is it just a framework? Even using your own example with bliss.png it encodes every time as of f5420e1d90bf7228c12bb5f8cd85808c4cb00ba8
> Is this actually implemented, or is it just a framework? Even using your own example with bliss.png it encodes every time as of f5420e1
What is your test code or request? I reran the python test code in the PR description more than once, and it does not re-encode the image.
> What is your test code or request? I reran the python test code in the PR description more than once, and it does not re-encode the image.
Using your demo code with bliss.png I get the image encoded in 400ms every single time I run the script, even with 100% identical requests. I'll do some poking to see why mine does that. I ran cmake again just to confirm I'm on the latest commit.
Update: Running your demo code 3 times using the following server command
```sh
bin/llama-server -hf google/gemma-3-27b-it-qat-q4_0-gguf -c 8192 -ngl 99 --api-key "sk-test" -hft $(cat ~/.cache/huggingface/token)
```
I get this output
stdout.txt
I suppose technically there is a difference from 465 to 416 ms but I feel like that's just warmup.
Made a pure CPU build (because that's the only way to get mmproj on CPU?) and successive runs go from 9.6 sec to 8.9 sec in reported image encode time. A delta of 7% is not quite what I would expect from checksummed caching.
@Beinsezii from your log, it seems like the image is invalidated each time; only the 12 tokens (I suppose text tokens) are preserved:
```
slot update_slots: id 0 | task 0 | kv cache rm [268, end)
```
Probably the hash is not calculated correctly. Can you print the hash near this line (in server.cpp)?
```cpp
bmp.id = std::string((char *)result, 20);
printf("hash: %s\n", bmp.id.c_str()); // <== ADD THIS
```
printf("hash: %s\n", bmp.id.c_str());
~~@ngxson it's either garbage or the checksum needs to be ASCII encoded first~~
Upon looking further I think it is just not hex encoded.
```
srv update_slots: all slots are idle
hash: �MEV
��e�R�6���
srv params_from_: Chat format: Content-only
. . .
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
hash: ����5�@+8�'bLX6F9�
srv params_from_: Chat format: Content-only
```
@ngxson so I asked QwQ and it gave me
```cpp
SHA1_CTX sha1_ctx;
SHA1Init(&sha1_ctx); // New line
SHA1Update(&sha1_ctx, (unsigned char const *)file.data(), file.size());
```
Which did indeed fix it on my end when tested against multiple images. Given that it worked for you otherwise, I'm assuming there's a race condition for the SHA1 init?
Possibly not important, but QwQ was also sketched out by the fact that putting raw SHA results into a string could lead to an early null byte.
@Beinsezii Hmm ok, thanks for spotting that. It's not a race condition, but without `SHA1Init` I think the initial vector is initialized to a random value on the heap/stack.
> @Beinsezii Hmm ok, thanks for spotting that. It's not a race condition, but without `SHA1Init` I think the initial vector is initialized to a random value on the heap/stack.
That makes sense, but now I'm even more confused as to why it was consistent on your end lol. I'm not even sure zeroed pages would work, because surely enough stuff happens over multiple requests that it would use reclaimed memory. Maybe your discrete RNG lava lamps are unplugged.
A hash can be an arbitrary byte sequence, right? It's not necessarily a valid string. You probably want to print it out byte by byte, using something like:
printf("hash = ");
for (int i = 0, n = sizeof result; i < n; ++i) {
printf("%02hhx", result[i]);
}
printf("\n");
> A hash can be an arbitrary byte sequence, right? It's not necessarily a valid string.
Yes, but storing it as a hex string is easier for debugging, so it must be converted to a hex string to prevent potential problems with null bytes. This conversion is currently missing in the code.
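A minimal helper for that conversion could look like this (just a sketch, not code that exists in the PR yet):

```cpp
#include <cstdio>
#include <string>

// Convert a raw digest (e.g. the 20-byte SHA-1 result) into a lowercase hex
// string, so the id can never contain an embedded null byte and prints cleanly.
static std::string bytes_to_hex(const unsigned char * data, size_t len) {
    std::string out;
    out.reserve(len * 2);
    char buf[3];
    for (size_t i = 0; i < len; ++i) {
        snprintf(buf, sizeof(buf), "%02x", data[i]);
        out += buf;
    }
    return out;
}

// usage (illustrative): bmp.id = bytes_to_hex(result, 20);
```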
Significant changes in the last commits:
- bump to latest `master`, we're now supporting Pixtral 12B
- using FNV hash, computed over the image bitmap (NOT the raw file data)
- support large image batches, so models like granite-vision or minicpm-v won't crash
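For reference, the FNV hash over the bitmap is conceptually this simple (a sketch using the standard 64-bit FNV-1a constants; the exact variant and its integration in the PR may differ):

```cpp
#include <cstddef>
#include <cstdint>

// FNV-1a over the decoded bitmap bytes, so the hash depends on the pixel data
// rather than on how the file happened to be compressed.
static uint64_t fnv1a_64(const unsigned char * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;          // FNV prime
    }
    return hash;
}
```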
> bump to latest `master`, we're now supporting Pixtral 12B
Curious if Small 3.1 uses the same vision mechanism or if that will need more work as well.
Update: seems like Pixtral is broken. It thinks bliss.png is a "blue and green grid" and other images it just interprets as corrupted or noise.
> Update: seems like Pixtral is broken
Which backend are you using? Does it give the same result when running via llama-mtmd-cli?
> Which backend are you using? Does it give the same result when running via `llama-mtmd-cli`?
ROCm, and it seems to be temperature dependent?
At 0.1 temp it will reply:
> It seems we're starting with an image of a serene landscape featuring a clear blue sky transitioning into lush green fields below.
whereas at temp 1.0 it is:
> it seems that the image you've shared contains a pattern of repeating colors and shapes that might be difficult to describe precisely without more context.
Meanwhile on CPU it always recognizes it as a landscape even at temp 2.0. ROCm at 2.0 claims there isn't an image at all lol. I imagine something is wrong because I don't think temp should swing the results that hard for such a simple prompt.
Haven't tried Vulkan yet. Identical behavior with mtmd-cli. Shall I open an issue?
Slight update: even with pure text the model just seems really bad on ROCm at a moderate or high temp. ~~I wonder if this is just fp16 vs fp32 compute?~~ Alright, even with CUDA_F16 off and f32 K/V cache, the whole model is completely unusable on ROCm with even a mild temp lol.
I'm getting wildly incorrect outputs with Pixtral. I'm using both the server API and llama-mtmd-cli: the server seems to completely ignore that I've sent an image, while the CLI outputs garbage, mentioning either a mosaic of colors or just producing complete nonsense. This image in particular made it go nuts, counting up from 2013 until generation stopped.
I'm using a 7900xtx, compiled with ROCm. Running it on CPU and GPU produced different, but still incorrect, results.
@HAV0X1014 if you're trying CPU, try a clean CPU-only build without HIP compiled at all. For some reason, compiling with HIP but using `--ngl 0` can still break some models. GLM 4 is the same way.
For the problem with pixtral, please follow: https://github.com/ggml-org/llama.cpp/pull/13065#issuecomment-2826580374
Is there a way to pass images via non-chat completion yet? I see in the server readme that at one point /completion could substitute images like:
```nu
http post http://127.0.0.1:8080/completion --content-type application/json {
    prompt: 'What is in this image?[img-12]',
    "image_data": [{"data": (open /tmp/bliss.png | encode base64), "id": 12}]
}
```
but I don't believe that's functional anymore.
@Beinsezii I didn't spend time adding /completions support because this PR already took me a lot of time.