mistral.rs
OAI compat endpoint w/ Images?
Describe the bug
I finally got Llama 3.2 11B working, and `/image` works great with `-i`, but using it as an OAI-compat endpoint doesn't seem to accept base64 images. I get this error:
```
ERROR mistralrs_core::engine: prompt step - Model failed with error: Msg("The number of images in each batch [0] should be the same as the number of images [1]. The model cannot support a different number of images per patch. Perhaps you forgot a `<|image|>` tag?")
```
With this messages payload:
```json
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "who is this?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "(b64 dataurl)"
        }
      }
    ]
  }
]
```
I see no mention of `image_url` support in HTTP.md, so maybe this is not supported for the OAI-compat endpoint?
https://github.com/EricLBuehler/mistral.rs/blob/master/docs/HTTP.md
Latest commit or version
Using Docker: `ghcr.io/ericlbuehler/mistral.rs:cuda-86-sha-b38c72c`
Hmm, maybe this example hints that I need to include `<|image_1|>\n` in my text payload? Even if that worked, it's very strange for an OAI-compat endpoint. I'd recommend inferring `image_1` etc. in the text payload when necessary. I'll look into the code base in case I can contribute somewhere.
https://github.com/EricLBuehler/mistral.rs/blob/master/examples/server/phi3v_base64.py#L61
Yep, it works if I include `<|image|>` in my text payload in OpenWebUI, but I have to say that is not the OAI compat I'd expect. I can work with this for now, but I'll leave the bug report open, as there is room to more directly meet the OAI-compat standard. Thanks again for this excellent project, which lets me run 3.2 11B on my 3090.
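For reference, here is the shape of the payload that works today as a workaround (the `<|image|>` tag added manually to the text part; the text and URL values are placeholders):

```json
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "<|image|>\nwho is this?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "(b64 dataurl)"
        }
      }
    ]
  }
]
```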
Found the error line:
https://github.com/EricLBuehler/mistral.rs/blob/d28ddf96d9dc79469cd6f1f856ea4d2370819a80/mistralrs-core/src/vision_models/mllama/inputs_processor.rs#L285
So mllama expects those image tokens within the text section. What I'd want, then, is to inject the necessary token at the OAI payload-processing step (rather than within this mllama processing step) when image content is provided but the token is not already present.
This would match the OAI schema, which does not require the image token in the text content (necessary for vanilla OAI-compat consumption by OpenWebUI), and it would also accommodate existing users who already include the token.
I have never worked with Rust, but if someone doesn't beat me to it, I might try my hand at what I'm suggesting.
I think I worked this out with Claude. I will attempt to add a step here to detect and inject image tokens into the text part if they are not present:
https://github.com/EricLBuehler/mistral.rs/blob/d28ddf96d9dc79469cd6f1f856ea4d2370819a80/mistralrs-server/src/chat_completion.rs#L156
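A minimal sketch of the check-inject step I have in mind. The function name and signature are hypothetical, not the actual mistral.rs API; the real change would hook into the message-processing code linked above:

```rust
// Hypothetical helper, not the real mistral.rs API: prepend the model's
// image token to the text when images are supplied but no token is present.
fn inject_image_tokens(text: &str, image_count: usize, token: &str) -> String {
    // Nothing to do if there are no images, or the user already added the tag.
    if image_count == 0 || text.contains(token) {
        return text.to_string();
    }
    // mllama expects the token count to match the image count.
    let mut out = token.repeat(image_count);
    out.push('\n');
    out.push_str(text);
    out
}

fn main() {
    // Simulates the failing OpenWebUI request: one image, no tag in the text.
    let fixed = inject_image_tokens("who is this?", 1, "<|image|>");
    assert_eq!(fixed, "<|image|>\nwho is this?");
    // Existing users who already include the tag are left untouched.
    let kept = inject_image_tokens("<|image|>\nhi", 1, "<|image|>");
    assert_eq!(kept, "<|image|>\nhi");
    println!("ok");
}
```

Injecting at this layer keeps the model-specific processors (like the mllama one) unchanged while making the endpoint accept vanilla OAI payloads.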
I opened a PR with a minimal check-inject addition. Hope you can make use of it. It at least meets my own needs for natural use in OpenWebUI.
I am using Qwen2.5-VL, and I found that it runs in `-i` mode but not on the OpenAI-compat endpoint.