
OAI compat endpoint w/ Images?

bioshazard opened this issue 11 months ago · 7 comments

Describe the bug

I finally got Llama 3.2 11B working, and `/image` works great with `-i`, but using it as an OAI-compat endpoint doesn't seem to accept base64 images. I get this error:

ERROR mistralrs_core::engine: prompt step - Model failed with error: Msg("The number of images in each batch [0] should be the same as the number of images [1]. The model cannot support a different number of images per patch. Perhaps you forgot a `<|image|>` tag?")

With this messages payload:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "who is this?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "(b64 dataurl)"
        }
      }
    ]
  }
]

I see no mention of `image_url` support in HTTP.md, so maybe this is not supported for the OAI-compat endpoint?

https://github.com/EricLBuehler/mistral.rs/blob/master/docs/HTTP.md

Latest commit or version

Using docker: ghcr.io/ericlbuehler/mistral.rs:cuda-86-sha-b38c72c

bioshazard · Dec 27 '24

Hmm, maybe this example hints that I need to include `<|image_1|>\n` in my text payload? Even if that works, it's very strange for an OAI-compat endpoint. I'd recommend inferring `image_1`, etc. in the text payload if necessary. I'll look into the codebase in case I can contribute anywhere.

https://github.com/EricLBuehler/mistral.rs/blob/master/examples/server/phi3v_base64.py#L61

bioshazard · Dec 27 '24

Yep, it works if I include `<|image|>` in my text payload in OpenWebUI, but I have to say that is not the OAI compat I'd expect. I can work with this for now, though I'll leave the bug report open since there is room to more directly meet the OAI-compat standard. Thanks again for this excellent project, which lets me run 3.2 11B on my 3090.
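For reference, the workaround can be sketched as a small client-side helper that prepends the tag before sending the request (a sketch, not part of mistral.rs; the `<|image|>` literal is what worked here for mllama, and other models may use different tokens such as `<|image_1|>`):

```python
import base64


def build_image_message(text: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat message with the image tag prepended,
    so mistral.rs's mllama processor sees a tag count matching the image count."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            # The tag goes in the text part; without it, mllama errors out.
            {"type": "text", "text": f"<|image|>\n{text}"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{b64}"},
            },
        ],
    }
```

The returned dict can then be placed in the `messages` array of a normal `/v1/chat/completions` request.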

bioshazard · Dec 27 '24

Found the error line:

https://github.com/EricLBuehler/mistral.rs/blob/d28ddf96d9dc79469cd6f1f856ea4d2370819a80/mistralrs-core/src/vision_models/mllama/inputs_processor.rs#L285

bioshazard · Dec 29 '24

So I'm thinking that mllama expects those image tokens within the text section. My expectation is to inject the necessary token at the OAI payload-processing step (rather than within this mllama processing step) whenever image content is provided but the token is not already present.

This would accommodate clients whose schema doesn't put the image token in the text content (which is necessary for vanilla OAI-compat consumption by OpenWebUI), and it would also accommodate existing users who already include the token.

I have never worked with Rust, but if someone doesn't beat me to it I might try my hand at what I'm suggesting.
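The proposed check-and-inject step could look roughly like this (a Python sketch of the idea only; the real change would live in the Rust request-processing code in `chat_completion.rs`, and the `inject_image_tokens` name and the `<|image|>` default are assumptions for illustration):

```python
def inject_image_tokens(content: list[dict], image_tag: str = "<|image|>") -> list[dict]:
    """If a message has image parts but its text part lacks the image tag,
    prepend one tag per image so downstream counts match.
    Messages that already contain the tag are left untouched."""
    n_images = sum(1 for part in content if part.get("type") == "image_url")
    if n_images == 0:
        return content

    result = []
    injected = False
    for part in content:
        if (
            part.get("type") == "text"
            and image_tag not in part["text"]
            and not injected
        ):
            # Prepend one tag per image to the first untagged text part.
            part = {**part, "text": image_tag * n_images + "\n" + part["text"]}
            injected = True
        result.append(part)
    return result
```

This covers both cases from the discussion: plain OAI-compat clients like OpenWebUI get the tag injected, and payloads that already include it pass through unchanged.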

bioshazard · Dec 29 '24

I think I worked this out with Claude. I'll attempt to add a step here to detect and inject image tokens into the text part if they are not present.

https://github.com/EricLBuehler/mistral.rs/blob/d28ddf96d9dc79469cd6f1f856ea4d2370819a80/mistralrs-server/src/chat_completion.rs#L156

bioshazard · Dec 29 '24

Opened a PR with a minimal check-and-inject addition. Hope you can make use of it; it meets at least my own needs for using this naturally in OpenWebUI.

bioshazard · Dec 30 '24

I am using Qwen2.5-VL, and I found that it runs in `-i` mode but not via the OpenAI-compatible endpoint.

Bit0r · Mar 17 '25