mlx-vlm It seems that version v0.1.19 does not follow instructions and only describes images.

Hi,

Version v0.1.19 does not seem to follow instructions: no matter what prompt is entered, it only outputs an image description in the same format.

Prompt:

Describe the image concisely. If this image is slightly related to restaurant, print PASS. If this image is completely not related to a restaurant, print FAIL.

Result from v0.1.17

Here's an analysis of the image:

The image shows a wooden roller coaster with many people riding it. The coaster is at the top of a steep incline against a clear blue sky.

**Verdict: PASS**

The image depicts an amusement park, which is often associated with restaurants and food vendors.

Result from v0.1.19

Here's a description of the image:

**Overall Impression:** The image shows a wooden roller coaster, likely in a theme park or amusement park. The focus is on the top of the structure and the train of cars ascending a steep incline.

**Key Elements:**

*   **Roller Coaster Structure:** The roller coaster is made of wood, with a classic, sturdy-looking design. The wooden framework is intricate and visible.
*   **Roller Coaster Train:** A train of cars is ascending the incline. The cars are filled with people, and some are visible.
*   **Background:** The background is a clear blue sky.

**Overall Impression:** The image conveys a sense of excitement and adventure.

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/gemma-3-12b-it-6bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["./theme1.jpg"]
prompt = "Describe the image concisely. If this image is slightly related to restaurant, print PASS. If this image is completely not related to a restaurant, print FAIL."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate the response
output = generate(model, processor, formatted_prompt, image, max_tokens=256, verbose=False)
print(output)
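To make the difference between versions easier to check than by eyeballing the text, here is a minimal sketch of a helper that looks for the verdict the prompt asks for. The name `extract_verdict` is hypothetical and not part of mlx-vlm. It assumes the output is a plain string; if `generate` returns a result object in your version, pass its text instead.

```python
import re
from typing import Optional


def extract_verdict(output: str) -> Optional[str]:
    """Return "PASS" or "FAIL" if the text contains one as a whole word, else None.

    Hypothetical helper for this report only; not an mlx-vlm API.
    """
    match = re.search(r"\b(PASS|FAIL)\b", output)
    return match.group(1) if match else None


# Shortened samples taken from the two results quoted above.
followed = "The image shows a wooden roller coaster.\n\n**Verdict: PASS**"
ignored = "**Overall Impression:** The image shows a wooden roller coaster."

print(extract_verdict(followed))
print(extract_verdict(ignored))
```

A `None` result means the model produced no verdict at all, which is the behaviour described in this report.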

Mar 19 '25 23:03 swlee60

I noticed the same with Gemma, but with Mistral and Qwen it follows instructions. Maybe this has nothing to do with MLX-VLM and is model-related.

Mar 20 '25 07:03 abaranovskis-redsamurai

Still debugging; there is a bigger discussion.

Will check in detail during the weekend.

Mar 20 '25 21:03 Blaizzy

@swlee60 @abaranovskis-redsamurai

Related issue here. We are working on a fix :)

Mar 21 '25 11:03 FL33TW00D

Fixed

Nov 11 '25 00:11 Blaizzy