
Model Performance & Compatibility Report (2025-11-14)

jrp2014 opened this issue 3 months ago • 9 comments

This report summarizes the results of a test run on 31 vision-language models using mlx-vlm. The test involved processing a single image with a detailed prompt. Full results / diagnostics can be found at https://github.com/jrp2014/scripts/tree/main/src/output

Of the 31 models tested, 5 models failed to load or run, and several others produced unusable (gibberish, repetitive, or instruction-failed) output. This suggests potential compatibility issues with mlx-vlm, mlx, or transformers.

1. System & Library Versions

| Component | Version |
| --- | --- |
| OS | Darwin 25.1.0 (macOS 26.1) |
| Chip | Apple M4 Max (16 physical cores, 40 GPU cores) |
| Python | 3.13.7 |
| mlx | 0.29.5 |
| mlx-lm | 0.28.4 |
| mlx-vlm | 0.3.5 |
| transformers | 4.57.1 |
| huggingface-hub | 0.36.0 |
| Pillow | 12.0.0 |

2. (a) Failures & Potential Bug Report

🔴 Category 1: Hard Failures (Load/Runtime Errors)

Five models failed to run entirely.

1.1: ImportError (Transformers Incompatibility)

These models failed during the loading of the processor, suggesting an incompatibility with transformers==4.57.1. The models' remote code attempts to import _validate_images_text_input_order from transformers.processing_utils, which appears to be unavailable.

  • mlx-community/Kimi-VL-A3B-Thinking-2506-bf16
  • mlx-community/Kimi-VL-A3B-Thinking-8bit

Error Log:

ImportError: cannot import name '_validate_images_text_input_order' from 'transformers.processing_utils' (.../transformers/processing_utils.py)

1.2: ImportError (TensorFlow Dependency)

These models require tensorflow, which is not part of the MLX environment, causing the processor loading to fail.

  • mlx-community/Molmo-7B-D-0924-8bit
  • mlx-community/Molmo-7B-D-0924-bf16

Error Log:

ImportError: This modeling file requires the following packages that were not found in your environment: tensorflow. Run `pip install tensorflow`

1.3: RuntimeError (Metal Malloc Failure)

This model failed during generation with a Metal error, attempting to allocate an impossibly large buffer (137GB). This points to a potential issue in mlx or mlx-vlm when handling this model's architecture.

  • mlx-community/Qwen2-VL-2B-Instruct-4bit

Error Log:

RuntimeError: [metal::malloc] Attempting to allocate 137220936192 bytes which is greater than the maximum allowed buffer size of 86586540032 bytes.
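For scale, the figures in the error message work out to a ~128 GiB request against an ~81 GiB single-buffer limit — a quick sanity check:

```python
# Convert the byte counts from the Metal error message into GiB.
GIB = 1024 ** 3

requested = 137_220_936_192  # bytes the model tried to allocate
limit = 86_586_540_032       # maximum allowed Metal buffer size

print(f"requested: {requested / GIB:.1f} GiB")
print(f"limit:     {limit / GIB:.1f} GiB")
```

Orders of magnitude more than a 2B-parameter model should ever need, which is why this looks like a shape-inference or broadcasting bug rather than a genuine capacity problem.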

🟡 Category 2: Generation & Output Issues (Potential Bugs)

Several models ran but produced problematic output, suggesting issues with instruction following, repetition handling, or eos token logic in mlx-vlm.

  • Gibberish/Repetition:

    • mlx-community/Florence-2-large-ft-8bit: Output was a stream of <s> tokens.
    • prince-canuma/Florence-2-large-ft: Same as above, output was a stream of <s> tokens.
    • mlx-community/deepseek-vl2-8bit: Output was repetitive gibberish: And thetheimage, and theyoucan have a lot of theimg....
  • Prompt Repetition:

    • mlx-community/paligemma2-3b-pt-896-4bit: The model's entire output was just a repetition of the prompt.
  • Failed Instruction-Following (Hallucination/Debugging Output):

    • microsoft/Phi-3.5-vision-instruct: Produced a correct caption, but then hallucinated an unrelated Q&A about math combinations.
    • mlx-community/Apriel-1.5-15b-Thinker-6bit-MLX: Did not provide the caption, but instead output its reasoning steps for how it would create the caption.
    • mlx-community/Qwen3-VL-2B-Thinking-bf16: Same as Apriel, it output its internal reasoning monologue instead of the final requested items.

3. (b) Successful Model Output Quality

This section assesses the quality of the models that produced usable, descriptive output based on the image and the provided context.

⭐ Excellent

These models successfully integrated the visual content with the provided context (Chilham, 15th-century, etc.) and formatted the output correctly as requested (Caption, Description, Keywords).

  • mlx-community/InternVL3-14B-8bit: Provided a perfect caption, description, and keyword list, correctly identifying all key contextual elements.
  • mlx-community/gemma-3-27b-it-qat-4bit & ...-8bit: Both produced excellent, context-aware captions, descriptions, and extremely comprehensive keyword lists.
  • mlx-community/pixtral-12b-8bit & ...-bf16: Both provided high-quality, well-formatted captions, descriptions, and keywords, correctly using the context.
  • mlx-community/gemma-3n-E4B-it-bf16: Provided excellent, distinct "Visual Description" and "Contextual Information" sections, correctly identifying "15th-century half-timbered houses" and even adding extra (unprompted) context like "Chilham Castle".

✅ Good

These models produced accurate and relevant descriptions but were slightly less detailed or well-formatted than the "Excellent" category.

  • mlx-community/Idefics3-8B-Llama3-bf16: Provided a very good, descriptive paragraph that correctly identified "Chilham, Kent," "The Street," "15th century," and "half-timbered houses".
  • HuggingFaceTB/SmolVLM-Instruct & ...-bf16: Both gave detailed descriptions, identifying "half-timbered houses," "medieval architecture," "steeply pitched tile roofs," and "prominent brick chimneys". They also correctly identified specific cars like the "silver Ford Fiesta".
  • mlx-community/Phi-3.5-vision-instruct-bf16: Provided a single, accurate caption that correctly incorporated all key context points.

🫤 Decent

These models provided factually correct but minimal or poorly formatted output.

  • qnguyen3/nanoLLaVA: Output was brief but accurate, mentioning "15th-century houses," "steeply pitched tile roofs," "brick chimneys," and the correct timestamp.
  • meta-llama/Llama-3.2-11B-Vision-Instruct & ...-8bit: Identified "half-timbered houses" and "chimneys" but missed key context (like the date and specific location names) and used a strange bullet-point format.

❌ Poor

These models ran but produced low-quality or useless output.

  • mlx-community/paligemma2-10b-ft-docci-448-6bit & ...-bf16: Output was a very generic visual description ("a gray Ford Fiesta," "white house on the right") that completely missed the key contextual information about 15th-century, half-timbered architecture.
  • mlx-community/llava-v1.6-mistral-7b-8bit: The entire output was just: "The image is a photograph."

jrp2014 avatar Nov 14 '25 22:11 jrp2014

This is really great!

Could you include the image and prompts as well ?

Blaizzy avatar Nov 14 '25 23:11 Blaizzy

There are fuller details in the link, including prompts (html, md), image (html), and just the results in a tsv table.

jrp2014 avatar Nov 15 '25 15:11 jrp2014

I can't find the images

Blaizzy avatar Nov 17 '25 13:11 Blaizzy

An image should be embedded in the results.html file. It is not the original, as that is massive. You should be able to test with images of your own, or suggest a private spot where I can drop off an example image.

jrp2014 avatar Nov 17 '25 21:11 jrp2014

Check Models Run Summary

Date: 2025-11-22
Device: Apple M4 Max (128GB RAM)
Total Models: 32
Success Rate: 78% (25/32 passed)

🚨 Critical Failures (7 Models)

The following models failed to run. Detailed diagnostics are provided below to help resolve these issues.

1. Missing Weights (ValueError)

  • Model: microsoft/Florence-2-large-ft
  • Error: Missing 1 parameters: language_model.lm_head.weight.
  • Action: Verify model weights or loading logic for Florence-2 architecture.
View Traceback
ValueError: Missing 1 parameters: 
language_model.lm_head.weight.
  File "mlx/python/mlx/nn/layers/base.py", line 191, in load_weights
    raise ValueError(f"Missing {num_missing} parameters: \n{missing}.")
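A quick way to pinpoint discrepancies like this is to diff the checkpoint's parameter names against what the model expects — a minimal sketch (the key lists here are illustrative; a real check would walk `model.parameters()` and the safetensors index from the downloaded snapshot):

```python
def diff_weights(expected_keys, checkpoint_keys):
    """Return (missing, unexpected): params the model expects but the
    checkpoint lacks, and params the checkpoint has but the model doesn't."""
    expected, loaded = set(expected_keys), set(checkpoint_keys)
    return sorted(expected - loaded), sorted(loaded - expected)

# Illustrative keys only, mimicking the Florence-2 failure above.
missing, unexpected = diff_weights(
    ["language_model.lm_head.weight", "language_model.embed_tokens.weight"],
    ["language_model.embed_tokens.weight"],
)
```

If `missing` contains only `lm_head.weight`, the model may use tied embeddings upstream, and the MLX conversion or loading logic may simply need to re-tie (or copy) `embed_tokens.weight` into the head.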

2. Library Compatibility (ImportError)

  • Models:
    • mlx-community/Kimi-VL-A3B-Thinking-2506-bf16
    • mlx-community/Kimi-VL-A3B-Thinking-8bit
  • Error: cannot import name '_validate_images_text_input_order' from 'transformers.processing_utils'
  • Context: This suggests the transformers version installed (4.57.1) might be too new or too old for the specific processing_kimi_vl.py script downloaded from the Hub.
  • Action: Check the model card for specific transformers version requirements.
View Traceback
ImportError: cannot import name '_validate_images_text_input_order' from 'transformers.processing_utils' (/opt/homebrew/Caskroom/miniconda/base/envs/mlx-vlm/lib/python3.13/site-packages/transformers/processing_utils.py)
  File "transformers/models/auto/processing_auto.py", line 387, in from_pretrained
  File "transformers/dynamic_module_utils.py", line 616, in get_class_from_dynamic_module
  File "processing_kimi_vl.py", line 25, in <module>

3. Missing Dependencies (ImportError)

  • Models:
    • mlx-community/Molmo-7B-D-0924-8bit
    • mlx-community/Molmo-7B-D-0924-bf16
  • Error: This modeling file requires the following packages that were not found in your environment: tensorflow
  • Context: The dynamic module for Molmo explicitly checks for TensorFlow.
  • Action: Install TensorFlow (pip install tensorflow) or look for a pure PyTorch/MLX implementation.
View Traceback
ImportError: This modeling file requires the following packages that were not found in your environment: tensorflow. Run `pip install tensorflow`
  File "transformers/dynamic_module_utils.py", line 260, in check_imports

4. Out of Memory (OOM)

  • Model: mlx-community/Qwen2-VL-2B-Instruct-4bit
  • Error: [metal::malloc] Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 86586540032 bytes.
  • Context: Attempting to allocate ~135GB on a 128GB machine (limit ~86GB for single buffer). This is highly unusual for a 2B parameter model (which should take <2GB).
  • Action: This indicates a bug in the model implementation or weight loading (e.g., infinite loop in tensor expansion or incorrect shape inference).
View Traceback
ValueError: [metal::malloc] Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 86586540032 bytes.

5. Template Error

  • Model: mlx-community/gemma-3-12b-pt-8bit
  • Error: Cannot use apply_chat_template because this processor does not have a chat template.
  • Context: The model's tokenizer_config.json is missing the chat_template field.
  • Action: Define a chat template manually or use the base model generation instead of chat.
View Traceback
ValueError: Cannot use apply_chat_template because this processor does not have a chat template.
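Since gemma-3-12b-pt is a pretrained (non-chat) checkpoint, its processor ships without a chat template, so prompts have to be formatted by hand. The helper below is a hypothetical fallback, assuming the `<start_of_turn>`/`<end_of_turn>` markers used by Gemma's instruction-tuned variants; a pretrained model has never been trained on these markers, so plain-text prompting may work just as well:

```python
def format_gemma_turns(messages):
    """Render a list of {role, content} dicts with Gemma-style turn markers.
    Hypothetical fallback for a processor that lacks a chat template."""
    parts = []
    for m in messages:
        parts.append(f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model to respond
    return "".join(parts)
```

Alternatively, the template string from the instruction-tuned gemma-3 repo could be assigned to the processor before calling apply_chat_template.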

⚠️ Quality Warnings

Models that ran but produced suboptimal output:

  • Repetitive Output: microsoft/Phi-3.5-vision-instruct (Generated garbage: "# instruction # solution...")
  • Context Ignored: 7 models (e.g., SmolVLM2, Paligemma2) failed to incorporate the provided context (Hastings, UK).
  • Formatting Issues: 3 models produced broken markdown or unknown tags.

🏆 Performance Highlights

  • Fastest (TPS): prince-canuma/Florence-2-large-ft (~327 TPS)
  • Most Efficient: qnguyen3/nanoLLaVA (4.5 GB Peak Memory)
  • Slowest: meta-llama/Llama-3.2-11B-Vision-Instruct (~3.9 TPS) - Note: Significantly slower than expected for M4 Max.

📝 Recommendations

  1. Fix Dependencies: Install tensorflow to support the Molmo models, although installing it has previously caused a lock-up.
  2. Investigate Kimi-VL: Check transformers version compatibility for Kimi-VL models.
  3. Debug Qwen2-VL: The OOM on a 2B model is anomalous; check for infinite loops or massive buffer allocations in the MLX implementation.
  4. Update Florence-2: Re-download or check the conversion for microsoft/Florence-2-large-ft.

jrp2014 avatar Nov 23 '25 00:11 jrp2014

Issue Report: MLX-VLM Model Performance and Failures

Date: 2025-11-25
System: Apple M4 Max (128GB RAM, 40 GPU Cores)
OS: macOS
Python: 3.13.9
MLX Version: 0.30.1.dev20251125+c9f4dc85
MLX-VLM Version: 0.3.7
Transformers Version: 4.57.3

Summary

A test of 33 MLX-VLM models was conducted using check_models.py. While many models performed well, several critical issues were identified, ranging from dependency conflicts (tensorflow) and import errors to massive OOM crashes and model quality failures (garbage output, context ignorance).

1. Critical Failures & Bugs

A. Out-Of-Memory (OOM) Crash on 2B Model

Model: mlx-community/Qwen2-VL-2B-Instruct-4bit
Error: [metal::malloc] Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 86586540032 bytes.
Observation: A 2B-parameter model (even at 4-bit) requesting ~135GB of memory is highly anomalous. This suggests a severe bug in the model configuration, quantization, or the memory management for this specific architecture in MLX.

B. Unexpected TensorFlow Dependency

Models:

  • mlx-community/Molmo-7B-D-0924-8bit
  • mlx-community/Molmo-7B-D-0924-bf16

Error: ImportError: This modeling file requires the following packages that were not found in your environment: tensorflow. (NB: installing TensorFlow can cause the check_models.py script to lock on a mutex.)
Observation: MLX models should ideally run without a TensorFlow dependency, especially on Apple Silicon, where tensorflow-macos can be tricky or undesirable to mix with PyTorch/MLX environments. This limits the portability of these Molmo ports.

C. Transformers Library Incompatibility

Models:

  • mlx-community/Kimi-VL-A3B-Thinking-2506-bf16
  • mlx-community/Kimi-VL-A3B-Thinking-8bit

Error: ImportError: cannot import name '_validate_images_text_input_order' from 'transformers.processing_utils'
Observation: These models appear to rely on internal or deprecated transformers APIs that are not present in version 4.57.3. This indicates a need for the model maintainers to update their custom code or pin a specific transformers version.

D. Model Loading Failures (Missing Parameters)

Model: microsoft/Florence-2-large-ft
Error: ValueError: Missing 1 parameters: language_model.lm_head.weight.
Observation: The weights for the LM head appear to be missing or misnamed in the MLX conversion of this model.

E. Missing Chat Template

Model: mlx-community/gemma-3-12b-pt-8bit
Error: ValueError: Cannot use apply_chat_template because this processor does not have a chat template.
Observation: The processor config for this model lacks a defined chat template, causing apply_chat_template to fail.

2. Model Quality & Behavior Observations

A. Garbage Output

Model: prince-canuma/Florence-2-large-ft
Output: <s><s><s><s>... (repeated indefinitely)
Observation: While this version of Florence-2 loaded (unlike the Microsoft one), it failed to generate meaningful text, producing only start-of-sentence tokens. (I thought I'd deleted the Microsoft version but it seems to get downloaded again.)

B. Repetitive Loops

Model: mlx-community/paligemma2-10b-ft-docci-448-6bit
Output: "The tiles are wet... The tiles are wet... The tiles are wet..."
Observation: The model entered a degenerate repetition loop.
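Degenerate loops like this are easy to flag automatically by measuring how few distinct word n-grams the output contains — a minimal sketch (the window size and threshold are arbitrary illustrative choices, not what check_models.py actually uses):

```python
def is_repetitive(text, n=4, max_distinct_ratio=0.2):
    """Flag text whose word n-grams are mostly repeats of a few patterns.
    Looping output reuses the same n-grams over and over, so the ratio of
    distinct n-grams to total n-grams collapses toward zero."""
    words = text.split()
    if len(words) < n * 2:
        return False  # too short to judge
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams) < max_distinct_ratio
```

A check like this could run after generation to mark outputs as "repetitive" instead of reporting them as successes.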

C. Context Ignorance

Models:

  • mlx-community/SmolVLM2-2.2B-Instruct-mlx
  • mlx-community/llava-v1.6-mistral-7b-8bit
  • mlx-community/paligemma2 variants
  • mlx-community/Llama-3.2-11B-Vision-Instruct-8bit

Observation: These models completely ignored the provided context (location: Hastings, White Rock, UK) and generated generic descriptions. Llama-3.2-11B specifically complained about not seeing the image or context ("image itself is not visible to me"), suggesting a prompt-formatting or image-encoding issue specific to that model.

D. Excessive Verbosity / "Thinking"

Model: mlx-community/Qwen3-VL-2B-Thinking-bf16
Observation: This model produced a very long "thinking" trace before the actual answer. While interesting, it was flagged as "excessively verbose" (500 tokens). Users should be aware of this behavior for "Thinking" models.

E. High Quality Performers

Models:

  • mlx-community/pixtral-12b-8bit / bf16
  • mlx-community/Phi-3.5-vision-instruct-bf16

Observation: These models successfully integrated the context ("Hastings", "White Rock") into detailed, atmospheric descriptions. Pixtral in particular provided excellent structured output (Caption, Description, Keywords), though it was flagged for "excessive bullets".

AI Recommendations

  1. Investigate Qwen2-VL-2B OOM: This is the highest priority bug given the massive memory request.
  2. Fix Molmo Dependencies: Remove the hard TensorFlow dependency if possible.
  3. Update Kimi-VL: Patch the transformers import to be compatible with modern versions.
  4. Review Florence-2 Conversions: Both the Microsoft and Prince Canuma versions are broken (one fails to load, one generates garbage).

Full results:

https://github.com/jrp2014/check_models/tree/4020c32f0c360d8023e326cde9c956588d93d207/src/output

jrp2014 avatar Nov 25 '25 17:11 jrp2014

Thanks, this last report is nicely detailed

Let me break this down into issues.

Blaizzy avatar Nov 25 '25 20:11 Blaizzy

@jrp2014 Could you open an issue for the Qwen2-VL OOM with inputs and outputs and entire traceback?

Blaizzy avatar Nov 25 '25 20:11 Blaizzy

Sure. (The full tracebacks in a variety of formats are available in the link above.)

jrp2014 avatar Nov 25 '25 20:11 jrp2014