Model Performance & Compatibility Report (2025-11-14)
This report summarizes the results of a test run on 31 vision-language models using mlx-vlm. The test involved processing a single image with a detailed prompt. Full results / diagnostics can be found at https://github.com/jrp2014/scripts/tree/main/src/output
Of the 31 models tested, 5 models failed to load or run, and several others produced unusable (gibberish, repetitive, or instruction-failed) output. This suggests potential compatibility issues with mlx-vlm, mlx, or transformers.
1. System & Library Versions
| Component | Version |
|---|---|
| OS | Darwin 25.1.0 (macOS 26.1) |
| Chip | Apple M4 Max (16 physical cores, 40 GPU cores) |
| Python | 3.13.7 |
| mlx | 0.29.5 |
| mlx-lm | 0.28.4 |
| mlx-vlm | 0.3.5 |
| transformers | 4.57.1 |
| huggingface-hub | 0.36.0 |
| Pillow | 12.0.0 |
2. (a) Failures & Potential Bug Report
🔴 Category 1: Hard Failures (Load/Runtime Errors)
Five models failed to run entirely.
1.1: ImportError (Transformers Incompatibility)
These models failed during the loading of the processor, suggesting an incompatibility with transformers==4.57.1. The models' remote code attempts to import _validate_images_text_input_order from transformers.processing_utils, which appears to be unavailable.
- mlx-community/Kimi-VL-A3B-Thinking-2506-bf16
- mlx-community/Kimi-VL-A3B-Thinking-8bit
Error Log:
ImportError: cannot import name '_validate_images_text_input_order' from 'transformers.processing_utils' (.../transformers/processing_utils.py)
1.2: ImportError (TensorFlow Dependency)
These models require tensorflow, which is not part of the MLX environment, causing the processor loading to fail.
- mlx-community/Molmo-7B-D-0924-8bit
- mlx-community/Molmo-7B-D-0924-bf16
Error Log:
ImportError: This modeling file requires the following packages that were not found in your environment: tensorflow. Run `pip install tensorflow`
1.3: RuntimeError (Metal Malloc Failure)
This model failed during generation with a Metal error, attempting to allocate an impossibly large buffer (137GB). This points to a potential issue in mlx or mlx-vlm when handling this model's architecture.
- mlx-community/Qwen2-VL-2B-Instruct-4bit
Error Log:
RuntimeError: [metal::malloc] Attempting to allocate 137220936192 bytes which is greater than the maximum allowed buffer size of 86586540032 bytes.
🟡 Category 2: Generation & Output Issues (Potential Bugs)
Several models ran but produced problematic output, suggesting issues with instruction following, repetition handling, or eos token logic in mlx-vlm.
- Gibberish/Repetition:
  - mlx-community/Florence-2-large-ft-8bit: Output was a stream of `<s>` tokens.
  - prince-canuma/Florence-2-large-ft: Same as above, output was a stream of `<s>` tokens.
  - mlx-community/deepseek-vl2-8bit: Output was repetitive gibberish: "And thetheimage, and theyoucan have a lot of theimg...".
- Prompt Repetition:
  - mlx-community/paligemma2-3b-pt-896-4bit: The model's entire output was just a repetition of the prompt.
- Failed Instruction-Following (Hallucination/Debugging Output):
  - microsoft/Phi-3.5-vision-instruct: Produced a correct caption, but then hallucinated an unrelated Q&A about math combinations.
  - mlx-community/Apriel-1.5-15b-Thinker-6bit-MLX: Did not provide the caption, but instead output its reasoning steps for how it would create the caption.
  - mlx-community/Qwen3-VL-2B-Thinking-bf16: Same as Apriel, it output its internal reasoning monologue instead of the final requested items.
3. (b) Successful Model Output Quality
This section assesses the quality of the models that produced usable, descriptive output based on the image and the provided context.
⭐ Excellent
These models successfully integrated the visual content with the provided context (Chilham, 15th-century, etc.) and formatted the output correctly as requested (Caption, Description, Keywords).
- mlx-community/InternVL3-14B-8bit: Provided a perfect caption, description, and keyword list, correctly identifying all key contextual elements.
- mlx-community/gemma-3-27b-it-qat-4bit & ...-8bit: Both produced excellent, context-aware captions, descriptions, and extremely comprehensive keyword lists.
- mlx-community/pixtral-12b-8bit & ...-bf16: Both provided high-quality, well-formatted captions, descriptions, and keywords, correctly using the context.
- mlx-community/gemma-3n-E4B-it-bf16: Provided excellent, distinct "Visual Description" and "Contextual Information" sections, correctly identifying "15th-century half-timbered houses" and even adding extra (unprompted) context like "Chilham Castle".
✅ Good
These models produced accurate and relevant descriptions but were slightly less detailed or well-formatted than the "Excellent" category.
- mlx-community/Idefics3-8B-Llama3-bf16: Provided a very good, descriptive paragraph that correctly identified "Chilham, Kent," "The Street," "15th century," and "half-timbered houses".
- HuggingFaceTB/SmolVLM-Instruct & ...-bf16: Both gave detailed descriptions, identifying "half-timbered houses," "medieval architecture," "steeply pitched tile roofs," and "prominent brick chimneys". They also correctly identified specific cars like the "silver Ford Fiesta".
- mlx-community/Phi-3.5-vision-instruct-bf16: Provided a single, accurate caption that correctly incorporated all key context points.
🫤 Decent
These models provided factually correct but minimal or poorly formatted output.
- qnguyen3/nanoLLaVA: Output was brief but accurate, mentioning "15th-century houses," "steeply pitched tile roofs," "brick chimneys," and the correct timestamp.
- meta-llama/Llama-3.2-11B-Vision-Instruct & ...-8bit: Identified "half-timbered houses" and "chimneys" but missed key context (like the date and specific location names) and used a strange bullet-point format.
❌ Poor
These models ran but produced low-quality or useless output.
- mlx-community/paligemma2-10b-ft-docci-448-6bit & ...-bf16: Output was a very generic visual description ("a gray Ford Fiesta," "white house on the right") that completely missed the key contextual information about the 15th-century, half-timbered architecture.
- mlx-community/llava-v1.6-mistral-7b-8bit: The entire output was just: "The image is a photograph."
This is really great!
Could you include the image and prompts as well ?
There are fuller details in the link, including prompts (html, md), image (html), and just the results in a tsv table.
I can't find the images
An image should be embedded in the results.html file. It is not the original, as that is massive. You should be able to test with images of your own, or suggest a private spot where I can drop off an example image.
Check Models Run Summary
Date: 2025-11-22
Device: Apple M4 Max (128GB RAM)
Total Models: 32
Success Rate: 78% (25/32 passed)
🚨 Critical Failures (7 Models)
The following models failed to run. Detailed diagnostics are provided below to help resolve these issues.
1. Missing Weights (ValueError)
- Model: microsoft/Florence-2-large-ft
- Error: Missing 1 parameters: language_model.lm_head.weight.
- Action: Verify model weights or loading logic for the Florence-2 architecture.

View Traceback
ValueError: Missing 1 parameters: language_model.lm_head.weight.
File "mlx/python/mlx/nn/layers/base.py", line 191, in load_weights
raise ValueError(f"Missing {num_missing} parameters: \n{missing}.")
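A quick way to tell whether the LM head weight is truly absent or merely misnamed in the checkpoint is to diff the parameter names the model expects against the keys the weight file provides. A minimal sketch; the parameter names below are illustrative, not taken from the actual Florence-2 conversion:

```python
# Hypothetical diagnostic: compare expected vs. provided parameter names to
# spot missing or misnamed weights (e.g. a tied or renamed lm_head).

def diff_weights(expected, provided):
    """Return (missing, unexpected) parameter-name lists, both sorted."""
    expected, provided = set(expected), set(provided)
    return sorted(expected - provided), sorted(provided - expected)

# Illustrative names only -- in practice, collect `expected` from the MLX
# model's parameters and `provided` from the safetensors file's keys.
missing, unexpected = diff_weights(
    expected={"language_model.lm_head.weight", "vision_tower.proj.weight"},
    provided={"vision_tower.proj.weight", "language_model.model.embed_tokens.weight"},
)
print("missing:", missing)        # names the model needs but the file lacks
print("unexpected:", unexpected)  # extra keys, often a renamed/tied head
```

If `lm_head.weight` turns up under `unexpected` with a different prefix, the fix is a key-renaming pass rather than a re-download.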
2. Library Compatibility (ImportError)
- Models:
  - mlx-community/Kimi-VL-A3B-Thinking-2506-bf16
  - mlx-community/Kimi-VL-A3B-Thinking-8bit
- Error: cannot import name '_validate_images_text_input_order' from 'transformers.processing_utils'
- Context: This suggests the transformers version installed (4.57.1) might be too new or too old for the specific processing_kimi_vl.py script downloaded from the Hub.
- Action: Check the model card for specific transformers version requirements.
View Traceback
ImportError: cannot import name '_validate_images_text_input_order' from 'transformers.processing_utils' (/opt/homebrew/Caskroom/miniconda/base/envs/mlx-vlm/lib/python3.13/site-packages/transformers/processing_utils.py)
File "transformers/models/auto/processing_auto.py", line 387, in from_pretrained
File "transformers/dynamic_module_utils.py", line 616, in get_class_from_dynamic_module
File "processing_kimi_vl.py", line 25, in <module>
3. Missing Dependencies (ImportError)
- Models:
  - mlx-community/Molmo-7B-D-0924-8bit
  - mlx-community/Molmo-7B-D-0924-bf16
- Error: This modeling file requires the following packages that were not found in your environment: tensorflow
- Context: The dynamic module for Molmo explicitly checks for TensorFlow.
- Action: Install TensorFlow (pip install tensorflow) or look for a pure PyTorch/MLX implementation.
View Traceback
ImportError: This modeling file requires the following packages that were not found in your environment: tensorflow. Run `pip install tensorflow`
File "transformers/dynamic_module_utils.py", line 260, in check_imports
4. Out of Memory (OOM)
- Model: mlx-community/Qwen2-VL-2B-Instruct-4bit
- Error: [metal::malloc] Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 86586540032 bytes.
- Context: Attempting to allocate ~135GB on a 128GB machine (the single-buffer limit is ~86GB). This is highly unusual for a 2B parameter model (which should take <2GB).
- Action: This indicates a bug in the model implementation or weight loading (e.g., an infinite loop in tensor expansion or incorrect shape inference).
View Traceback
ValueError: [metal::malloc] Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 86586540032 bytes.
5. Template Error
- Model: mlx-community/gemma-3-12b-pt-8bit
- Error: Cannot use apply_chat_template because this processor does not have a chat template.
- Context: The model's tokenizer_config.json is missing the chat_template field.
- Action: Define a chat template manually or use base-model generation instead of chat.
View Traceback
ValueError: Cannot use apply_chat_template because this processor does not have a chat template.
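For context, a chat template is just a mapping from a message list to a prompt string, so defining one manually is straightforward. A minimal Python sketch assuming Gemma-style turn markers; the exact markers are an assumption here and should be copied from an instruct variant's tokenizer_config.json rather than this example:

```python
# Hedged sketch of what apply_chat_template produces, using assumed
# Gemma-style turn markers. Not the model's real template.

def apply_simple_chat_template(messages):
    """Render a list of {'role', 'content'} dicts into a single prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # generation prompt for the reply
    return "".join(parts)

prompt = apply_simple_chat_template(
    [{"role": "user", "content": "Describe this image."}]
)
```

In practice one would assign the equivalent Jinja template string to the processor's `chat_template` attribute, or fall back to plain (non-chat) generation for this `-pt` base model.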
⚠️ Quality Warnings
Models that ran but produced suboptimal output:
- Repetitive Output: microsoft/Phi-3.5-vision-instruct (generated garbage: "# instruction # solution...")
- Context Ignored: 7 models (e.g., SmolVLM2, Paligemma2) failed to incorporate the provided context (Hastings, UK).
- Formatting Issues: 3 models produced broken markdown or unknown tags.
🏆 Performance Highlights
- Fastest (TPS): prince-canuma/Florence-2-large-ft (~327 TPS)
- Most Efficient: qnguyen3/nanoLLaVA (4.5 GB peak memory)
- Slowest: meta-llama/Llama-3.2-11B-Vision-Instruct (~3.9 TPS). Note: significantly slower than expected for an M4 Max.
📝 Recommendations
- Fix Dependencies: Install tensorflow to support the Molmo models, although this has previously caused a lock-up.
- Investigate Kimi-VL: Check transformers version compatibility for the Kimi-VL models.
- Debug Qwen2-VL: The OOM on a 2B model is anomalous; check for infinite loops or massive buffer allocations in the MLX implementation.
- Update Florence-2: Re-download or check the conversion for microsoft/Florence-2-large-ft.
Issue Report: MLX-VLM Model Performance and Failures
Date: 2025-11-25
System: Apple M4 Max (128GB RAM, 40 GPU Cores)
OS: macOS
Python: 3.13.9
MLX Version: 0.30.1.dev20251125+c9f4dc85
MLX-VLM Version: 0.3.7
Transformers Version: 4.57.3
Summary
A test of 33 MLX-VLM models was conducted using check_models.py. While many models performed well, several critical issues were identified, ranging from dependency conflicts (tensorflow) and import errors to massive OOM crashes and model quality failures (garbage output, context ignorance).
1. Critical Failures & Bugs
A. Out-Of-Memory (OOM) Crash on 2B Model
Model: mlx-community/Qwen2-VL-2B-Instruct-4bit
Error: [metal::malloc] Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 86586540032 bytes.
Observation: A 2B parameter model (even at 4-bit) requesting ~135GB of memory is highly anomalous. This suggests a severe bug in the model configuration, quantization, or the memory management for this specific architecture in MLX.
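A back-of-envelope check makes the anomaly concrete: assuming roughly 0.5 bytes per parameter for 4-bit weights (quantization scales and fp16 embeddings add some overhead, ignored here), the requested buffer is more than 100x the expected footprint.

```python
# Sanity check on the OOM numbers. Assumption: ~2e9 parameters at 4-bit
# quantization, i.e. ~0.5 bytes per parameter for the weight tensors.
params = 2_000_000_000
expected_bytes = params * 0.5          # ~1 GB for the quantized weights
requested_bytes = 135_383_101_952      # from the Metal error message

ratio = requested_bytes / expected_bytes
print(f"expected ~{expected_bytes / 2**30:.1f} GiB, "
      f"requested ~{requested_bytes / 2**30:.1f} GiB ({ratio:.0f}x)")
```

A discrepancy of that magnitude is far beyond any activation or KV-cache overhead, which is why a shape-inference or tensor-expansion bug is the likelier explanation.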
B. Unexpected TensorFlow Dependency
Models:
- mlx-community/Molmo-7B-D-0924-8bit
- mlx-community/Molmo-7B-D-0924-bf16

Error: ImportError: This modeling file requires the following packages that were not found in your environment: tensorflow. (NB: installing TensorFlow can cause the check_models.py script to lock on a mutex.)
Observation: MLX models should ideally run without a TensorFlow dependency, especially on Apple Silicon, where tensorflow-macos can be tricky or undesirable to mix with PyTorch/MLX environments. This limits the portability of these Molmo ports.
C. Transformers Library Incompatibility
Models:
- mlx-community/Kimi-VL-A3B-Thinking-2506-bf16
- mlx-community/Kimi-VL-A3B-Thinking-8bit

Error: ImportError: cannot import name '_validate_images_text_input_order' from 'transformers.processing_utils'
Observation: These models appear to rely on internal or deprecated transformers APIs that are not present in version 4.57.3. This indicates a need for the model maintainers to update their custom code or to pin a specific transformers version.
D. Model Loading Failures (Missing Parameters)
Model: microsoft/Florence-2-large-ft
Error: ValueError: Missing 1 parameters: language_model.lm_head.weight.
Observation: The weights for the LM head appear to be missing or misnamed in the MLX conversion of this model.
E. Missing Chat Template
Model: mlx-community/gemma-3-12b-pt-8bit
Error: ValueError: Cannot use apply_chat_template because this processor does not have a chat template.
Observation: The processor config for this model lacks a defined chat template, causing apply_chat_template to fail.
2. Model Quality & Behavior Observations
A. Garbage Output
Model: prince-canuma/Florence-2-large-ft
Output: <s><s><s><s>... (repeated indefinitely)
Observation: While this version of Florence-2 loaded (unlike the Microsoft one), it failed to generate meaningful text, producing only start-of-sentence tokens. (I thought I'd deleted the Microsoft version but it seems to get downloaded again.)
B. Repetitive Loops
Model: mlx-community/paligemma2-10b-ft-docci-448-6bit
Output: "The tiles are wet... The tiles are wet... The tiles are wet..."
Observation: The model entered a degenerate repetition loop.
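Loops like this can be flagged mechanically, e.g. by measuring the share of distinct n-grams in the output. A minimal sketch; the n-gram size and threshold are arbitrary choices for illustration, not values from check_models.py:

```python
# Hedged sketch: flag degenerate repetition (e.g. "The tiles are wet..."
# repeated verbatim) via the fraction of unique word n-grams.

def is_degenerate(text, ngram=4, threshold=0.5):
    """Return True when fewer than `threshold` of the n-grams are distinct."""
    words = text.split()
    if len(words) < ngram * 2:
        return False  # too short to judge
    grams = [tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    return len(set(grams)) / len(grams) < threshold

print(is_degenerate("The tiles are wet. " * 20))  # True
```

A check along these lines could let the harness label runs as "degenerate" automatically instead of relying on manual inspection.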
C. Context Ignorance
Models:
- mlx-community/SmolVLM2-2.2B-Instruct-mlx
- mlx-community/llava-v1.6-mistral-7b-8bit
- mlx-community/paligemma2 variants
- mlx-community/Llama-3.2-11B-Vision-Instruct-8bit

Observation: These models completely ignored the provided context (location: Hastings, White Rock, UK) and generated generic descriptions. Llama-3.2-11B specifically complained about not seeing the image or context ("image itself is not visible to me"), suggesting a prompt formatting or image encoding issue specific to that model.
D. Excessive Verbosity / "Thinking"
Model: mlx-community/Qwen3-VL-2B-Thinking-bf16
Observation: This model produced a very long "thinking" trace before the actual answer. While interesting, it was flagged as "excessively verbose" (500 tokens). Users should be aware of this behavior for "Thinking" models.
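Where only the final answer is wanted, the reasoning trace can be stripped after generation. A sketch that assumes the monologue is wrapped in `<think>...</think>` tags; this is a common convention for "Thinking" models, but the actual markers for this model should be verified:

```python
import re

# Hedged sketch: remove a <think>...</think> reasoning trace so only the
# final answer remains. Assumes these tags; other models use other markers.

def strip_thinking(text):
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

out = strip_thinking(
    "<think>Let me look at the image first...</think>\n"
    "Caption: A wet street in Hastings."
)
print(out)  # Caption: A wet street in Hastings.
```

The non-greedy match plus `re.DOTALL` keeps a multi-line trace from swallowing the answer that follows it.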
E. High Quality Performers
Models:
- mlx-community/pixtral-12b-8bit/bf16
- mlx-community/Phi-3.5-vision-instruct-bf16

Observation: These models successfully integrated the context ("Hastings", "White Rock") into detailed, atmospheric descriptions. Pixtral in particular provided excellent structured output (Caption, Description, Keywords), though it was flagged for "excessive bullets".
AI Recommendations
- Investigate Qwen2-VL-2B OOM: This is the highest-priority bug given the massive memory request.
- Fix Molmo Dependencies: Remove the hard TensorFlow dependency if possible.
- Update Kimi-VL: Patch the transformers import to be compatible with modern versions.
- Review Florence-2 Conversions: Both the Microsoft and Prince Canuma versions are broken (one fails to load, one generates garbage).
Full results:
https://github.com/jrp2014/check_models/tree/4020c32f0c360d8023e326cde9c956588d93d207/src/output
Thanks, this last report is nicely detailed
Let me break this down into issues
@jrp2014 Could you open an issue for the Qwen2-VL OOM with inputs and outputs and entire traceback?
Sure. (The full tracebacks in a variety of formats are available in the link above.)