mlx-vlm
Inconsistent OCR results with Idefics2 model in mlx_vlm compared to other environments
I am currently evaluating the OCR capabilities of the Idefics2 model, specifically for extracting text from comic book speech balloons.
The model performs as expected on various platforms including my local Linux environment, HF Playground, and Google Colab across different hardware configurations. However, when using the mlx_vlm implementation, the results are inconsistent and generally nonsensical.
- mlx_vlm version: "0.0.6", dev install
- Model Used: mlx-community/idefics2-8b-4bit (also tested with 8bit)
- Code Snippet:

```python
from mlx_vlm import load, generate

model, processor = load("mlx-community/idefics2-8b-4bit")

prompt_text_tmpl = (
    "Perform optical character recognition OCR on this image, which contains "
    "speech balloons from a comic book. The text is in English."
)
resulting_messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
]
prompt = processor.apply_chat_template(resulting_messages, add_generation_prompt=True)

# image is a PIL.Image with the speech-balloon crop
output = generate(model, processor, image, prompt, temp=0.4, max_tokens=512, top_p=0.8, verbose=True)
```
The expected output should closely match the results from other environments, such as:
["THE ECHO OF THE OLD MAN'S FOOTSTEPS FADES DOWN THE HALL AS ..."]
I can share the code used to generate that result, but it closely follows the code from the HF Idefics2 model card.
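For reference, the transformers-side pipeline is essentially the model-card pattern. The sketch below is reconstructed from that pattern rather than copied from my actual script, so the dtype/device handling, the image path, and `max_new_tokens=512` are illustrative:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Reference pipeline following the HF Idefics2 model card (dtype/device are illustrative).
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16
).to("cuda")

# Illustrative path; in practice this is the same speech-balloon crop used above.
image = Image.open("balloon.png").convert("RGB")

prompt_text_tmpl = (
    "Perform optical character recognition OCR on this image, which contains "
    "speech balloons from a comic book. The text is in English."
)
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```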
The output from mlx_vlm is significantly different and less accurate:
==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x537C49490>
Prompt: User:<image>Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English.<end_of_utterance>
Assistant:
The text consists of a single word: "down".<end_of_utterance>
==========
Prompt: 76.531 tokens-per-sec
Generation: 49.216 tokens-per-sec
Additional Information
- The issue persists across different quantizations.
- Similar tests with the llava-1.5-7b model in the HF, my Linux rig, and mlx_vlm environments show consistent and more accurate results:
==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x174971490>
Prompt: <s>[INST] <image>
Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English. [/INST]
</s>
The echo of the old man's footsteps fades down the hall.
==========
Prompt: 16.869 tokens-per-sec
Generation: 8.546 tokens-per-sec
Could you please investigate why the mlx_vlm Idefics2 model yields such different results compared to other environments? As far as I can tell, the inputs generated by the processor are the same in transformers and mlx_vlm. The issue might be in the generation process or detokenization, but I am unsure due to the complexity of the transformers code and my limited familiarity with mlx_vlm.
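For context, the kind of check I mean looks roughly like the sketch below. It only compares the chat template and the text token ids (the pixel_values check isn't shown); the repo ids are the ones mentioned above, and the exact code in my notebook differs:

```python
from transformers import AutoProcessor
from mlx_vlm import load

# HF processor vs. the processor returned by mlx_vlm's load().
hf_processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
_, mlx_processor = load("mlx-community/idefics2-8b-4bit")

prompt_text_tmpl = (
    "Perform optical character recognition OCR on this image, which contains "
    "speech balloons from a comic book. The text is in English."
)
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
]

hf_prompt = hf_processor.apply_chat_template(messages, add_generation_prompt=True)
mlx_prompt = mlx_processor.apply_chat_template(messages, add_generation_prompt=True)

# Same rendered chat template and same token ids for the text part.
print(hf_prompt == mlx_prompt)
print(hf_processor.tokenizer(hf_prompt).input_ids == mlx_processor.tokenizer(mlx_prompt).input_ids)
```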
Hey @civvic,
Thanks for bringing this issue up!
I will look into it.
@civvic it's fixed ✅
Just update to release v0.0.7:
pip install -U mlx-vlm
Let me know if you face any other issues :)
Works great, thanks! That was quick!
Now on to Paligemma. I'm also interested in Phi-3 V; maybe now is the right time to try to decipher the transformers spaghetti and get started with mlx.
Most welcome!
If you want to understand transformers code better you can check my video series on YT.
It's about Llama-2 arch but it generalizes :)
https://youtube.com/playlist?list=PLDn_JsyofyfQp4td_ub6LfIg5vxyu6YJK&si=XgpOFYIC20UDKgHz
Paligemma works as well.
Regarding Phi-3 Vision, check out #28.
Ah, I've already been tuning into your channel! 😄 The new Llama 3 series looks very interesting. My struggle, though, isn't so much with the Transformer architecture itself as with navigating the labyrinth of HuggingFace's transformers code. I stumbled upon their blog post explaining the 'benefits' of spaghetti code, and let's just say, it all makes a bit more sense now why things are the way they are! 🍝