mlx-vlm
Inconsistent OCR results with Idefics2 model in mlx_vlm compared to other environments
I am currently evaluating the OCR capabilities of the Idefics2 model, specifically for extracting text from comic book speech balloons.
The model performs as expected on various platforms including my local Linux environment, HF Playground, and Google Colab across different hardware configurations. However, when using the mlx_vlm implementation, the results are inconsistent and generally nonsensical.
- mlx_vlm version: "0.0.6", dev install
- Model Used: mlx-community/idefics2-8b-4bit (also tested with 8bit)
- Code Snippet:

```python
from mlx_vlm import load, generate

model, processor = load("mlx-community/idefics2-8b-4bit")

prompt_text_tmpl = (
    "Perform optical character recognition OCR on this image, which contains "
    "speech balloons from a comic book. The text is in English."
)
resulting_messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
]
prompt = processor.apply_chat_template(resulting_messages, add_generation_prompt=True)

# image is a PIL.Image with the speech-balloon crop
output = generate(model, processor, image, prompt, temp=0.4, max_tokens=512, top_p=0.8, verbose=True)
```
The expected output should closely match the results from other environments, such as:
["THE ECHO OF THE OLD MAN'S FOOTSTEPS FADES DOWN THE HALL AS ..."]
I can share the code used to generate that result, but it closely follows the code from the HF Idefics2 model card.
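For reference, the transformers-side pipeline is essentially the model-card pattern. The sketch below is reconstructed from that pattern rather than copied from my actual script, so the dtype/device handling, the image path, and `max_new_tokens=512` are illustrative:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Reference pipeline following the HF Idefics2 model card (dtype/device are illustrative).
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16
).to("cuda")

# Illustrative path; in practice this is the same speech-balloon crop used above.
image = Image.open("balloon.png").convert("RGB")

prompt_text_tmpl = (
    "Perform optical character recognition OCR on this image, which contains "
    "speech balloons from a comic book. The text is in English."
)
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```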
The output from mlx_vlm is significantly different and less accurate:
==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x537C49490>
Prompt: User:<image>Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English.<end_of_utterance>
Assistant:
The text consists of a single word: "down".<end_of_utterance>
==========
Prompt: 76.531 tokens-per-sec
Generation: 49.216 tokens-per-sec
Additional Information
- The issue persists across different quantizations.
- Similar tests with the llava-1.5-7b model in the HF, my Linux rig, and mlx_vlm environments show consistent and more accurate results:
==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x174971490>
Prompt: <s>[INST] <image>
Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English. [/INST]
</s>
The echo of the old man's footsteps fades down the hall.
==========
Prompt: 16.869 tokens-per-sec
Generation: 8.546 tokens-per-sec
Could you please investigate why the mlx_vlm Idefics2 model yields such different results compared to other environments? As far as I can tell, the inputs generated by the processor are the same in transformers and mlx_vlm. The issue might be in the generation process or detokenization, but I am unsure due to the complexity of the transformers code and my limited familiarity with mlx_vlm.
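For context, the kind of check I mean looks roughly like the sketch below. It only compares the chat template and the text token ids (the pixel_values check isn't shown); the repo ids are the ones mentioned above, and the exact code in my notebook differs:

```python
from transformers import AutoProcessor
from mlx_vlm import load

# HF processor vs. the processor returned by mlx_vlm's load().
hf_processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
_, mlx_processor = load("mlx-community/idefics2-8b-4bit")

prompt_text_tmpl = (
    "Perform optical character recognition OCR on this image, which contains "
    "speech balloons from a comic book. The text is in English."
)
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
]

hf_prompt = hf_processor.apply_chat_template(messages, add_generation_prompt=True)
mlx_prompt = mlx_processor.apply_chat_template(messages, add_generation_prompt=True)

# Same rendered chat template and same token ids for the text part.
print(hf_prompt == mlx_prompt)
print(hf_processor.tokenizer(hf_prompt).input_ids == mlx_processor.tokenizer(mlx_prompt).input_ids)
```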
Hey @civvic,
Thanks for bringing this issue up!
I will look into it.
@civvic it's fixed ✅
Just update to release v0.0.7:
pip install -U mlx-vlm
Let me know if you face any other issues :)
Works great, thanks! That was quick!
Now on to Paligemma. I'm also interested in Phi-3 V; maybe now is the right time to try to decipher the transformers spaghetti and get started with mlx.
Most welcome!
If you want to understand transformers code better you can check my video series on YT.
It's about Llama-2 arch but it generalizes :)
https://youtube.com/playlist?list=PLDn_JsyofyfQp4td_ub6LfIg5vxyu6YJK&si=XgpOFYIC20UDKgHz
Paligemma works as well.
Regarding Phi-3 Vision, check out #28.
Ah, I've already been tuning into your channel! 😄 The new Llama 3 series looks very interesting. My struggle, though, isn't so much with the Transformer architecture itself as with navigating the labyrinth of HuggingFace's transformers code. I stumbled upon their blog post explaining the 'benefits' of spaghetti code, and let's just say, it all makes a bit more sense now why things are the way they are! 🍝