PaliGemma 2 mix segment multiple objects
I am having trouble segmenting multiple objects with PaliGemma 2 mix ("mlx-community/paligemma2-3b-mix-448-bf16", "mlx-community/paligemma2-10b-mix-448-8bit"). I also tried using transformers directly: with the 3B model I sometimes get more than one segmented object and sometimes only one, but with mlx-vlm I can only ever get one object segmented, no matter what I try. Is there a working example, or a known issue I have missed? Thank you!
I found one working example for mlx-vlm as well, so I believe the model is just quite unstable at this task. In that successful case it returns two segmentations, but both carry the same label (the one for the second object). The image is https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg and the prompt is "segment left wheel ; right wheel\n". However, when I try to segment the wheels of another image containing a car in the same position, it fails.
> I am having trouble segmenting multiple objects when using PaliGemma 2 mix ("mlx-community/paligemma2-3b-mix-448-bf16", "mlx-community/paligemma2-10b-mix-448-8bit"). I also tried to directly use transformers and with the 3B model I sometimes get more than one segmented object, and sometimes I only get one. But with mlx-vlm I can only get one object segmented no matter what I try. Is there a working example? Or is there some known issue I have missed? Thank you!
Hey @JoeJoe1313
Thanks for bringing this up!
Could you share a reproducible example?
It would also be nice if you could share the transformers examples
Preferably with the images
Here is the mlx example that works, including plotting the masks on top of the image: https://github.com/JoeJoe1313/LLMs-Journey/blob/main/VLMs/paligemma_segmentation_mlx.py. The prompt is "segment left wheel ; right wheel". I don't have a transformers example; I copied directly from the documentation there, but since the car image worked with both transformers and mlx-vlm I assumed it's just a model issue. This image https://big-vision-paligemma-hf.hf.space/file=/tmp/gradio/d834f0b8126a6b8422136f3b7b1403d98a2da507/cats.png prompted with "segment cat in front ; cat in back" returns only one object.
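For debugging cases like this, it can help to parse the raw decoded text and count how many objects the model actually emitted, rather than relying on the plotting code. Below is a minimal sketch of such a parser, assuming the standard PaliGemma segmentation output format: each object is four `<locXXXX>` tokens (box corners as y_min, x_min, y_max, x_max on a 0-1023 grid) followed by sixteen `<segXXX>` mask codes and a text label, with objects separated by `;`. The function name `parse_segmentation` and the example string are my own, not from mlx-vlm or transformers.

```python
import re

# PaliGemma encodes one segmented object as:
#   <locYYYY><locXXXX><locYYYY><locXXXX> + 16 <segNNN> tokens + " label"
# with ";" separating objects in the decoded output.
_LOC = re.compile(r"<loc(\d{4})>")
_SEG = re.compile(r"<seg(\d{3})>")

def parse_segmentation(text, img_w=448, img_h=448):
    """Split decoded model output into per-object dicts: pixel box, seg codes, label."""
    objects = []
    for chunk in text.split(";"):
        locs = [int(m) for m in _LOC.findall(chunk)]
        segs = [int(m) for m in _SEG.findall(chunk)]
        if len(locs) < 4:
            continue  # chunk has no box -> nothing to segment
        ymin, xmin, ymax, xmax = locs[:4]
        # loc tokens are normalized to a 0-1023 grid; rescale to pixel coords
        box = (
            ymin / 1024 * img_h, xmin / 1024 * img_w,
            ymax / 1024 * img_h, xmax / 1024 * img_w,
        )
        # the label is whatever text remains after stripping the special tokens
        label = _SEG.sub("", _LOC.sub("", chunk)).strip()
        objects.append({"box": box, "seg_codes": segs, "label": label})
    return objects

# Hypothetical two-object output, shaped like the "left wheel ; right wheel" case
out = ("<loc0512><loc0102><loc0921><loc0410>" + "<seg042>" * 16 + " left wheel ; "
       "<loc0512><loc0614><loc0921><loc0922>" + "<seg007>" * 16 + " right wheel")
parsed = parse_segmentation(out)
print(len(parsed), [o["label"] for o in parsed])  # 2 ['left wheel', 'right wheel']
```

Running this on the raw text from both backends would show whether mlx-vlm stops generating after the first object or whether the second object is emitted but dropped (or mislabeled) later in the pipeline.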
From what I understand, these models are very sensitive to prompt formatting, and the 448-3B-bf16 and 448-10B-8bit variants seem simply not powerful enough to segment multiple objects reliably. Please correct me if you have other observations.
Yes, the problem with models like this, and with some OCR models like DeepSeek-OCR, is that the prompt matters a lot.
And for such tasks it's best to use bf16 or fp16; quantized models struggle with fine details.