mlx-vlm

PaliGemma 2 mix segment multiple objects

Open JoeJoe1313 opened this issue 10 months ago • 5 comments

I am having trouble segmenting multiple objects when using PaliGemma 2 mix ("mlx-community/paligemma2-3b-mix-448-bf16", "mlx-community/paligemma2-10b-mix-448-8bit"). I also tried using transformers directly: with the 3B model I sometimes get more than one segmented object and sometimes only one. But with mlx-vlm I can only ever get one object segmented, no matter what I try. Is there a working example? Or is there a known issue I have missed? Thank you!

JoeJoe1313 avatar Apr 09 '25 13:04 JoeJoe1313

I found one working example for mlx-vlm as well; I believe the model is just quite unstable on this task. In this successful case it returns two segmentations, but both carry the same label, that of the second object. The image is https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg and the prompt is "segment left wheel ; right wheel\n". However, when I attempt to segment the wheels in another image of a car (in the same position), it fails.
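For reference, the prompt shape used above (the "segment" task prefix, objects separated by " ; ", and a trailing newline) can be captured in a tiny helper. This is just a convenience sketch based on the prompts in this thread, not an official mlx-vlm or PaliGemma API:

```python
def build_segment_prompt(*objects: str) -> str:
    """Build a PaliGemma-mix multi-object segmentation prompt.

    Hypothetical helper: the mix checkpoints expect a single "segment"
    task prefix with objects separated by " ; " and a trailing newline,
    matching the prompts used in this thread.
    """
    return "segment " + " ; ".join(objects) + "\n"

print(build_segment_prompt("left wheel", "right wheel"))
```

Even with a well-formed prompt, the smaller/quantized checkpoints may still return fewer objects than requested, as described above.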

JoeJoe1313 avatar Apr 10 '25 10:04 JoeJoe1313

Hey @JoeJoe1313

Thanks for bringing this up!

Could you share a reproducible example?

Blaizzy avatar Apr 10 '25 11:04 Blaizzy

If you could share the transformers examples as well, that would be nice.

Preferably with the images

Blaizzy avatar Apr 10 '25 11:04 Blaizzy

Here is the mlx example which is working, including plotting the masks on top of the images: https://github.com/JoeJoe1313/LLMs-Journey/blob/main/VLMs/paligemma_segmentation_mlx.py. The prompt is "segment left wheel ; right wheel". I don't have the transformers example; I copied it directly from the documentation there. Since the car image worked for both transformers and mlx-vlm, I assumed it's just a model issue. This image https://big-vision-paligemma-hf.hf.space/file=/tmp/gradio/d834f0b8126a6b8422136f3b7b1403d98a2da507/cats.png prompted with "segment cat in front ; cat in back" returns only one object.
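To see how many objects the model actually emitted, it helps to parse the raw output. PaliGemma encodes each segmented object as 4 `<locXXXX>` box tokens (y1, x1, y2, x2 on a 0-1023 grid), 16 `<segXXX>` mask-codebook tokens, and a text label, with objects separated by ";". A minimal parser sketch, assuming that token format (the exact coordinate scaling convention may differ slightly from big_vision's):

```python
import re

def parse_segmentation(output: str):
    """Split PaliGemma segmentation output into per-object dicts.

    Each object: 4 <locXXXX> box tokens (coords binned to 0-1023),
    16 <segXXX> mask tokens (VQ codebook indices), then a label.
    Objects are separated by ";".
    """
    objects = []
    for chunk in output.split(";"):
        locs = [int(v) for v in re.findall(r"<loc(\d{4})>", chunk)]
        segs = [int(v) for v in re.findall(r"<seg(\d{3})>", chunk)]
        label = re.sub(r"<loc\d{4}>|<seg\d{3}>", "", chunk).strip()
        if len(locs) == 4:  # skip malformed/empty chunks
            y1, x1, y2, x2 = (v / 1024 for v in locs)  # normalized coords
            objects.append({"box": (y1, x1, y2, x2), "seg": segs, "label": label})
    return objects
```

Checking `len(parse_segmentation(output))` against the number of objects in the prompt makes it easy to tell whether the model dropped an object or, as in the case above, emitted two segmentations with the same label.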

JoeJoe1313 avatar Apr 11 '25 16:04 JoeJoe1313

From what I understand, these models are very sensitive to prompt formatting, and the 448-3B-bf16 and 448-10B-8bit variants just don't seem powerful enough to segment multiple objects. Please correct me if you have other observations.

JoeJoe1313 avatar Apr 11 '25 16:04 JoeJoe1313

Yes, the problem with models like this, and with some OCR models like DeepSeek-OCR, is that the prompt really matters.

And for such tasks it's best to use bf16 or fp16; quantized models struggle with fine details.

Blaizzy avatar Nov 10 '25 17:11 Blaizzy