PaliGemma 2 mix segment multiple objects
I am having trouble segmenting multiple objects with PaliGemma 2 mix ("mlx-community/paligemma2-3b-mix-448-bf16", "mlx-community/paligemma2-10b-mix-448-8bit"). I also tried using transformers directly: with the 3B model I sometimes get more than one segmented object and sometimes only one, but with mlx-vlm I can only ever get one object segmented, no matter what I try. Is there a working example, or a known issue I have missed? Thank you!
I found one working example for mlx-vlm as well, so I believe the model is just quite unstable at this task. In that successful case it returns two segmentations, but both carry the same label (the one for the second object). The image is https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg and the prompt is "segment left wheel ; right wheel\n". However, when I try to segment the wheels of another image containing a car in the same position, it fails.
> I am having trouble segmenting multiple objects when using PaliGemma 2 mix ("mlx-community/paligemma2-3b-mix-448-bf16", "mlx-community/paligemma2-10b-mix-448-8bit"). I also tried to directly use transformers and with the 3B model I sometimes get more than one segmented object, and sometimes I only get one. But with mlx-vlm I can only get one object segmented no matter what I try. Is there a working example? Or is there some known issue I have missed? Thank you!
Hey @JoeJoe1313
Thanks for bringing this up!
Could you share a reproducible example?
It would also be nice if you could share the transformers examples
Preferably with the images
Here is the mlx example that works, including plotting the masks on top of the image: https://github.com/JoeJoe1313/LLMs-Journey/blob/main/VLMs/paligemma_segmentation_mlx.py. The prompt is "segment left wheel ; right wheel". I don't have a transformers example; I copied directly from the documentation there, but since the car image worked with both transformers and mlx-vlm I assumed it's just a model issue. This image https://big-vision-paligemma-hf.hf.space/file=/tmp/gradio/d834f0b8126a6b8422136f3b7b1403d98a2da507/cats.png prompted with "segment cat in front ; cat in back" returns only one object.
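For debugging cases like this, it can help to parse the raw decoded text and count how many objects the model actually emitted, rather than relying on the plotting code. Below is a minimal sketch of such a parser, assuming the standard PaliGemma segmentation output format: each object is four `<locXXXX>` tokens (box corners as y_min, x_min, y_max, x_max on a 0-1023 grid) followed by sixteen `<segXXX>` mask codes and a text label, with objects separated by `;`. The function name `parse_segmentation` and the example string are my own, not from mlx-vlm or transformers.

```python
import re

# PaliGemma encodes one segmented object as:
#   <locYYYY><locXXXX><locYYYY><locXXXX> + 16 <segNNN> tokens + " label"
# with ";" separating objects in the decoded output.
_LOC = re.compile(r"<loc(\d{4})>")
_SEG = re.compile(r"<seg(\d{3})>")

def parse_segmentation(text, img_w=448, img_h=448):
    """Split decoded model output into per-object dicts: pixel box, seg codes, label."""
    objects = []
    for chunk in text.split(";"):
        locs = [int(m) for m in _LOC.findall(chunk)]
        segs = [int(m) for m in _SEG.findall(chunk)]
        if len(locs) < 4:
            continue  # chunk has no box -> nothing to segment
        ymin, xmin, ymax, xmax = locs[:4]
        # loc tokens are normalized to a 0-1023 grid; rescale to pixel coords
        box = (
            ymin / 1024 * img_h, xmin / 1024 * img_w,
            ymax / 1024 * img_h, xmax / 1024 * img_w,
        )
        # the label is whatever text remains after stripping the special tokens
        label = _SEG.sub("", _LOC.sub("", chunk)).strip()
        objects.append({"box": box, "seg_codes": segs, "label": label})
    return objects

# Hypothetical two-object output, shaped like the "left wheel ; right wheel" case
out = ("<loc0512><loc0102><loc0921><loc0410>" + "<seg042>" * 16 + " left wheel ; "
       "<loc0512><loc0614><loc0921><loc0922>" + "<seg007>" * 16 + " right wheel")
parsed = parse_segmentation(out)
print(len(parsed), [o["label"] for o in parsed])  # 2 ['left wheel', 'right wheel']
```

Running this on the raw text from both backends would show whether mlx-vlm stops generating after the first object or whether the second object is emitted but dropped (or mislabeled) later in the pipeline.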
From what I understand, these models are very sensitive to prompt formatting, and the 448-3B-bf16 and 448-10B-8bit variants seem simply not powerful enough to segment multiple objects reliably. Please correct me if you have other observations.
Yes, the problem with models like this, and with some OCR models like DeepSeek-OCR, is that the prompt matters a lot.
And for such tasks it's best to use bf16 or fp16; quantized models struggle with fine details.