
Could SAM3 be used as a VFM to obtain all class-agnostic masks for an image?

Open wendy26zhang opened this issue 3 weeks ago • 3 comments

Thank you for your excellent work! In theory, SAM3 should be capable of extracting all object masks from an image, similar to previous versions in the SAM series. Could you please clarify how this functionality can be achieved in practice? I look forward to your response.

wendy26zhang avatar Dec 22 '25 06:12 wendy26zhang

The SAMv1/v2 models did this ('automatic mask generation') by using a grid of single-point prompts to generate candidate masks all over the image, followed by a series of filtering steps to remove low-quality masks and duplicates. You can see this in the original code (mainly the process_batch function inside the automatic_mask_generator.py script).
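
For reference, this is roughly how that mode is invoked with the original segment_anything package (the checkpoint path, model variant, and threshold values below are just placeholders):

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load a SAMv1 checkpoint (variant/path are placeholders)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# The generator prompts the model with a grid of single points, then filters
# candidates by predicted IoU / stability score and de-duplicates overlapping masks
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # density of the point-prompt grid
    pred_iou_thresh=0.88,         # drop low-confidence masks
    stability_score_thresh=0.95,  # drop unstable masks
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'area', 'predicted_iou', ...
```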

However, SAMv3 doesn't seem to handle this well, since point prompts behave very differently compared to SAMv1/v2. One of the main issues is that it gives very low confidence scores for 'whole object' masks. So SAMv3 may not be suitable for this sort of task, or at least it may require a different approach to get it to work well.

heyoeyo avatar Dec 23 '25 13:12 heyoeyo

Thank you for your reply. Regarding the open-vocabulary experiment, I would like to clarify whether it involves splitting the open-vocabulary concept list and processing or summarizing each concept individually. Could you also share any specific implementation details or code examples?

wendy26zhang avatar Dec 23 '25 13:12 wendy26zhang

I'm not very familiar with the text-processing side of things, but at least code-wise, the model doesn't seem to be doing anything too special.

Most of the text-specific processing happens inside the VETextEncoder and it seems to work very similarly to the image encoder. It has something like a 'patch embedding' (i.e. tokenizer followed by an embedding), a transformer block sequence (which they call the encoder) and a final projection step (which they call the resizer). After that, the model handles the text encoding similarly to how it handles point/box prompt encodings. It doesn't seem to have any (explicit) handling of words or summarizing concepts, as far as I can tell.
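
In (very) rough pseudocode, the flow described above would look something like the sketch below. This is not the actual SAM3 implementation; the dimensions, depth, and layer choices are made-up placeholders, and only the encoder/resizer naming follows the description above:

```python
import torch
import torch.nn as nn

class TextEncoderSketch(nn.Module):
    """Rough sketch of the VETextEncoder data flow (not the real config)."""

    def __init__(self, vocab_size=49408, text_dim=512, out_dim=256, depth=12):
        super().__init__()
        # 'Patch embedding' analogue: token IDs -> embedding vectors
        self.token_embed = nn.Embedding(vocab_size, text_dim)
        # Transformer block sequence (the 'encoder')
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Final projection (the 'resizer'), mapping to the prompt-embedding width
        self.resizer = nn.Linear(text_dim, out_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.token_embed(token_ids)  # (B, T) -> (B, T, text_dim)
        x = self.encoder(x)              # contextualized text tokens
        return self.resizer(x)           # (B, T, out_dim), used like point/box prompt tokens

# The whole prompt string goes through tokenize -> embed -> encode -> resize as
# one sequence; no explicit per-word splitting or concept summarizing in that path.
```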

heyoeyo avatar Dec 23 '25 15:12 heyoeyo

Thank you so much for your reply. Have a nice day!

wendy26zhang avatar Dec 24 '25 02:12 wendy26zhang