Could SAM3 be used as a VFM to obtain all class-agnostic masks for an image?
Thank you for your excellent work! In theory, SAM3 should be capable of extracting all object masks from an image, similar to previous versions in the SAM series. Could you please clarify how this functionality can be achieved in practice? I look forward to your response.
The SAMv1/v2 models did this ('automatic mask generation') by using a grid of single point prompts to generate candidate masks all over the image. The candidates are then run through a series of filtering steps to remove low-quality masks and duplicates. You can see this in the original code (mainly the process_batch function inside the automatic_mask_generator.py script).
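For reference, this is roughly how the v1 version is used in practice (parameter names are from the original segment-anything repo; the checkpoint path is just a placeholder):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAMv1 model (checkpoint path is a placeholder)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")

# The generator prompts the model with a grid of points (points_per_side^2 total)
# and then filters the candidate masks by predicted IoU, stability score and box NMS
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # density of the point-prompt grid
    pred_iou_thresh=0.88,         # drop low-confidence masks
    stability_score_thresh=0.95,  # drop masks that change a lot under threshold jitter
    box_nms_thresh=0.7,           # remove duplicate/overlapping masks
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with 'segmentation', 'area', 'bbox', ...
```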
However, SAMv3 doesn't seem to handle this well, since point prompts behave very differently from those in SAMv1/v2. One of the main issues is that it gives very low confidence scores for 'whole object' masks. So SAMv3 may not be suitable for this sort of task, or at least it may require a different approach to get it to work well.
Thank you for your reply. Regarding the open-vocabulary experiment, I would like to clarify whether it involves splitting the open concept list and processing or summarizing each concept individually. Could you also share any specific implementation details or code examples?
I'm not very familiar with the text-processing side of things, but at least code-wise, the model doesn't seem to be doing anything too special.
Most of the text-specific processing happens inside the VETextEncoder, and it seems to work very similarly to the image encoder. It has something like a 'patch embedding' (i.e. a tokenizer followed by an embedding layer), a sequence of transformer blocks (which they call the encoder), and a final projection step (which they call the resizer). After that, the model handles the text encoding similarly to how it handles point/box prompt encodings. It doesn't seem to have any (explicit) handling of individual words or summarizing of concepts, as far as I can tell.
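To make the flow more concrete, here's a minimal sketch of that tokenize -> embed -> transformer -> project pipeline. This is not the actual SAM3 code, just an illustration of the structure described above; the class name, dimensions, and layer choices are all assumptions:

```python
import torch
import torch.nn as nn

class TextEncoderSketch(nn.Module):
    """Schematic only (illustrative names, not the real VETextEncoder):
    token ids -> embedding -> transformer stack -> projection."""

    def __init__(self, vocab_size=49408, embed_dim=512, out_dim=256, num_layers=12, num_heads=8):
        super().__init__()
        # 'patch embedding' analogue: token ids -> embedding vectors
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        # 'encoder': a plain stack of transformer blocks
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # 'resizer': projects text features to the width used by the prompt embeddings
        self.resizer = nn.Linear(embed_dim, out_dim)

    def forward(self, token_ids):  # token_ids: (batch, seq_len) int64
        x = self.token_embed(token_ids)
        x = self.encoder(x)
        return self.resizer(x)  # (batch, seq_len, out_dim)

# The per-token outputs would then be consumed like point/box prompt tokens;
# note there is no per-word or concept-level summarization step in this path.
tokens = torch.randint(0, 49408, (1, 16))
text_features = TextEncoderSketch()(tokens)
```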
Thank you so much for your reply. Have a nice day!