Naive question on point/box prompt
Hello,
I was able to get very good results using auto mode. I have been trying segmentation with points on an image with many animals, where I manually created points on each animal. Using those points and labels, I only got one mask, whereas auto mode clearly segmented every animal correctly with a different colored mask for each.
Can someone explain how point prompts are to be used correctly? Do we have to call SAM in a loop and give only one point at a time? Or are all the points given at once combined into one mask?
In the prediction use-case (like the first example under the Getting Started section), each time you call the .predict(...) function, you're going to get a mask as if it was for a single object. So if you have many animals in an image, you'd need to run the predict function many times (one for each animal, with different point/box prompts for each) to get separate masks for each of the animals. Like you said, you'd run it 'in a loop' (although only the predict function, you don't need to re-run the .set_image(...) part every time).
You can technically do these all together using a batched input, but that's more of an optimization thing, it's not really relevant unless you're trying to get this running as fast as possible.
As for the points, you can give multiple points for a single mask output (i.e. single animal), you can also give foreground vs. background points. However, when you give multiple points, they're interpreted as belonging to the same object to be masked. So for example, if you gave points belonging to two different animals, you're asking the model to generate a mask for 'one object' that somehow contains both animals, which is probably going to give bad results. Where multiple points seem to be useful is eliminating ambiguity. Let's say one of the animals is a zebra, and you had a single point that was on one of the zebra's stripes. It's possible the model would try to segment the stripe and not the whole zebra, in this case, you could provide extra points on other (non-striped) parts of the zebra to hint that you want to segment the entire animal, as opposed to segmenting as if it's 'black stripes on a white background'.
I believe you can also provide multiple boxes and even a mix of boxes + points (and again, it's treated as representing a single object), though I've found that if you're using boxes, having only 1 (and no points) tends to work the best/most consistent.
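If it helps to see it concretely, here's a minimal sketch of that loop idea (assuming the predictor has already been set up like in the example notebook; the point coordinates are just made-up placeholders):
# Minimal sketch: one predict call per animal (assumes 'predictor' is a configured SamPredictor)
import numpy as np

predictor.set_image(image)  # only needs to run once per image

# Placeholder: one foreground point per animal
animal_points = [(100, 200), (350, 220), (500, 410)]

all_masks = []
for (x, y) in animal_points:
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),  # 1 = foreground
        multimask_output=False,
    )
    all_masks.append(masks[0])  # one mask per animal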
Thank you very much for a wonderful explanation. I was wondering if auto mode works similarly, i.e. with the 32x32 grid of prompts, is the loop executed 1024 times with each mask processed further?
I'm not as familiar with the auto mode stuff, but I believe you're correct, it runs as if it's a bunch of single-point prompts in a grid, then does some work to clean up overlapping/small masks.
The actual implementation uses batched input points. For example, the loop part of the script only runs 16 times (with default settings), but uses a batch size of 64 points to process the full set of 1024 points. Changing the batch size to 1 forces the loop to run 1024 times, which ends up taking twice as long (on my machine) to complete. The results look identical to me, so it seems the auto mode batching is equivalent to just running individual points through the model (it's just done in a more efficient way for the GPU).
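For reference, those settings map to arguments on the automatic mask generator; here's a rough sketch using what I believe are the library defaults:
# Sketch of the auto-mode setup (values shown are the defaults, as far as I know)
from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,   # 32 x 32 = 1024 prompt points in total
    points_per_batch=64,  # 1024 / 64 = 16 batches through the model
)
masks = mask_generator.generate(image)  # list of per-object mask results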
Thanks a lot for the clear explanation.
Hello, I am learning about SAM and have a question. For an image, if a bbox of an object is entered into SAM, is the generated mask just the object itself, or all similar objects in the image?
For example, if I input only one bike seat bbox, is the resulting image just a mask of that seat, or a mask of all bike seats?
is the generated mask just the object itself, or all similar objects in the image?
SAM only outputs a mask for the selected object based on the prompt (bbox or points).
To get all similar objects, you'd generally need a model trained to understand what counts as 'similar' (e.g. similar position in the image? similar size? similar lighting? similar orientation?). Semantic segmentation models are usually trained to do something like this and might work for an image like yours, but I don't know if 'bike seat' is a class a pretrained model would handle.
Aside from training a custom model, you might have better luck using something like grounding dino (with SAM even), which takes a text prompt as input, so it may generalize better to uncommon classes.
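Very roughly, that kind of pipeline gets boxes from a text prompt and then hands each box to SAM. A sketch of the idea (get_boxes_from_text here is just a placeholder for whatever Grounding DINO wrapper you end up using, not a real function, and it assumes a configured SamPredictor):
# Hypothetical pipeline sketch: text prompt -> boxes -> SAM masks
# 'get_boxes_from_text' is a stand-in for a Grounding DINO call, not a real API
import numpy as np

boxes = get_boxes_from_text(image, prompt="bike seat")  # e.g. [[x1, y1, x2, y2], ...]

predictor.set_image(image)
seat_masks = []
for box in boxes:
    masks, scores, logits = predictor.predict(
        box=np.array(box),
        multimask_output=False,
    )
    seat_masks.append(masks[0])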
Thank you for your reply, I understand the issue now. Best regards!
@heyoeyo
Hi friend, thank you very much for your guidance and advice. However, I have a few questions I'd like to ask:
- In non-auto mode, do I need to manually define the points and then feed them into the model? How should I write this part of the code? I don't have a clear idea—could you please give me some guidance?
- If they are manually defined, should I iteratively feed multiple points for the same target into the model? I would greatly appreciate it if you could provide relevant code examples.
do I need to manually define the points and then feed them into the model? How should I write this part of the code?
The predictor_example notebook has examples for how to do this. Here's the basic idea (taken from the Selecting objects with SAM section of the notebook):
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load model
ckpt, mtype, device = "sam_vit_h_4b8939.pth", "vit_h", "cuda"
sam = sam_model_registry[mtype](checkpoint=ckpt)
sam.to(device=device)
predictor = SamPredictor(sam)

# Load & process image data
image = cv2.imread("path/to/image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Provide a single foreground prompt point to generate segmentation masks
input_point = np.array([[100, 200]])
input_label = np.array([1])  # '1' means foreground point, '0' means background
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)
should I iteratively feed multiple points for the same target into the model?
If you're providing multiple points for a single object, you'll want to pass them in altogether rather than iteratively. Compared to the code above, it's just a matter of updating the input_point & input_label values to store additional points:
input_point = np.array([[100, 200], [300, 400]])
input_label = np.array([1, 1])
That same notebook has examples for this as well, under the Specifying a specific object with additional points section.
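In case it's useful later, a box prompt goes through the same predict call, for example (the coordinates here are made up):
# Single box prompt in (x1, y1, x2, y2) format, no points
input_box = np.array([50, 75, 400, 380])  # example coordinates only
masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False,
)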
@heyoeyo Thank you so much for your help. I was too careless and didn't notice this document. Thank you for pointing it out!!! I really appreciate your detailed responses after a year of research. Thank you, you're amazing!