Ambiguity in Semantic Segmentation Masks from SAM3's Text-to-Mask Prediction
Does SAM3’s approach of taking text input and producing segmentation masks imply that it cannot generate truly semantic segmentation masks? For example, in an image where a patch of lawn has sparse grass coverage, inputting "grass" or "soil" might both result in the same region being labeled as both grass and soil—leading to class ambiguity for that area?
What you are writing is partially true, but incomplete: it's fairly easy to write some heuristics to generate "truly semantic segmentation masks" from SAM3's outputs (and that's what we did to report the semantic segmentation results in the paper). The easiest heuristic is to compute the argmax of the logits from every prompt (= class) at every pixel.
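A minimal sketch of that argmax heuristic (the function name and the per-prompt logit-map inputs are assumptions for illustration, not SAM3's actual API):

```python
import numpy as np

def argmax_semantic_map(class_logit_maps):
    """Fuse per-prompt logit maps into a single semantic segmentation map.

    class_logit_maps: list of (H, W) float arrays, one per class prompt,
    holding the model's pixel-wise logits for that prompt.
    Returns an (H, W) integer map: the highest-scoring class at each pixel.
    """
    stacked = np.stack(class_logit_maps, axis=0)  # (num_classes, H, W)
    return np.argmax(stacked, axis=0)
```

Ties and "no class fits" pixels still need a policy (e.g. a background channel or a logit threshold), which is exactly where the dataset-specific ambiguity discussed below creeps back in.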
That being said, in the open-vocabulary setting I personally think trying to obtain "truly semantic segmentation masks" is almost always misguided for a non-trivial (say > 10) number of classes. The reason is that there is almost always ambiguity about what each class actually means and how it should be segmented. As a result, assumptions are baked into the existing datasets to resolve this in arbitrary ways that do not necessarily reflect zero-shot usage of the model.
Some examples:
- Cityscapes contains the classes "person" and "rider". Obviously this is nonsensical because a rider is a person. So what they actually mean is "rider" and "person who is not a rider". But that becomes cumbersome (and prompts like "person who is not a rider" are not atomic, hence not officially supported by SAM3). It also has "wall" and "building", where there is also a part-whole relation. Here "wall" actually means "free-standing wall", which is a bit of an odd concept.
- COCO has similar issues. E.g., it has the general "food" semantic class, but also some actual food concepts like "banana".
These issues are not important in a closed vocabulary setting, because the model will learn what the dataset-specific biases are. But in the open-vocabulary setting which we're tackling, these issues are more critical, and essentially explain why we chose the binary mask prediction for every class.
Hope this helps.
I was wondering if the code used for evaluating semantic segmentation will be published. With my own code, I am able to get 37.94% mIoU on ADE-150, as opposed to the reported 39.0% in the paper.
Could you share how you got the semantic segmentation results? Thanks a lot
I'm also interested in this, since the paper states they have a dedicated semantic segmentation head (adapted from MaskFormer), however I haven't been able to find an example which uses this head.
@cryingdxy I used this function for evaluation:
```python
import numpy as np

def predict_semantic_map(processor, image, labels):
    h, w = image.height, image.width
    num_classes = len(labels)
    # Channel 0 is the background ("no class") channel.
    score_maps = np.zeros((num_classes + 1, h, w), dtype=np.float32)
    state = processor.set_image(image)
    for cls_idx in range(num_classes):
        class_name = labels[cls_idx]
        cls_id = cls_idx + 1
        state = processor.set_text_prompt(prompt=class_name, state=state)
        masks = state["masks"].cpu().numpy()
        scores = state["scores"].cpu().numpy()
        for mask, score in zip(masks, scores):
            if score >= MASK_SCORE_THRESHOLD:
                mask_2d = mask[0]
                valid_pixels = np.sum(mask_2d)
                if valid_pixels >= MIN_MASK_AREA_PIXELS:
                    # Accumulate confidence-weighted masks per class.
                    score_maps[cls_id] += mask_2d.astype(np.float32) * score
    pred_seg = np.argmax(score_maps, axis=0).astype(np.uint8)
    return pred_seg
```
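To score the predicted maps against ground truth, a standard mIoU computation looks something like the sketch below (this is a generic implementation, not the paper's evaluation code; it assumes label 0 marks ignored/background pixels in the ground truth, matching the `score_maps` layout above):

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=0):
    """Mean IoU over classes 1..num_classes.

    pred, gt: (H, W) integer label maps.
    Pixels where gt == ignore_index are excluded; classes absent from
    both pred and gt are skipped rather than counted as 0.
    """
    valid = gt != ignore_index
    ious = []
    for cls in range(1, num_classes + 1):
        pred_c = (pred == cls) & valid
        gt_c = gt == cls
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from this image
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Note that benchmark scripts typically accumulate intersection/union counts over the whole dataset before dividing, rather than averaging per-image IoUs; that choice alone can shift the reported number by a point or so.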
Can you share the MASK_SCORE_THRESHOLD and MIN_MASK_AREA_PIXELS settings you used? It would really help me out!
@JuiceCoffe I used these values:
MASK_SCORE_THRESHOLD = 0.7
MIN_MASK_AREA_PIXELS = 25
I'm wondering: during model inference, did you prompt only with the classes present in the ground-truth labels, or did you run inference with all 150 ADE-150 classes and then filter the results? Because if it's the latter, the mIoU I measure is only 32%.