Segment-Everything-Everywhere-All-At-Once

Differences between SEEM Focal-L and the Huggingface Demo model?

Open · dchichkov opened this issue 1 year ago · 7 comments

The demo outputs a warning:

The current model is run on SEEM Focal-L, for best performance refer to our demo.

And the performance of that model seems to be worse than the SEEM demo on Huggingface. In particular, I've noticed that segmentations from referring text are worse: they "splash" onto neighboring objects. What are the differences between the official demo on Huggingface and the published SEEM Focal-L checkpoint and config?

dchichkov avatar May 08 '23 21:05 dchichkov

I found the same difference between the Focal-L checkpoint and the "for best performance refer to [our demo]" model.

xiezhang666 avatar May 09 '23 03:05 xiezhang666

Maybe related:

  • https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once/issues/24#issuecomment-1527719127

woctezuma avatar May 09 '23 11:05 woctezuma

Our demo uses the Davit-d5 backbone, which is different from Focal-L.

MaureenZOU avatar May 11 '23 15:05 MaureenZOU

Is there a way to adjust some threshold to reduce the tendency to "splash" onto neighboring objects when segmentation with referring text is used?

When the segmentation mask for a referring text splashes onto neighboring objects, the usefulness of the model is very limited. In the default segmentation mode, many less common objects simply get ignored, so their segments are not available. And when the model is then prompted for a particular object (i.e., we know the object should be there), the segment splashes onto neighboring objects, which again is not useful.
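Something like the post-processing sketch below is what I have in mind: raise the cutoff and keep only the largest connected component of the predicted mask. This is only my own sketch; the prob_map input, the 0.7 cutoff, and the scipy dependency are assumptions on my side, not part of the SEEM API.

import numpy as np
from scipy import ndimage

def tighten_mask(prob_map: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Tighten a referring-segmentation probability map of shape (H, W), values in [0, 1]."""
    binary = prob_map >= threshold                        # stricter cut than the usual 0.5
    labeled, num = ndimage.label(binary)                  # split the mask into connected components
    if num == 0:
        return binary                                     # nothing survives the threshold
    sizes = ndimage.sum(binary, labeled, range(1, num + 1))
    largest = int(np.argmax(sizes)) + 1                   # component labels start at 1
    return labeled == largest                             # drop the "splash" onto neighbors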

And it'd be great to have that checkpoint from the official demo 🐶

dchichkov avatar May 11 '23 20:05 dchichkov

Thanks so much for the comments, could you please provide some example images? Again, for referring segmentation, I highly suggest using X-Decoder instead of SEEM, as SEEM is ONLY trained with COCO.

MaureenZOU avatar May 11 '23 21:05 MaureenZOU

Thank you!

For SEEM/X-Decoder, I see the checkpoint, but I can't find the config in the repository. Should it be something like xdecoder_focall_lang.yaml?

Sure, if it helps, here's the result where I see the segmentation spreading onto nearby pixels: [image] And here is the original image; the prompt was "forklift": [test]

dchichkov avatar May 19 '23 20:05 dchichkov

Thanks so much for the comments, could you please provide some example images? Again, for referring segmentation, I highly suggest using X-Decoder instead of SEEM, as SEEM is ONLY trained with COCO.

When I try this, I get this error message:

'GeneralizedXdecoder' object has no attribute 'evaluate_demo'

Following the demo code, here's what I'm doing:

import os

import torch

from modeling.BaseModel import BaseModel
from modeling import build_model
from utils.distributed import init_distributed
from utils.arguments import load_opt_from_config_files
from utils.constants import COCO_PANOPTIC_CLASSES

from demo.seem.tasks import interactive_infer_image

# X-Decoder over SEEM for referring segmentation:
# https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once/issues/33#issuecomment-1544714768
opt = load_opt_from_config_files(["configs/xdecoder/focall_unicl_lang.yaml"])
opt = init_distributed(opt)

cur_model = 'Focal-L'
checkpoints_folder = "/path/to/folder"
checkpoint_name = "xdecoder_focall_last.pt"
pretrained_pth = os.path.join(checkpoints_folder, checkpoint_name)

# Build the model from the X-Decoder config and load the published checkpoint.
model = BaseModel(opt, build_model(opt)).from_pretrained(pretrained_pth).eval().cuda()
with torch.no_grad():
    model.model.sem_seg_head.predictor.lang_encoder.get_text_embeddings(COCO_PANOPTIC_CLASSES + ["background"], is_eval=True)


audio = None

@torch.no_grad()
def inference(image, task, *args, **kwargs):
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        return interactive_infer_image(model, audio, image, task, *args, **kwargs)

# main_image, clothing_image, and clothing_image_mask_3d are loaded elsewhere.
result_image = interactive_infer_image(
    model=model,  # model built from the X-Decoder config above
    image={'image': main_image},
    tasks=["Example"],
    refimg={"image": clothing_image, "mask": clothing_image_mask_3d},
)  # crashes here with the AttributeError above
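For context on the crash: the error message suggests interactive_infer_image ends up calling model.model.evaluate_demo(...), and the GeneralizedXdecoder model built from the X-Decoder config does not appear to implement that method. A guard like the sketch below (my own addition, not code from the repo) would make the mismatch explicit before the call:

# Sketch only: fail fast if the loaded model cannot serve the SEEM interactive demo path.
if not hasattr(model.model, "evaluate_demo"):
    raise RuntimeError(
        "This config builds an X-Decoder model without evaluate_demo(); "
        "interactive_infer_image from demo/seem/tasks expects a SEEM model/config."
    )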

LWprogramming avatar Dec 20 '23 19:12 LWprogramming