segment-anything

Text as prompts

Open peiwang062 opened this issue 2 years ago • 25 comments

Thanks for releasing this wonderful work! I saw the demo shows examples of using points and boxes as input prompts. Does the demo support text as a prompt?

peiwang062 avatar Apr 08 '23 05:04 peiwang062

Following! Text prompting has been mentioned in the research paper but hasn't been released yet. Really looking forward to this feature because I need it for a specific use case.

stefanjaspers avatar Apr 08 '23 17:04 stefanjaspers

Exactly, waiting for it to be released.

darvilabtech avatar Apr 09 '23 07:04 darvilabtech

Thank you for your exciting work!

I also want to use text as a prompt to generate masks in my project. Right now I am using CLIPSeg to generate the mask, but it does not perform well on fine-grained semantics.

When do you plan to open-source the code for text as a prompt? What is the approximate timeline? Waiting for this amazing work.

HaoZhang990127 avatar Apr 09 '23 10:04 HaoZhang990127

following

jy00161yang avatar Apr 09 '23 11:04 jy00161yang

following

eware-godaddy avatar Apr 09 '23 15:04 eware-godaddy

The paper mentions they use CLIP to handle text prompts:

> We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82].

The demo does not appear to allow text inputs, though.

0xbitches avatar Apr 09 '23 21:04 0xbitches
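
For reference, the "off-the-shelf text encoder from CLIP" that the quote refers to can be reproduced with any public CLIP implementation. A minimal sketch using the Hugging Face transformers CLIP API; note this is not part of the SAM release, and the released SAM code does not accept such an embedding as a prompt:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a standard public CLIP checkpoint (model choice is illustrative).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["a photo of a dog"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_embedding = model.get_text_features(**inputs)  # shape: (1, 512)

# In the paper's setup, an embedding like this would be fed to SAM's prompt
# encoder as a sparse prompt token alongside point/box embeddings; the public
# checkpoints were not released with that pathway.
print(text_embedding.shape)
```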

@peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

darvilabtech avatar Apr 09 '23 22:04 darvilabtech

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

Yes, we could simply combine the two, but if SAM can do it by itself, why would we need two models? And if we just feed Grounding DINO's output into SAM, we don't know whether Grounding DINO is the bottleneck.

peiwang062 avatar Apr 10 '23 00:04 peiwang062

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

Why not use the output of SAM as the bounding box?

alexw994 avatar Apr 10 '23 02:04 alexw994

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

> Why not use the output of SAM as the bounding box?

The current version of SAM, without the CLIP text encoder, only produces instance masks from image points or bounding boxes as prompts. So SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you can correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector, you can bridge this gap for the time being, so that SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.

narbhar avatar Apr 10 '23 05:04 narbhar
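
To make the bridging above concrete, here is a minimal sketch that prompts SAM with boxes produced by a text-promptable detector. The `text_to_boxes` helper is hypothetical (standing in for something like Grounding DINO); the SAM side uses the repository's SamPredictor box-prompt API:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (model type and path are illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def text_to_boxes(image, text):
    """Hypothetical stand-in for a text-promptable detector (e.g. Grounding DINO)
    that returns a list of xyxy boxes for regions matching `text`."""
    raise NotImplementedError

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

for box in text_to_boxes(image, "a dog"):
    # Prompt SAM with the detector's box; the mask inherits the text's semantics.
    masks, scores, _ = predictor.predict(
        box=np.asarray(box),  # (4,) array: x0, y0, x1, y1 in pixel coordinates
        multimask_output=False,
    )
```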

Put together a demo of grounded-segment-anything with a Gradio UI for easier testing. I tested using CLIP, OpenCLIP, and GroundingDINO. GroundingDINO performs much better. Less than 1 sec on an A100 for DINO+SAM. Maybe I'll add the CLIP versions as well. https://github.com/luca-medeiros/lang-segment-anything

luca-medeiros avatar Apr 10 '23 06:04 luca-medeiros

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

> Why not use the output of SAM as the bounding box?

> The current version of SAM, without the CLIP text encoder, only produces instance masks from image points or bounding boxes as prompts. So SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you can correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector, you can bridge this gap for the time being, so that SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.

If SAM just segments inside the bounding box, I think many other methods can be used for this as well, like BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg

alexw994 avatar Apr 10 '23 06:04 alexw994

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

> Why not use the output of SAM as the bounding box?

> The current version of SAM, without the CLIP text encoder, only produces instance masks from image points or bounding boxes as prompts. So SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you can correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector, you can bridge this gap for the time being, so that SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.

> If SAM just segments inside the bounding box, I think many other methods can be used for this as well, like BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg

It should be able to support boxes, points, masks, and text as prompts, as the paper mentions, no?

peiwang062 avatar Apr 10 '23 06:04 peiwang062

following

nikolausWest avatar Apr 10 '23 10:04 nikolausWest

following

yash0307 avatar Apr 10 '23 12:04 yash0307

Following

9p15p avatar Apr 11 '23 12:04 9p15p

Following

fyuf avatar Apr 11 '23 17:04 fyuf

following

Zhangwenyao1 avatar Apr 12 '23 15:04 Zhangwenyao1

Our work can achieve text-to-mask with SAM: https://github.com/xmed-lab/CLIP_Surgery

This is our work on CLIP's explainability. It is able to guide SAM to achieve text-to-mask without manual points.

Besides, it is very simple: no fine-tuning, using only the CLIP model itself.

Furthermore, it enhances many open-vocabulary tasks, like segmentation, multi-label classification, and multimodal visualization.

This is the Jupyter demo: https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb

Eli-YiLi avatar Apr 13 '23 05:04 Eli-YiLi
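
The SAM side of this kind of pipeline is the standard point-prompt API. A rough sketch, with the text-derived points behind a hypothetical `points_from_text` helper (the actual point-selection code is in the CLIP_Surgery demo notebook linked above):

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (model type and path are illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def points_from_text(image, text):
    """Hypothetical helper: return (N, 2) foreground point coordinates derived
    from a text query, e.g. from a CLIP-based text/image similarity map."""
    raise NotImplementedError

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

point_coords = points_from_text(image, "a dog")       # (N, 2) x, y pixel coords
point_labels = np.ones(len(point_coords), dtype=int)  # 1 = foreground point

# Prompt SAM with the automatically derived points instead of manual clicks.
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)
```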

following

zaojiahua avatar Apr 17 '23 04:04 zaojiahua

You can try it out using this Chrome extension: https://chrome.google.com/webstore/detail/text-prompts-for-segment/jndfmkiclniflknfifngodjnmlibhjdo/related

FrancisDacian avatar Apr 18 '23 02:04 FrancisDacian

following

ignoHH avatar Apr 21 '23 14:04 ignoHH

following

bjccdsrlcr avatar Apr 24 '23 06:04 bjccdsrlcr

following

mydcxiao avatar Apr 25 '23 06:04 mydcxiao

+1

xuxiaoxxxx avatar May 04 '23 06:05 xuxiaoxxxx

following

daminnock avatar May 19 '23 04:05 daminnock

following

Alice1820 avatar May 22 '23 06:05 Alice1820

following

freshman97 avatar Jun 06 '23 19:06 freshman97

waiting for it

zhangjingxian1998 avatar Jul 25 '23 07:07 zhangjingxian1998

following

N-one avatar Sep 15 '23 03:09 N-one