segment-anything

Text as prompts

Open peiwang062 opened this issue 2 years ago • 25 comments

Thanks for releasing this wonderful work! I saw the demo shows examples of using points and boxes as input prompts. Does the demo support text as a prompt?

peiwang062 avatar Apr 08 '23 05:04 peiwang062

Following! Text prompting has been mentioned in the research paper but hasn't been released yet. Really looking forward to this feature because I need it for a specific use case.

stefanjaspers avatar Apr 08 '23 17:04 stefanjaspers

Exactly, waiting for it to be released.

darvilabtech avatar Apr 09 '23 07:04 darvilabtech

Thank you for your exciting work!

I also want to use text as a prompt to generate masks in my project. Right now I am using CLIPSeg to generate the mask, but it does not perform well on fine-grained semantics.

When do you plan to open-source the code for text as a prompt? What is the approximate timeline? Waiting for this amazing work.

HaoZhang990127 avatar Apr 09 '23 10:04 HaoZhang990127

following

jy00161yang avatar Apr 09 '23 11:04 jy00161yang

following

eware-godaddy avatar Apr 09 '23 15:04 eware-godaddy

The paper mentions they use CLIP to handle text prompts:

> We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82].

The demo does not appear to allow text inputs, though.

0xbitches avatar Apr 09 '23 21:04 0xbitches
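
For reference, the "off-the-shelf text encoder from CLIP" that the quote refers to can be reproduced with any public CLIP implementation. A minimal sketch using the Hugging Face transformers CLIP API; note this is not part of the SAM release, and the released SAM code does not accept such an embedding as a prompt:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a standard public CLIP checkpoint (model choice is illustrative).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["a photo of a dog"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_embedding = model.get_text_features(**inputs)  # shape: (1, 512)

# In the paper's setup, an embedding like this would be fed to SAM's prompt
# encoder as a sparse prompt token alongside point/box embeddings; the public
# checkpoints were not released with that pathway.
print(text_embedding.shape)
```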

@peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

darvilabtech avatar Apr 09 '23 22:04 darvilabtech

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

Yes, we could simply combine the two, but if SAM can do it by itself, why would we need two models? And if we just feed Grounding DINO's output into SAM, we don't know whether Grounding DINO is the bottleneck.

peiwang062 avatar Apr 10 '23 00:04 peiwang062

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

Why not use the output of SAM as the bounding box?

alexw994 avatar Apr 10 '23 02:04 alexw994

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

> Why not use the output of SAM as the bounding box?

The current version of SAM, without the CLIP text encoder, only produces instance masks from image points or bounding boxes as prompts. So SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you can correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector, you can bridge this gap for the time being, so that SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.

narbhar avatar Apr 10 '23 05:04 narbhar
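
To make the bridging above concrete, here is a minimal sketch that prompts SAM with boxes produced by a text-promptable detector. The `text_to_boxes` helper is hypothetical (standing in for something like Grounding DINO); the SAM side uses the repository's SamPredictor box-prompt API:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (model type and path are illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def text_to_boxes(image, text):
    """Hypothetical stand-in for a text-promptable detector (e.g. Grounding DINO)
    that returns a list of xyxy boxes for regions matching `text`."""
    raise NotImplementedError

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

for box in text_to_boxes(image, "a dog"):
    # Prompt SAM with the detector's box; the mask inherits the text's semantics.
    masks, scores, _ = predictor.predict(
        box=np.asarray(box),  # (4,) array: x0, y0, x1, y1 in pixel coordinates
        multimask_output=False,
    )
```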

Put together a demo of grounded-segment-anything with a Gradio UI for easier testing. I tested using CLIP, OpenCLIP, and GroundingDINO. GroundingDINO performs much better. Less than 1 sec on an A100 for DINO+SAM. Maybe I'll add the CLIP versions as well. https://github.com/luca-medeiros/lang-segment-anything

luca-medeiros avatar Apr 10 '23 06:04 luca-medeiros

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

> Why not use the output of SAM as the bounding box?

> The current version of SAM, without the CLIP text encoder, only produces instance masks from image points or bounding boxes as prompts. So SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you can correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector, you can bridge this gap for the time being, so that SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.

If SAM just segments inside the bounding box, I think many other methods can be used for this as well, like BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg

alexw994 avatar Apr 10 '23 06:04 alexw994

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

> Why not use the output of SAM as the bounding box?

> The current version of SAM, without the CLIP text encoder, only produces instance masks from image points or bounding boxes as prompts. So SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you can correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector, you can bridge this gap for the time being, so that SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.

> If SAM just segments inside the bounding box, I think many other methods can be used for this as well, like BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg

It should be able to support boxes, points, masks, and text as prompts, as the paper mentions, no?

peiwang062 avatar Apr 10 '23 06:04 peiwang062

following

nikolausWest avatar Apr 10 '23 10:04 nikolausWest

following

yash0307 avatar Apr 10 '23 12:04 yash0307

Following

9p15p avatar Apr 11 '23 12:04 9p15p

Following

fyuf avatar Apr 11 '23 17:04 fyuf

following

Zhangwenyao1 avatar Apr 12 '23 15:04 Zhangwenyao1

Our work can achieve text-to-mask with SAM: https://github.com/xmed-lab/CLIP_Surgery

This is our work on CLIP's explainability. It is able to guide SAM to achieve text-to-mask without manual points.

Besides, it is very simple: no fine-tuning, using only the CLIP model itself.

Furthermore, it enhances many open-vocabulary tasks, like segmentation, multi-label classification, and multimodal visualization.

This is the Jupyter demo: https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb

Eli-YiLi avatar Apr 13 '23 05:04 Eli-YiLi
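
The SAM side of this kind of pipeline is the standard point-prompt API. A rough sketch, with the text-derived points behind a hypothetical `points_from_text` helper (the actual point-selection code is in the CLIP_Surgery demo notebook linked above):

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (model type and path are illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def points_from_text(image, text):
    """Hypothetical helper: return (N, 2) foreground point coordinates derived
    from a text query, e.g. from a CLIP-based text/image similarity map."""
    raise NotImplementedError

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

point_coords = points_from_text(image, "a dog")       # (N, 2) x, y pixel coords
point_labels = np.ones(len(point_coords), dtype=int)  # 1 = foreground point

# Prompt SAM with the automatically derived points instead of manual clicks.
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)
```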

following

zaojiahua avatar Apr 17 '23 04:04 zaojiahua

You can try it out using this Chrome extension: https://chrome.google.com/webstore/detail/text-prompts-for-segment/jndfmkiclniflknfifngodjnmlibhjdo/related

FrancisDacian avatar Apr 18 '23 02:04 FrancisDacian

following

ignoHH avatar Apr 21 '23 14:04 ignoHH

following

bjccdsrlcr avatar Apr 24 '23 06:04 bjccdsrlcr

following

mydcxiao avatar Apr 25 '23 06:04 mydcxiao

+1

xuxiaoxxxx avatar May 04 '23 06:05 xuxiaoxxxx

following

daminnock avatar May 19 '23 04:05 daminnock

following

Alice1820 avatar May 22 '23 06:05 Alice1820

following

freshman97 avatar Jun 06 '23 19:06 freshman97

waiting for it

zhangjingxian1998 avatar Jul 25 '23 07:07 zhangjingxian1998

following

N-one avatar Sep 15 '23 03:09 N-one