Grounded-Segment-Anything: Grounding-DINO occupies the majority of Grounded-SAM's processing time.

Thank you for your excellent work on the Grounded-Segment-Anything project. I've noticed that developers have recently incorporated various advanced SAM models, such as Efficient-SAM and RepViT-SAM. However, it appears that the Grounding-DINO module consumes most of the processing time in Grounded-SAM. As illustrated in the attached picture, while MobileSAM takes only 0.05s, Grounding-DINO requires 1.70s, which is significantly longer. Are there any plans to optimize the Grounding-DINO module, or is there an already available off-the-shelf solution?

Dec 21 '23 15:12 xiaobanni
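For readers who want to reproduce this kind of breakdown, here is a minimal sketch of per-stage timing. The helper names `time_stages` and `shares` are hypothetical and not part of Grounded-Segment-Anything. The only numbers used are the two reported in the post above (1.70s and 0.05s). When timing real GPU models, a device synchronization before reading the clock is usually needed, which this sketch leaves out.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_stages(stages: List[Tuple[str, Callable[[], object]]]) -> Dict[str, float]:
    """Run each stage once and return its wall-clock time in seconds.

    Hypothetical helper: each stage is a zero-argument callable, e.g. a
    lambda wrapping the detector call or the SAM call.
    """
    timings: Dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        fn()  # for GPU models, synchronize the device inside fn before returning
        timings[name] = time.perf_counter() - start
    return timings


def shares(timings: Dict[str, float]) -> Dict[str, float]:
    """Convert absolute stage timings into fractions of the total."""
    total = sum(timings.values())
    return {name: t / total for name, t in timings.items()}


# Timings reported in the post above (seconds); stage names are illustrative.
reported = {"grounding_dino": 1.70, "mobile_sam": 0.05}
reported_shares = shares(reported)
print({name: round(frac, 3) for name, frac in reported_shares.items()})
```

With the reported figures, the detector accounts for roughly 97% of the combined time, which is why swapping in a faster SAM variant alone changes the end-to-end latency very little.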

Hello! For now, we do not have a smaller version of Grounding-DINO. You may replace Grounding-DINO with another lightweight open-world detection model from the community as the box prompt generator.

Dec 22 '23 03:12 rentainhe
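One way to read the suggestion above is to treat the detector as a pluggable box prompt generator. The following is a minimal sketch under that assumption. The names `BoxPromptGenerator`, `grounded_segment`, `dummy_detector`, and `dummy_segmenter` are all hypothetical stand-ins and do not correspond to any API in Grounded-Segment-Anything or in a specific detection library.

```python
from typing import Callable, List, Protocol, Tuple

# Assumed box format for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


class BoxPromptGenerator(Protocol):
    """Any callable that maps an image and a text prompt to boxes."""

    def __call__(self, image: object, text: str) -> List[Box]: ...


def grounded_segment(
    image: object,
    text: str,
    detect: BoxPromptGenerator,
    segment: Callable[[object, Box], object],
) -> list:
    """Run the detector, then pass each box to the segmenter as a prompt."""
    return [segment(image, box) for box in detect(image, text)]


def dummy_detector(image: object, text: str) -> List[Box]:
    # Stand-in for a real open-world detector; returns one fixed box.
    return [(0.0, 0.0, 10.0, 10.0)]


def dummy_segmenter(image: object, box: Box) -> dict:
    # Stand-in for a SAM-style model prompted with a box; no real mask.
    return {"box": box, "mask": None}


results = grounded_segment(None, "dog", dummy_detector, dummy_segmenter)
print(results)
```

Because the pipeline only depends on the callable's signature, a lighter detector could be dropped in by wrapping it so that it returns boxes in the format the segmenter expects.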

@rentainhe Thank you for your quick and friendly response. I am not a professional in the field of image segmentation; I just want to use this technology in downstream applications. After some research, I could not find any practical alternative to Grounding-DINO. Could you recommend some potential solutions for me to try? This need also seems to be common, as evidenced by the widespread discussion in the following link.

Dec 23 '23 14:12 xiaobanni

Does GLIP offer the same functionality and performance? Compared with Grounding-DINO, can GLIP be seen as a combination of the Grounding-DINO detector and BLIP? GLIP seems to support arbitrary text retrieval and object localization. Can it also output descriptive text for an image?

Jun 18 '24 08:06 HaoqianSong