Grounding-DINO occupies the majority of Grounded-SAM's processing time.

Open xiaobanni opened this issue 2 years ago • 3 comments

Thank you for your excellent work on the Grounded-Segment-Anything project. I've noticed that developers have recently incorporated several lighter SAM variants, such as Efficient-SAM and RepViT-SAM. However, the Grounding-DINO module appears to consume most of the processing time in Grounded-SAM. As shown in the attached picture, MobileSAM takes only 0.05s while Grounding-DINO requires 1.70s, about 34 times longer. Are there any plans to optimize the Grounding-DINO module, or is there an off-the-shelf solution already available?

xiaobanni avatar Dec 21 '23 15:12 xiaobanni

Hello! For now, we do not have a smaller version of Grounding-DINO. You can replace Grounding-DINO with another lightweight open-world detection model from the community and use it as the box prompt generator.
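As a rough illustration of that suggestion (not code from this repository), the sketch below treats the box prompt generator as a plain callable so that a lighter detector can be dropped in, and times each stage separately. The names `run_pipeline`, `dummy_detector`, and `dummy_segmenter` are hypothetical stand-ins, not Grounded-SAM APIs.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed pixel coordinates


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_pipeline(
    image: Any,
    prompt: str,
    detect_boxes: Callable[[Any, str], List[Box]],
    segment_boxes: Callable[[Any, List[Box]], List[Any]],
) -> Tuple[List[Any], Dict[str, float]]:
    """Run detector then segmenter, reporting the time spent in each stage."""
    boxes, t_det = timed(detect_boxes, image, prompt)
    masks, t_seg = timed(segment_boxes, image, boxes)
    return masks, {"detector_s": t_det, "segmenter_s": t_seg}


# Hypothetical stand-ins: replace with a real detector and a real SAM variant.
def dummy_detector(image: Any, prompt: str) -> List[Box]:
    return [(0.0, 0.0, 10.0, 10.0)]


def dummy_segmenter(image: Any, boxes: List[Box]) -> List[str]:
    return [f"mask for {box}" for box in boxes]


masks, timings = run_pipeline(None, "cat", dummy_detector, dummy_segmenter)
print(timings)
```

Any model that maps an image and a text prompt to boxes could be passed as `detect_boxes`. When timing PyTorch models on a GPU, the kernels run asynchronously, so calling `torch.cuda.synchronize()` before reading the clock is usually needed for the numbers to be meaningful.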

rentainhe avatar Dec 22 '23 03:12 rentainhe

@rentainhe Thank you for your quick and friendly response. I am not a professional in the field of image segmentation; I just want to use this technology in downstream applications. After some research, I didn't find any practical alternatives to Grounding-DINO. Could you recommend some potential solutions for me to try? I also found that this need might be common, as evidenced by the widespread discussion in the following link.

xiaobanni avatar Dec 23 '23 14:12 xiaobanni

Does GLIP offer the same functionality and comparable results? Compared with Grounding-DINO, can GLIP be seen as a combination of the Grounding-DINO detector and BLIP? GLIP seems to support retrieval with arbitrary text and object localization. Can it also output a text description of an image?

HaoqianSong avatar Jun 18 '24 08:06 HaoqianSong