Segment-and-Track-Anything Key Differences Between Text-Prompt with Automatic Tracking and Invoking Grounded-SAM/Lang-SAM Per Frame

Thank you for the wonderful project. I see that it offers a text-prompt mode with automatic tracking, which lets me obtain a segmentation mask for each frame from a text prompt supplied in advance. However, I would like to understand how this differs from simply invoking Grounded-SAM/Lang-SAM on every frame with the same text prompt.

Oct 22 '23 07:10 xiaobanni

The key difference is that we additionally use the CMR module to determine whether the objects detected by Grounding-DINO are newly appearing objects in the video.
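
To make the idea of a "newly appearing object" check concrete, here is a rough sketch. It is not the project's CMR implementation; it only assumes that a detection counts as new when most of its mask is not already covered by a tracked mask. The names `covered_fraction`, `find_new_objects`, and the `max_covered` threshold are hypothetical and chosen for illustration.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)
Mask = Set[Pixel]        # a mask stored as the set of its foreground pixels


def covered_fraction(candidate: Mask, tracked: List[Mask]) -> float:
    """Fraction of the candidate's pixels already covered by any tracked mask."""
    if not candidate:
        return 1.0  # an empty detection contributes nothing new
    already_tracked: Mask = set().union(*tracked)
    return len(candidate & already_tracked) / len(candidate)


def find_new_objects(
    detected: List[Mask],
    tracked: List[Mask],
    max_covered: float = 0.5,  # hypothetical threshold, not taken from the project
) -> List[int]:
    """Indices of detections that are mostly outside every tracked mask."""
    return [
        index
        for index, mask in enumerate(detected)
        if covered_fraction(mask, tracked) < max_covered
    ]


# One tracked object, two detections: the first overlaps it, the second does not.
tracked_masks = [{(0, 0), (0, 1), (1, 0), (1, 1)}]
detected_masks = [{(0, 0), (0, 1), (1, 1)}, {(5, 5), (5, 6)}]
new_indices = find_new_objects(detected_masks, tracked_masks)
```

Calling Grounded-SAM on every frame has no step like this, so each frame's masks are independent and object identities are not carried across frames.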

Nov 04 '23 06:11 yamy-cheng

So, if I only need the objects that match the text prompt, I just need to invoke Grounding DINO to get the boxes and then send them to SAM, right?

Nov 06 '23 07:11 xiaobanni

Yes, you are right.
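
For reference, the per-frame flow described above might be sketched as follows. This is only an outline: `detect_boxes` and `segment_box` are hypothetical callables that you would implement yourself as thin wrappers around Grounding DINO and SAM, and they are not functions from either library.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
MaskT = TypeVar("MaskT")
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def segment_video_per_frame(
    frames: Iterable[Frame],
    text_prompt: str,
    detect_boxes: Callable[[Frame, str], Sequence[Box]],  # hypothetical Grounding DINO wrapper
    segment_box: Callable[[Frame, Box], MaskT],           # hypothetical SAM wrapper
) -> List[List[MaskT]]:
    """Run text-prompted detection and box-prompted segmentation on every frame."""
    results: List[List[MaskT]] = []
    for frame in frames:
        boxes = detect_boxes(frame, text_prompt)  # text prompt -> boxes
        results.append([segment_box(frame, box) for box in boxes])  # box -> mask
    return results


# Stand-in callables so the control flow can be exercised without the real models.
def fake_detect(frame: str, prompt: str) -> Sequence[Box]:
    return [(0.0, 0.0, 1.0, 1.0)] if prompt in frame else []


def fake_segment(frame: str, box: Box) -> str:
    return f"mask({frame})"


demo = segment_video_per_frame(["a dog", "a cat", "dog again"], "dog", fake_detect, fake_segment)
```

Each frame is handled on its own here, so the masks carry no object identity from one frame to the next.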

Nov 08 '23 12:11 yamy-cheng

I am a researcher in a different field who wants to use text-to-segment on long videos. I have found that invoking Grounding DINO is time-consuming, and that cost is unacceptable for many use cases. I wonder whether the temporal continuity of video could be exploited to avoid invoking Grounding DINO on every frame. However, this would probably require additional processing to determine, at each frame, whether a new entity matching the text prompt has appeared. I hope that researchers in this field can explore this kind of use case further. I would also welcome recommendations for suitable projects.
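
To illustrate what I have in mind, here is a minimal sketch, assuming a cheap mask propagation step is available: run the expensive detector only every few frames and propagate masks in between. The callables `detect` and `propagate` and the `detect_every` parameter are hypothetical placeholders, not part of any existing project's API.

```python
from typing import Callable, Iterable, List, TypeVar

Frame = TypeVar("Frame")
MaskT = TypeVar("MaskT")


def track_with_sparse_detection(
    frames: Iterable[Frame],
    text_prompt: str,
    detect: Callable[[Frame, str], List[MaskT]],             # expensive, hypothetical
    propagate: Callable[[Frame, List[MaskT]], List[MaskT]],  # cheap, hypothetical
    detect_every: int = 10,
) -> List[List[MaskT]]:
    """Detect on every `detect_every`-th frame, propagate masks on the others."""
    if detect_every < 1:
        raise ValueError("detect_every must be at least 1")
    masks_per_frame: List[List[MaskT]] = []
    current: List[MaskT] = []
    for index, frame in enumerate(frames):
        if index % detect_every == 0:
            # Simplification: detections replace the tracked masks outright.
            # A real system would merge them to keep object identities.
            current = detect(frame, text_prompt)
        else:
            current = propagate(frame, current)
        masks_per_frame.append(current)
    return masks_per_frame


# Count detector calls on a 25-frame dummy video.
detector_calls: List[int] = []


def counting_detect(frame: int, prompt: str) -> List[str]:
    detector_calls.append(frame)
    return [f"{prompt}@{frame}"]


def identity_propagate(frame: int, masks: List[str]) -> List[str]:
    return list(masks)


output = track_with_sparse_detection(range(25), "dog", counting_detect, identity_propagate)
```

The obvious trade-off is that an object first appearing between two detection frames is missed until the next detection.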

Nov 09 '23 07:11 xiaobanni