Segment-and-Track-Anything
HIL full sequence prompting for drifts and ambiguities
Do you think that prompted frames (e.g., those used to fix drift) could be dynamically added to the long-term memory as reference frames? Or could this be exposed as an explicit option?
Hi, the current implementation of automatic segmentation and tracking actually does add a long-term memory frame every `sam_gap` frames. However, we only select objects from the background as newly appeared objects, and the new reference masks only include these new objects. As a result, SamTrack is able to find new objects in a video and then track them.
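For readers following along, a minimal sketch of that loop, loosely modeled on the repo's demo script; the method names (`seg`, `track`, `add_reference`, `find_new_objs`) and exact signatures are illustrative, not guaranteed to match the current code:

```python
# Sketch of the described behavior: SAM re-segments every `sam_gap` frames,
# only background (newly appeared) objects are added as new references.
for frame_idx, frame in enumerate(frames):
    if frame_idx == 0:
        pred_mask = segtracker.seg(frame)           # SAM segments everything
        segtracker.add_reference(frame, pred_mask)  # first long-term memory frame
    elif frame_idx % sam_gap == 0:
        seg_mask = segtracker.seg(frame)            # SAM re-segments the frame
        track_mask = segtracker.track(frame)        # DeAOT propagates known objects
        new_obj_mask = segtracker.find_new_objs(track_mask, seg_mask)  # background only
        pred_mask = track_mask + new_obj_mask       # tracked objects + newly found ones
        segtracker.add_reference(frame, pred_mask)  # long-term memory grows here
    else:
        pred_mask = segtracker.track(frame, update_memory=True)  # short-term update only
```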
What is the plan for the interactive version? E.g., in your basketball player demo video there are also frames that need fixing with more prompting. Are these added to the long-term memory, or do you plan to just use the new segmentation mask as the previous/short-term memory?
Adding more prompted frames to the long-term memory may help the results, so we may make it optional for users, considering that more long-term memory frames also require more computing resources.
Yes, I am guessing about the best pipeline for the interactive demo.
Suppose that at frame X you want to fix segmentation errors/DeAOT drift with extra prompting, as in your demo video where the segmentation between the finger and the ball needs fixing. What do we want to do then with this HIL "fixed segmentation"?
Some extra point prompts may be useful to indicate the foreground and background around the fingers. The fixed segmentation mask can then be propagated to later frames, as either long-term memory or short-term memory.
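For concreteness, one way the routing could look. `sam_refine`, `update_short_term`, and the `as_long_term` flag are hypothetical names for illustration, not the repo's API:

```python
# Hedged sketch: routing an HIL-corrected mask at frame X back into the tracker.
fixed_mask = sam_refine(frames[X], point_prompts)        # fg/bg clicks around the fingers

if as_long_term:
    segtracker.add_reference(frames[X], fixed_mask)      # kept as a persistent reference
else:
    segtracker.update_short_term(frames[X], fixed_mask)  # only replaces the prev-frame memory

for frame in frames[X + 1:]:                             # resume propagation from frame X+1
    pred_mask = segtracker.track(frame, update_memory=True)
```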
Yes, exactly. I am thinking about how to best use the extra HIL prompting over the sequence in the tracker's memory.
Here is another frame of your demo video: [image]
And here is another one: [image]
So I think HIL prompting and the DeAOT memories will need to cooperate to recover from propagation failures.
In the current implementation, we only use the first frame as long-term memory. For best performance, the fixed segmentation mask, as well as some intermediate frames, should be added to the long-term memory. But as I mentioned, more long-term memory will increase the computational burden; there is a trade-off between segmentation quality and memory capacity. We will consider this quality improvement after finishing the basic framework of the interactive version.
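A toy illustration of that trade-off, under the assumption of a simple eviction policy (the real memory is a set of attention keys/values, not a Python list): the first frame stays pinned as the original reference, later HIL fixes rotate through a fixed-size buffer, so the per-frame attention cost stays bounded:

```python
from collections import deque

class LongTermMemory:
    """Toy capped reference store: frame 0 is pinned, later fixes rotate.

    DeAOT attends over every stored reference at each propagation step,
    so per-frame cost grows with the number of references kept.
    """

    def __init__(self, max_extra: int = 7):
        self.first = None                      # the original reference frame
        self.extra = deque(maxlen=max_extra)   # oldest extra reference is evicted

    def add(self, embedding, mask):
        if self.first is None:
            self.first = (embedding, mask)
        else:
            self.extra.append((embedding, mask))

    def references(self):
        return [self.first] + list(self.extra)
```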
Yes, I meant it is not strictly required to go into the long-term memory; it could also just go into the short-term one. But I think you need to pay attention to the temporal coherence of the segmentation, since the SAM encoder/decoder is not "propagation aware" the way the DeAOT encoder/decoder, fine-tuned on DAVIS + YouTube-VOS, is. So a SAM HIL mask will probably need to be encoded/decoded by the DeAOT encoder/decoder whenever we have an HIL frame. If not, after X HIL frames we will have a lot of temporal incoherence in the output segmentation sequence.
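A sketch of that coherence point: the SAM mask is used only as input to DeAOT's own encoder, so everything stored in memory stays in DeAOT's propagation-aware feature space. `sam_refine`, `encode_reference`, and `propagate` are illustrative names, not the actual API:

```python
# Hedged sketch: re-encode the SAM HIL mask with DeAOT before it enters memory,
# so propagation never mixes SAM features with DeAOT features.
sam_mask = sam_refine(frames[X], point_prompts)              # HIL fix at frame X

ref_embedding = deaot.encode_reference(frames[X], sam_mask)  # DeAOT-encoded reference
memory.add(ref_embedding, sam_mask)                          # memory stays in DeAOT space

for frame in frames[X + 1:]:
    pred_mask = deaot.propagate(frame, memory)               # temporally coherent output
```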
Thank you for your advice; we will consider it while developing the interactive part.
I think it is important, as they have the same propagation errors to handle with SAM+XMem as well: https://github.com/gaomingqi/Track-Anything
Yes, we have also noticed this work, which shares a similar idea with ours. Both are good starting points for applying SAM to video segmentation, though we think AOT is better at processing multiple objects. Besides, ours supports automatic segmentation and tracking of all objects in the video.
Yes, this is true, but in any case we will never have a perfect propagation network, so handling HIL prompting over the sequence as well as possible, to recover from drifts and ambiguities using the web UI and the tracker's memory, is a required step.
It is similar to how they approach it in steps 3 and 4 (page 3): https://arxiv.org/abs/2304.11968
It would also be nice if you could compute and publish your SAM+DeAOT-L baseline on common datasets, before the interactive refinement, as in that technical report.
I saw you mentioned something in the README about the demo 6 and demo 7 videos. What is the plan?