Segment-and-Track-Anything

HIL full sequence prompting for drifts and ambiguities

Open bhack opened this issue 1 year ago • 15 comments

Do you think that prompted frames (e.g. to fix drift) could be dynamically added to the long-term memory as reference frames? Or we could add this as an explicit option.

bhack avatar Apr 21 '23 00:04 bhack

Hi, the current implementation of automatic segmentation and tracking already adds a long-term memory frame every sam_gap frames. However, we only select objects from the background as newly appeared objects, and the new reference masks only include these new objects. As a result, SamTrack is able to find new objects in a video and then track them.
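Roughly, the logic looks like the sketch below (the names are just illustrative, not the exact code in this repo): every sam_gap frames SAM segments everything, regions already covered by the tracked mask are discarded, and only the leftover regions are appended as new reference objects.

```python
import numpy as np

def track_video(frames, segmentor, tracker, sam_gap=10):
    """Illustrative sketch of automatic segmentation + tracking (hypothetical API names)."""
    # Frame 0: SAM segments everything, the tracker takes it as the first reference.
    init_mask = segmentor.segment_everything(frames[0])
    tracker.add_reference(frames[0], init_mask)

    masks = [init_mask]
    for i, frame in enumerate(frames[1:], start=1):
        pred_mask = tracker.track(frame)  # propagate the existing objects

        if i % sam_gap == 0:
            sam_mask = segmentor.segment_everything(frame)
            # Keep only regions that SAM found but the tracker treats as background:
            # these are the newly appeared objects.
            new_obj_mask = np.where(pred_mask == 0, sam_mask, 0)
            if new_obj_mask.any():
                # Offset the new IDs so they do not collide with tracked IDs,
                # and append the merged mask as a new reference (long-term memory).
                merged = np.where(new_obj_mask > 0,
                                  new_obj_mask + pred_mask.max(), pred_mask)
                tracker.add_reference(frame, merged)
                pred_mask = merged

        masks.append(pred_mask)
    return masks
```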

yoxu515 avatar Apr 22 '23 07:04 yoxu515

What is the plan for the interactive version? E.g. in your basketball player demo video there are also frames to fix with more prompting. Are these added to the long-term memory, or do you plan to just use the new segmentation mask as the previous/short-term memory?

bhack avatar Apr 22 '23 10:04 bhack

Adding more prompting to long-term memory may help the results, so we may make it optional for users, considering that more long-term memory frames also require more computing resources.
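For example, the interactive demo could expose something like this (purely illustrative names, not an existing option):

```python
# Purely illustrative: a hypothetical option block for the interactive demo.
samtrack_interactive_cfg = {
    "push_prompted_frames_to_long_term_memory": True,  # user-controlled quality/compute trade-off
    "max_long_term_frames": 8,  # each extra reference frame costs GPU memory and attention time
}
```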

yoxu515 avatar Apr 22 '23 11:04 yoxu515

Yes, I am guessing about the best pipeline in the interactive demo. Suppose that at frame X you want to fix segmentation errors/DeAOT drift with extra prompting, like in your demo video fixing the segmentation between the fingers and the ball. What do we want to do then with this HIL "fixed segmentation"?
[screenshot from the demo video]

bhack avatar Apr 22 '23 11:04 bhack

Some extra point prompts may be useful to indicate the foreground and background around the fingers. The fixed segmentation mask can then be propagated to later frames, as either long-term or short-term memory.
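With the official segment_anything predictor, the fix could look roughly like this (the frame path, click coordinates, and the final tracker call are placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint and frame; any SAM model/frame would work the same way.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
frame_x = cv2.cvtColor(cv2.imread("frame_x.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame_x)

# One positive click on the ball, two negative clicks on the fingers (placeholder coords).
point_coords = np.array([[420, 310], [400, 290], [435, 295]])
point_labels = np.array([1, 0, 0])  # 1 = foreground, 0 = background

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)
fixed_mask = masks[0]  # refined ball mask with the fingers excluded

# Then hand the fixed mask back to the tracker (hypothetical call) so it is
# propagated to later frames as long-term or short-term memory.
# tracker.add_reference(frame_x, fixed_mask)
```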

yoxu515 avatar Apr 22 '23 11:04 yoxu515

Yes, exactly. I am thinking about how to best use the extra HIL prompting in the memory over the sequence. Here is another frame from your demo video:
[screenshot from the demo video]

bhack avatar Apr 22 '23 11:04 bhack

And here is another one:
[screenshot from the demo video]

So I think HIL prompting and the DeAOT memories will need to cooperate to recover from propagation failures.

bhack avatar Apr 22 '23 12:04 bhack

In the current implementation, we only use the first frame as long-term memory. For the best performance, the fixed segmentation mask, as well as some intermediate frames, should be added to the long-term memory. But as I mentioned, more long-term memory increases the computational burden; there is a trade-off between segmentation quality and memory capacity. We will consider this quality improvement after we finish building the basic framework of the interactive version.
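One simple way to bound the cost would be a fixed budget for long-term memory, e.g. always keep the first reference frame and treat the rest as a FIFO (an illustrative sketch, not the actual implementation):

```python
from collections import deque

class LongTermMemory:
    """Illustrative fixed-budget long-term memory: the first reference frame is
    always kept, the rest behaves as a FIFO of at most `budget - 1` frames."""

    def __init__(self, budget=8):
        self.first = None
        self.rest = deque(maxlen=budget - 1)  # oldest references are evicted automatically

    def add(self, frame, mask):
        if self.first is None:
            self.first = (frame, mask)
        else:
            self.rest.append((frame, mask))

    def references(self):
        return [self.first, *self.rest] if self.first else list(self.rest)
```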

yoxu515 avatar Apr 22 '23 12:04 yoxu515

Yes, I meant that it is not strictly required to add it to the long-term memory; it could also just go into the short-term one. But I think you need to pay attention to the temporal coherence of the segmentation, since the SAM encoder/decoder is not "propagation aware" like the DeAOT encoder/decoder fine-tuned on DAVIS+YouTube. So a SAM HIL mask will probably need to be encoded/decoded by the DeAOT encoder/decoder whenever we have an HIL frame. If not, after x HIL frames, we will have a lot of temporal incoherence in the output segmentation sequence.
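Conceptually, something like this (all names hypothetical, just to illustrate the flow I mean):

```python
def apply_hil_correction(deaot, frame, sam_mask):
    """Conceptual sketch: the SAM mask from the HIL clicks is not emitted directly;
    it is pushed through the propagation-aware DeAOT encoder/decoder so the memory
    features and the output mask stay temporally coherent."""
    # Encode the corrected frame + SAM mask the same way DeAOT encodes reference
    # frames, instead of just overwriting the output mask for that frame.
    deaot.update_memory(frame, sam_mask)         # hypothetical: refresh short/long-term memory
    coherent_mask = deaot.decode_current(frame)  # hypothetical: re-decode through DeAOT
    return coherent_mask
```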

bhack avatar Apr 22 '23 12:04 bhack

Thank you for your advice; we will consider it while developing the interactive part.

yoxu515 avatar Apr 22 '23 13:04 yoxu515

I think it is important, as the same propagation errors also have to be handled with SAM+XMem: https://github.com/gaomingqi/Track-Anything

bhack avatar Apr 23 '23 10:04 bhack

Yes, we have also noticed this work, which has a similar idea to ours. Both are good starts for applying SAM to video segmentation, though we think AOT is better at processing multiple objects. Besides, ours supports automatic segmentation and tracking of all objects in the video.

yoxu515 avatar Apr 23 '23 10:04 yoxu515

Yes, this is true, but in any case we will never have a perfect propagation network, so handling HIL prompting over the sequence as well as possible, recovering drifts and ambiguities through the web UI and the tracker's memory, is a required step.

bhack avatar Apr 23 '23 10:04 bhack

It is similar to how they approach it in steps 3 and 4 (page 3): https://arxiv.org/abs/2304.11968

It would also be nice if you could compute and publish your SAM+DeAOTL baseline on a common dataset, before the interactive refinement, as in that technical report.
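Even just the region similarity J (mean IoU over objects and frames) would be a useful number to start with; a minimal sketch of how it could be computed:

```python
import numpy as np

def j_score(pred_masks, gt_masks, num_objects):
    """Mean region similarity J (IoU) over objects and frames.
    pred_masks/gt_masks: lists of HxW integer label maps, 0 = background."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        for obj_id in range(1, num_objects + 1):
            p, g = pred == obj_id, gt == obj_id
            union = np.logical_or(p, g).sum()
            if union > 0:
                ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```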

bhack avatar Apr 27 '23 08:04 bhack

I saw you mentioned something in the README markdown about the demo 6 and demo 7 videos. What is the plan?

bhack avatar May 16 '23 12:05 bhack