How to search for reappearing objects? And identify the same object in the image while doing tracking?
Hey, I love your work. I need your help as stated in the title: how can I improve SAM2 for reappearing objects and recapture the same object during video tracking using semantic information about the object? Also, I read in one of the threads that SAM2 takes the object mask and doubles the area and then uses that to track the object... Can you direct me to that thread? That would be of great help.
Or maybe you can help me figure out how to increase the search area the model uses to look for the object, beyond the temporal encodings. I am open to discussing this in more detail, if you are open to it.
Thanks for checking out the repo!
doubles the area and then uses that to track the object
It's true that the model scales up the mask internally (this happens inside the _forward_sam_heads function), but it seems to only do this so that it can later downsample that same mask (with convolutions) to match the size of the image encoding (typically 64x64). I don't think it has anything to do with where the model searches for the object or anything like that. The mask processing is complicated to follow in the original code, but the equivalent sequence of steps can be found in the muggled sam code.
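For intuition, here's a minimal sketch of that upscale-then-downsample pattern. This is not the actual _forward_sam_heads code; the channel counts, sizes and strides are made-up assumptions, purely for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative only: a stand-in for the real mask processing
low_res_mask = torch.randn(1, 1, 256, 256)  # mask logits from the mask decoder

# Scale the mask up (the 'doubling' step)...
hires_mask = F.interpolate(low_res_mask, scale_factor=2, mode="bilinear", align_corners=False)

# ...then downsample it with strided convolutions so it ends up matching
# the 64x64 sizing of the image encoding
downsampler = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(4, 16, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(16, 64, kernel_size=3, stride=2, padding=1),
)
mask_for_memory = downsampler(hires_mask)
print(mask_for_memory.shape)  # -> torch.Size([1, 64, 64, 64])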
how can I improve SAM2 for reappearing objects and recapture the same object during video tracking using semantic information about the object
This is sort of a detection task, which the SAM models aren't very good at. Generally, I think you'd need a model trained for detection according to the similarity requirements you have. The SAM models have their own sense of similarity, but it's very position-sensitive (related example in issue #13). So they can generally do fine if the object reappears near where it left, but not if there are any major jumps in position (or if some similar object reappears in the same area).
If you need more of an exact match for re-tracking, then that's a bit like a person re-identification task. There are models for that, and I'd guess some of them support arbitrary 'backbone features' which could come from the SAM image encodings in this case. If you don't need exact matching (e.g. if you can assume that a person re-entering a scene is the same as the last person that left the scene), then you might be able to use a simpler detector like YOLO to re-detect the object for tracking.
If you'd like to avoid needing an extra model, you could try something like what the semantic similarity script does. The idea being to compare the image tokens of the initial prompted object to the image tokens in the frame when the object is missing to see if there's a high similarity match somewhere. Once you find a frame with a match, you can use the matching area to form a box prompt to restart tracking. At least, that's the idea in theory, though I wouldn't expect this to work as well as a dedicated detector model.
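For reference, here's a rough sketch of that token-comparison idea. It's not the semantic similarity script itself; the helper names and the (C, H, W) token layout are assumptions for illustration:

import torch
import torch.nn.functional as F

def make_reference_token(ref_tokens_chw, ref_mask_hw):
    # Average the image tokens under the object mask -> a single (C,) vector
    masked_tokens = ref_tokens_chw.flatten(1)[:, ref_mask_hw.flatten() > 0]  # (C, num_masked)
    return masked_tokens.mean(dim=1)

def similarity_map(avg_token_c, new_tokens_chw):
    # Cosine similarity between the reference token and every token of a new frame
    c, h, w = new_tokens_chw.shape
    sim = F.cosine_similarity(new_tokens_chw.flatten(1), avg_token_c.unsqueeze(1), dim=0)
    return sim.reshape(h, w)

# Usage idea: if similarity_map(...).max() is above some threshold, take the
# matching region (scaled from token coordinates back to pixel coordinates),
# turn it into a box prompt, and restart tracking from that frame.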
I get what you are trying to say. Building on that, how do you suggest I add a memory bank that stores frames I select, which can then be used when I lose track of the object (like RAG for vision :) )... I could query the bank based on the object size or appearance I want to reference, and calculate similarity against the stored memories or something like that.
how do you suggest I add a memory bank that stores frames I select, which can then be used when I lose track of the object
The similarity score involves calculating an 'averaged token' from a reference masked object (the first part of the scoring calculation), which is something that only needs to be calculated once per reference object (e.g. prompted frame) and can then be stored for re-use. If you were to do this for several frames, you could store them in a list (acting as a sort of memory bank), and then every time you have a new frame where you want to calculate similarity, it would just mean running the similarity calculation (the second part of the scoring) against each averaged token in the list.
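As a sketch, that memory bank could just be a plain list of averaged tokens, re-using the helpers from the earlier snippet (again, these names are made up, not from the repo):

reference_bank = []  # one averaged (C,) token per stored reference frame

def add_reference(ref_tokens_chw, ref_mask_hw):
    reference_bank.append(make_reference_token(ref_tokens_chw, ref_mask_hw))

def best_match(new_tokens_chw):
    # Score the new frame against every stored reference, keep the best one
    best_score, best_rowcol = -1.0, None
    for avg_token in reference_bank:
        sim = similarity_map(avg_token, new_tokens_chw)
        score = sim.max().item()
        if score > best_score:
            best_score = score
            best_rowcol = divmod(int(sim.argmax()), sim.shape[1])  # (row, col) in the token grid
    return best_score, best_rowcol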
Alternatively, the other data type that could make sense to store are 'prompt memory encodings' (e.g. the init_mem from the segmentation example script). You can then 'query' a frame by running one step of video segmentation using each of the stored prompt memory encodings, and then checking the obj_score output, which will be larger when the tracking picks up a match between the current frame and provided memory data. Though this will be more sensitive to positioning of the object (since it's relying on the tracking capability of the model).
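A very rough sketch of that querying idea is below. The run_one_tracking_step helper is hypothetical; it just stands in for the actual step_video_masking call (see the segmentation example script for the real call and its outputs):

prompt_memory_bank = []  # e.g. init_mem-style prompt memory encodings from several frames

def query_frame(encoded_image, run_one_tracking_step, score_threshold=0.0):
    # Check whether any stored prompt memory 'recognizes' the object in this frame
    best_score, best_mems = None, None
    for prompt_mems in prompt_memory_bank:
        obj_score = run_one_tracking_step(encoded_image, prompt_mems)
        if best_score is None or obj_score > best_score:
            best_score, best_mems = obj_score, prompt_mems
    is_match = best_score is not None and best_score > score_threshold
    return is_match, best_score, best_mems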
Hey, I experimented a lot with your implementation. Giving multiple prompts in scenarios where the object was occluded, reappearing, or barely visible helped a lot... and that's even without storing any extra memory. Basically it helps the model to know the object's different appearances. It instantly gave a boost to the results. Thanks for making this MUGGLED implementation 💯.
I want to understand... when we provide the model with multiple prompts at different points in the video, does your implementation use the prompt closest to the current frame, or all of the input prompts, before starting the object tracking?
That's great that using multiple prompts helped, it's a neat feature of these models!
... does your implementation use the prompt closest to the current frame, or all of the input prompts, before starting the object tracking?
If you use the run_video.py script, then it will use all provided prompts (not just the closest one to the frame). This includes 'future' prompts (i.e. if you skip the video ahead and store a prompt, then set the video back to the start, it will use that future prompt for segmentation even at the beginning of the video). This may or may not be desirable, depending on the use case.
More generally though, this is controllable. The prompt memories are held in two lists (or 'deques', which are just fancy lists), and these are given to the model on each frame, but you can alter what's given to the model here. So for example, instead of giving the entire list of prompt memories, you could give only the memory closest to the current frame index when running step_video_masking, or only give memories that happen before the frame index (e.g. no future frames), or maybe scale the memory tokens based on how far they are in the past, etc. You can even give prompt encodings that are from a completely different video or image (which is what the video with image priors script is doing).
If you wanted to experiment with these things while still using the run_video script, then the main code where the tracking happens is here. You could read/modify the memory data (held in memory_list[objidx], which has a datatype of SAM2VideoObjectResults) before it's passed into the step_video_masking function to get different behaviors. For example:
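Here's a small sketch of the kind of filtering that could be done before the step_video_masking call. The (frame_idx, memory_data) pairing is hypothetical; you'd adapt it to however the memory data is actually stored in SAM2VideoObjectResults:

def select_memories(prompt_memories, current_frame_idx, mode="past_only"):
    # prompt_memories: list of (frame_idx, memory_data) pairs (hypothetical layout)
    if mode == "past_only":
        # Ignore 'future' prompts (frames after the current one)
        return [(f, m) for (f, m) in prompt_memories if f <= current_frame_idx]
    if mode == "closest_only":
        # Only use the single prompt nearest to the current frame
        closest = min(prompt_memories, key=lambda fm: abs(fm[0] - current_frame_idx))
        return [closest]
    return list(prompt_memories)  # default: use everything (current behavior)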
I just thought of one thing... is there a possibility of doing the following?
If the object size (in terms of total pixels) drops below some value, can the model stride or encoding or something else be changed, so that more detail is taken into consideration when segmenting or recapturing the object?
It would help the model become adaptive to changes in the object's appearance.
can the model stride or encoding be changed
It is possible to change the stride of the patch embedding, which determines the 'resolution' of the image tokens that the model works with. The model isn't trained for any other sizes, so the behavior may be less consistent:
# Modify patch embedding stride. Default is (4,4) for SAMv2
sammodel.image_encoder.patch_embed.proj.stride = (2, 2) # Increase token count
sammodel.image_encoder.patch_embed.proj.stride = (8, 8) # Decrease token count
Halving the stride (e.g. using 2x2) will double the number of tokens, so you'd get a 128x128 image encoding using default settings (normally 64x64 with original 4x4 stride). Though if this is being done during tracking, it would require restarting the tracking (i.e. re-prompting), since prior memory encodings won't match the new sizing.
Alternatively, cropping can be used to 'zoom in' on a part of the image (e.g. like using --crop with the run_image/video scripts), which may be especially good if the video resolution is higher than the 1024x1024px processing resolution. This doesn't change the token resolution, but may still require re-prompting during tracking, since the object appearance would be different in the cropped image.
Hey buddy, I was going through your code to see how you were able to extend the usable memory length. I see that when creating the buffer index list, you cap the maximum index with buffer_idx_list = [min(idx, self._max_mempos_idx) for idx in buffer_idx_list], which can never exceed the number of maskmem_tpos_enc embeddings in the original model. So essentially, for the extended frames you repeat the last embedding from the base memory position offsets as the position encoding, which basically means repeating that frame in the memory. I hope I am right so far.
My question to you is: is there a more correct way to extend the memory length for the model? What if I do something like this:
mem_embed = torch.arange(1, self.num_maskmem + 1, dtype=torch.float32).sin().repeat(64)
mem_embed = mem_embed.reshape(-1, self.num_maskmem).view(self.num_maskmem, 1, 1, -1)
self.maskmem_tpos_enc = torch.nn.Parameter(mem_embed)
Will this be correct? If not, can you suggest a better method to achieve the memory extension?
which basically means repeating that frame in the memory. I hope I am right so far
Yes, that's correct. Just to clarify, only the 'time' position encoding is repeated, not the entire frame data itself. There are only 6 position encodings in the model, so if you store more than 6 previous frame memory encodings, then all encodings after the 6th will be interpreted as happening 'at the same time' in the past (but the memory encoding will still be unique for each frame).
My question to you is: is there a more correct way to extend the memory length for the model?
The maskmem_tpos_enc is a learned parameter in the original model, so extending it would require re-training the model. Setting it to a computed value (e.g. sin, like in your example) could allow the model to support more values, but would also require re-training the model to interpret those values properly.
Alternatively, you could try interpolating the existing encodings, similar to how the image encoder interpolates its position encodings to match different input sizes. This would at least make it so that a unique position encoding can be given, no matter how many memory encodings are needed. However, since the model isn't trained for these, it could degrade performance in some cases (though the model isn't very sensitive to these encodings, so I wouldn't expect a major impact).
Yes, as you said, I ran the model with this (the sin version) and the performance degraded. Maybe I will try interpolating the position encodings, as in the image encoder.
Maybe I will try interpolating the position encodings, as in the image encoder.
I should have added: if you do interpolate (or even use a computed sin value), it's best to only change/interpolate the first 6 entries. The last one is used to encode that the memory is due to prompting, and it has a much stronger effect than the other 6 (it also isn't time-based like the other entries), so it's best to leave it as-is.
Also, the memory time position encodings are initialized with a normal distribution, so the vectors themselves may be scattered around the origin. Regular (linear) interpolation could generate new vectors very near zero, which is probably bad. So it may be better to use something like spherical interpolation instead.
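If it helps, here's a rough sketch of what that spherical interpolation could look like. It assumes the (num_maskmem, 1, 1, C) layout for maskmem_tpos_enc with the prompt encoding as the last entry (as described above), and it's untrained/untested, so treat it as a starting point rather than a drop-in fix:

import torch

def slerp(a, b, t, eps=1e-8):
    # Spherically interpolate between vectors a and b at fraction t in [0, 1]
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.acos((a_n * b_n).sum().clamp(-1 + 1e-7, 1 - 1e-7))
    sin_omega = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / sin_omega) * a + (torch.sin(t * omega) / sin_omega) * b

def extend_time_encodings(tpos_enc, new_num_time_entries):
    # tpos_enc: (num_maskmem, 1, 1, C), e.g. 6 time-based entries + 1 prompt entry
    num_time = tpos_enc.shape[0] - 1
    time_entries = tpos_enc[:num_time].reshape(num_time, -1)
    prompt_entry = tpos_enc[num_time:]  # leave the prompt encoding untouched
    new_entries = []
    for i in range(new_num_time_entries):
        # Map the new index onto the original 0..(num_time - 1) range
        x = i * (num_time - 1) / max(new_num_time_entries - 1, 1)
        lo, hi = int(x), min(int(x) + 1, num_time - 1)
        new_entries.append(slerp(time_entries[lo], time_entries[hi], x - lo))
    new_time = torch.stack(new_entries).reshape(new_num_time_entries, 1, 1, -1)
    return torch.cat([new_time, prompt_entry], dim=0)

# Usage idea (mirroring the snippet above, inside the model class):
# self.maskmem_tpos_enc = torch.nn.Parameter(extend_time_encodings(self.maskmem_tpos_enc.detach(), 12))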