
How to get mask of an object for just a segment of the video?

Open DonggeonKim1012 opened this issue 9 months ago • 7 comments

Thank you for making a GUI SAM2 for code-unfriendly users like me.

I must be missing something, because I still can't figure out how the buffers or histories work.

In a video with 10 frames, for example, I'd like to segment object 1 from frame 2 to 5, then segment object 2 from frame 7 to 9, and store the two objects' mask results.

How can I achieve this while using run_video.py?

DonggeonKim1012 avatar Mar 11 '25 06:03 DonggeonKim1012

I feel the need to elaborate,

I give a box prompt for object 1 in frame 2, enable recording and track til frame 5.

I disable recording, play the video until frame 7 where I pause.

Then I choose buffer 2, give a box prompt for object 2, enable recording, and start tracking til frame 9.

I click save buffer, and the results are object 1 from frame 2 to 9, and object 2 from frame 7 to 9.

May I ask how to make the former cover only frames 2 to 5?

DonggeonKim1012 avatar Mar 11 '25 06:03 DonggeonKim1012

Oh I should press the clear_prompt after frame 5, shouldn't I 😆

Lol thanks for the great work 👍

Aside from this, I tried implementing real-time sam myself, but I ran into some issues, notably: https://github.com/facebookresearch/sam2/issues/535

I'm thinking of decreasing the contour like muggled-sam does to avoid bleeding. Could you tell me which part of the original sam2 code I should try changing, please?

Thanks again 🚀

DonggeonKim1012 avatar Mar 11 '25 06:03 DonggeonKim1012

Glad it worked out!

The UI is pretty confusing/not documented. To clarify, the buffers just record whatever segmentation is on screen for each frame (so clearing the prompt when you want it to stop saving is the right approach). You can add more buffers by running the script with the -n flag, if you want to keep track of more than 4 objects. The 'history' is poorly named, but refers to the 'recent frame' memory data the model uses as a kind of self-prompt to keep track of the object as it changes (compared to the initial prompt memory). It's there mostly out of curiosity (to see how the model behaves with/without that data), but generally you'd want to clear it if you want to reset the prompting.

I'm thinking of decreasing the contour like muggled-sam does to avoid bleeding

I've never seen that before actually! With muggled sam, the only thing that happens on each video frame is that the frame is passed through the image encoder, and then the memory/mask/encoding steps are applied using the object memory bank data (these two lines of code). So muggled sam isn't doing anything to correct/modify masks over time to prevent this.

Does the mask bleeding happen if you only segment the green object? (i.e. don't include the orange object at all). The original SAM code has some extra steps for 'consolidating' masks, which I never understood, that may be causing this...?

The only other thing I can think of is if multiple prompts for the green object are being repeated across frames (e.g. the box prompt on frame 2 is again applied on frame 3, then frame 4, etc. instead of just once on frame 2). Once the object is turned, then if the same box prompt was encoded again, it would include a lot of the table behind the object, which might push the model towards including the table in the segmentation.
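One way to rule out repeated prompting: track which object IDs have already been prompted, and only encode a detector's box the first time an object appears. This is just a hedged sketch (the function name and the way you'd wire it into your prompting code are assumptions, not part of SAM2 or muggled-sam):

```python
# Hypothetical guard: encode a box prompt only on the frame where an object
# first appears; on later frames, rely on the model's memory to track it.
prompted_object_ids = set()

def should_prompt(object_id):
    """Return True only the first time this object ID is seen."""
    if object_id in prompted_object_ids:
        return False  # already prompted, let the memory bank do the tracking
    prompted_object_ids.add(object_id)
    return True
```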

If you did want to shrink the mask as the model is running, you could try subtracting some constant value from the SAM mask predictions (just before this line):

shrink_amount = 1.5
low_res_multimasks = low_res_multimasks - shrink_amount

However, this will affect all masks, which might hurt the orange object, and it would be hard to control/set proper values. I guess you could try to use the object IDs to control the subtraction per object, but that wouldn't generalize very well.

If you want to shrink the displayed (binary) mask, you can use morphological processing (specifically 'erosion'). The code is something like this (you can change the MORPH_ELLIPSE and even MORPH_ERODE to get different effects):

# Code used to shrink a mask
# -> Assumes you have a numpy uint8 binary mask called: mask_uint8
import cv2

shrink_amount_xy = (11, 11)
morph_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, shrink_amount_xy)
smaller_mask = cv2.morphologyEx(mask_uint8, cv2.MORPH_ERODE, morph_kernel)

Though this would just be a display effect, it wouldn't change the model behavior.

heyoeyo avatar Mar 11 '25 13:03 heyoeyo

Thank you for such a kind and detailed reply!

Bleeding doesn't occur when there's only one object. It happens specifically when there are multiple objects, and the prompt is no longer given to a previously masked object. The object doesn't necessarily have to turn for bleeding to happen. 🤔

The only other thing I can think of is if multiple prompts for the green object are being repeated across frames (e.g. the box prompt on frame 2 is again applied on frame 3, then frame 4, etc. instead of just once on frame 2).

I've actually used yolo to give prompts, so a design issue resulting in duplicate bounding boxes across frames just might be the cause. I initially thought a smaller mask would help avoid covering parts of the table, so it's a shame that the shrinking is only a display effect.

I'm still amazed that yours works for webcams too! I'm planning to alter muggled sam if the suggested fix doesn't work. My goal is to prompt a mask for a new object while keeping track of the previous prompts.

I'll keep you updated on my progress. Cheers!

DonggeonKim1012 avatar Mar 12 '25 06:03 DonggeonKim1012

Bleeding doesn't occur when there's only one object.

It may have something to do with the consolidation steps then (the _consolidate_temp_output_across_obj function in the original code, though I really can't figure out what this function is supposed to do...). There was also another issue on the SAM2 repo (issue #249) about weird stuff happening when dealing with multiple objects that might be worth checking out if you haven't already.

I've actually used yolo to give prompts

The SAM2 models are really sensitive to box prompts, and anything other than a tight-fitting box can cause problems, so it could be worth double-checking that the yolo detections aren't padded. Here's an example:

[Image: the same object segmented with a tight box prompt vs. a slightly larger box prompt]

The small box gives a good segmentation, but a slightly bigger box actually grabs the background. However, if the problem goes away when working with a single object, then I think the yolo prompts must be ok.
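If the detector boxes do turn out to be padded, one cheap sanity fix is to shrink them back inward before handing them to SAM. A minimal sketch, assuming (x1, y1, x2, y2) pixel boxes and a fixed padding amount (both assumptions about your setup):

```python
# Hypothetical helper: pull a padded detector box inward by a fixed number
# of pixels per side, so the prompt fits the object tightly.
def tighten_box(box_xyxy, pad_px=4):
    """Shrink an (x1, y1, x2, y2) box by pad_px on each side."""
    x1, y1, x2, y2 = box_xyxy
    return (x1 + pad_px, y1 + pad_px, x2 - pad_px, y2 - pad_px)
```

In practice you'd also want to clamp the result so the box doesn't invert or leave the image bounds.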

I am still amazed how your's works for webcams too! ... My goal is to prompt a mask for a new object, while keeping track of the previous prompts

Thanks!

If you end up using muggled sam, there's an example script for tracking multiple objects that start at different times. It's based on knowing the start times in advance, but if you're doing it dynamically, it would just be a matter of modifying it to check 'if yolo detected an object' then use the detection as a prompt for a new object (which is done by these lines).
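The dynamic version described above might look something like the loop below. To be clear, the detector/model calls here are placeholders, not real muggled-sam or yolo APIs; it's just meant to show where the 'new detection -> new prompt' check would slot in:

```python
# Hedged sketch of a detection-driven tracking loop. `detector` and
# `sam_model` are stand-ins for whatever APIs you're actually using.
def track_loop(frames, detector, sam_model):
    tracked_ids = set()
    results = {}
    for frame_idx, frame in enumerate(frames):
        # If the detector reports an object we haven't seen, prompt it once
        for obj_id, box in detector(frame):
            if obj_id not in tracked_ids:
                sam_model.initialize_from_box(frame_idx, obj_id, box)
                tracked_ids.add(obj_id)
        # Then run the normal per-frame tracking step for all known objects
        results[frame_idx] = {oid: sam_model.track(frame, oid) for oid in tracked_ids}
    return results
```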

heyoeyo avatar Mar 12 '25 15:03 heyoeyo

Hi! I've implemented real-time TAM by updating the original codebase, you can check it out here: https://github.com/robrosinc/REALTIME_SAM2

Turns out the mask bleeding problem was indeed due to yolo :/ Maybe it's TAM, but my implementation is still very fragile when it comes to tracking the masked objects.

I guess it is redundant work, knowing that muggled sam already supports webcam, but just wanted to give you an update on what I've been working on. Thanks again for your suggestions!

DonggeonKim1012 avatar Apr 10 '25 14:04 DonggeonKim1012

Awesome! I'll have to check it out, TAM seems impressively fast!

heyoeyo avatar Apr 10 '25 19:04 heyoeyo