Clarification on Prompting Methods, SAM2VideoBuffer Index Use, and Prompt Management Practices
Hi, and thank you again for this fantastic repo 🙏 — it's been a huge help in my work!
I've implemented a hybrid architecture inspired by your project and have a few follow-up questions on prompting behavior and memory management.
1. On the idx Field in SAM2VideoBuffer
- After some experiments and diving into the code, I haven't observed any effect of the `idx` values during tracking or memory fusion.
- Is it used implicitly in any attention mechanisms or anywhere else I may have missed?
- If not, is this behavior consistent with the original SAM2 implementation?
- In `clear()`, the `idx` buffer isn't reset by default. Is this intentional, or should it be cleared alongside memory/pointers?
2. Prompting Techniques and Alternatives
2.1 initialize_video_masking Usage
- My understanding is that after calling `initialize_video_masking()`, we must immediately call `store_prompt_result(frame_idx, memory, ptr)` to finalize prompt registration.
- Is this the recommended method for adding multiple prompts on any frame, even future frames, before calling `step_video_masking`?
2.2 Prompting Inside step_video_masking
- I created a version of `step_video_masking` that directly accepts new prompt inputs (e.g., boxes, points) during tracking, bypassing `initialize_video_masking`.
```python
def step_video_masking_with_prompts(
    sam_model: SAMV2Model,
    encoded_image_features_list: list[torch.Tensor],
    box_tlbr_norm_list: list,
    fg_xy_norm_list: list,
    bg_xy_norm_list: list,
    prompt_memory_encodings: list[torch.Tensor],
    prompt_object_pointers: list[torch.Tensor],
    previous_memory_encodings: list[torch.Tensor],
    previous_object_pointers: list[torch.Tensor],
    mask_hint: torch.Tensor | int | None = None,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:

    # Encode the new prompts for this frame (normally done once, up front,
    # by initialize_video_masking)
    encoded_prompts = sam_model.encode_prompts(box_tlbr_norm_list, fg_xy_norm_list, bg_xy_norm_list)

    with torch.inference_mode():
        lowres_imgenc, *hires_imgenc = encoded_image_features_list

        # Fuse image features with prompt memory + previous-frame memory
        memfused_encimg = sam_model.memory_fusion(
            lowres_imgenc,
            prompt_memory_encodings,
            prompt_object_pointers,
            previous_memory_encodings,
            previous_object_pointers,
        )

        patch_grid_hw = memfused_encimg.shape[2:]
        grid_posenc = sam_model.coordinate_encoder.get_grid_position_encoding(patch_grid_hw)

        mask_preds, iou_preds, obj_ptrs, obj_score = sam_model.mask_decoder(
            [memfused_encimg, *hires_imgenc],
            encoded_prompts,
            grid_posenc,
            mask_hint=mask_hint,
            blank_promptless_output=False,
        )

        best_mask_idx, best_mask_pred, _, best_obj_ptr = sam_model.mask_decoder.get_best_decoder_results(
            mask_preds,
            iou_preds,
            obj_ptrs,
            exclude_0th_index=True,
        )

        memory_encoding = sam_model.memory_encoder(lowres_imgenc, best_mask_pred, obj_score, is_prompt_encoding=True)

    return obj_score, best_mask_idx, mask_preds, memory_encoding, best_obj_ptr
```
- The results differ from those using the full prompt memory flow (i.e., storing the prompts right after initialization).
- Is this a valid use case for live tracking prompting? Would the SAM2 architecture diagram differ in such a case?
- When should one prefer each prompting method?
2.3 Incremental Prompting and Clearing
- I've experimented with the following flow:
  1. `store_prompt_result()` for the first frame
  2. `store_prompt_result()` for the current frame
  3. `step_video_masking()`
  4. `results_storage.store_result(idx, mem_enc, obj_ptr)`
  5. `results_storage.prompts_buffer.clear()`
- This yields only 2 prompts (the first frame's and the current frame's) for every processed frame.
- This helps reduce GPU memory and mitigate over-reliance on outdated prompts (e.g., in object reappearances).
- Does this resemble any intended use case? Any thoughts on best practices?
- In the original SAM2 repo, was this pattern used or recommended?
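The flow above can be sketched with plain-Python stand-ins for the tensors and storage (the `PromptsBuffer` class here is hypothetical, just mimicking the clear-after-step pattern; the real buffers hold memory encodings and object pointers):

```python
# Sketch of the incremental prompt flow: the first-frame prompt stays fixed,
# the current frame's prompt is added, one tracking step runs, then the
# prompt buffer is cleared so every step sees exactly 2 prompts.
class PromptsBuffer:
    def __init__(self):
        self.items = []

    def store(self, frame_idx, memory, pointer):
        self.items.append((frame_idx, memory, pointer))

    def clear(self):
        self.items.clear()


prompts = PromptsBuffer()
prompts_seen_per_step = []

first_prompt = (0, "mem0", "ptr0")
for frame_idx in range(1, 4):
    prompts.store(*first_prompt)                        # first-frame prompt
    prompts.store(frame_idx, f"mem{frame_idx}", "ptr")  # current-frame prompt
    prompts_seen_per_step.append(len(prompts.items))    # step_video_masking(...) would run here
    prompts.clear()                                     # drop prompts before the next step
```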
Thanks again for your detailed replies and your great work on this repo — it’s been instrumental in enabling research exploration like mine! 🙌
That's awesome to hear!
On the idx Field in SAM2VideoBuffer
You're right that the indexing isn't used by the model. It's there primarily for debugging (to verify which frames were recorded). The clearing behavior isn't intentional; there should probably be separate indexes for the memory vs. the pointers, so that the indexing could be cleared for each independently.
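For illustration, a minimal sketch (hypothetical, not the repo's actual `SAM2VideoBuffer` class) of what resetting `idx` alongside the memory/pointer buffers might look like:

```python
from collections import deque


class VideoBuffer:
    """Toy buffer sketch: frame indices are stored only for debugging,
    and clear() keeps them in sync with the memory/pointer buffers."""

    def __init__(self, max_len=6):
        self.memories = deque(maxlen=max_len)
        self.pointers = deque(maxlen=max_len)
        self.idx = deque(maxlen=max_len)  # frame indices, debugging only

    def store(self, frame_idx, memory, pointer):
        self.memories.append(memory)
        self.pointers.append(pointer)
        self.idx.append(frame_idx)

    def clear(self, clear_idx=True):
        self.memories.clear()
        self.pointers.clear()
        if clear_idx:
            self.idx.clear()  # reset indices along with the cleared buffers
```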
is this behavior consistent with the original SAM2 implementation?
The v2.0 models don't use frame index values directly. Some of the code for managing the memory bank can optionally make use of frame indexing (e.g., to exclude memory data coming from the future, or to only use the 'closest' memory encoding in time), and that code is mixed into parts of the model code even though it's somewhat separate from the model itself. In muggled sam, this sort of functionality would come from user code that determines which memory encodings are provided when running the step_video_masking function.
In the v2.1 update, there was some code added to the models which does make use of the frame indexing directly within the model, specifically in handling the object pointers. Muggled sam doesn't replicate this exactly and instead does something a bit simpler (the implementation/explanation is here) to avoid the extra data/complexity of dealing with frame indexing.
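That user-side selection logic could look something like the sketch below (a hypothetical helper, not part of the repo), which picks the memory entries closest in time to the current frame and optionally excludes future frames:

```python
def select_memories_by_time(memory_bank, current_idx, max_count=6, exclude_future=True):
    """Pick the memory entries closest in time to the current frame.

    memory_bank: dict mapping frame_idx -> memory entry (e.g. an encoding,
    or an (encoding, pointer) pair). This is a sketch of the kind of
    frame-index bookkeeping the original repo does internally, done as
    user code instead."""
    candidates = memory_bank.keys()
    if exclude_future:
        candidates = [i for i in candidates if i <= current_idx]
    closest = sorted(candidates, key=lambda i: abs(current_idx - i))[:max_count]
    # Return in chronological order
    return [memory_bank[i] for i in sorted(closest)]
```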
... we must immediately call store_prompt_result(frame_idx, memory, ptr) to finalize prompt registration.
This isn't strictly necessary, but probably the most common way to do things. The function initialize_video_masking is poorly named, a better name might be encode_prompt_memory as it doesn't have anything to do with model state or initializing anything internally. It's just that (generally) tracking requires a starting prompt to operate properly, which is what that function returns.
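The usual two-step pattern might look like the following sketch (the model call and storage class here are placeholder stand-ins with the doc's function names; the real versions take image features and return tensors):

```python
# Hypothetical stand-ins, just to show the ordering: encode the prompt
# memory first, then register it with the results storage.
def initialize_video_masking(frame_features, boxes, fg_points, bg_points):
    # In the real repo this runs the prompt encoder, mask decoder and
    # memory encoder; here we return placeholder values.
    return "prompt_memory", "object_pointer"


class ResultsStorage:
    def __init__(self):
        self.prompts = {}

    def store_prompt_result(self, frame_idx, memory, pointer):
        self.prompts[frame_idx] = (memory, pointer)


storage = ResultsStorage()
memory, pointer = initialize_video_masking("features", [], [(0.5, 0.5)], [])
storage.store_prompt_result(0, memory, pointer)
```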
Is this the recommended method for adding multiple prompts in any frame, even future frames?
It depends on what you mean by 'multiple prompts' here.
a) If it's multiple prompts due to there being multiple objects to track, then yes you'd want to generate separate prompt memory for each object.
b) If it's multiple prompts for a single object (e.g. multiple foreground/background points and/or box prompts), then it's maybe better to provide those as a single prompt (e.g. all the foreground points are given in the fg_xy_norm_list for example).
c) If it's multiple prompts on a single object, but for different times (e.g. like you mentioned, a future frame), then it would depend on whether you want the prompts from other points in time to influence the tracking, which is going to be very video dependent. Though either way you can still generate all of the prompt memory encodings initially and just not give them to the model until later frames, if that's easier to manage/implement for your use case.
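For case (b), the data layout would be along these lines (illustrative values; coordinates are normalized (x, y) pairs, matching the argument names used in the question's code):

```python
# Several prompts for ONE object, combined into a single prompt:
box_tlbr_norm_list = [((0.2, 0.1), (0.8, 0.9))]         # one box, top-left / bottom-right
fg_xy_norm_list = [(0.4, 0.5), (0.6, 0.5), (0.5, 0.3)]  # all foreground clicks together
bg_xy_norm_list = [(0.1, 0.9)]                          # background click(s)

# All of these would go into ONE encode_prompts call, producing a single
# prompt encoding for the object (rather than one encoding per click), e.g.:
# encoded_prompts = sam_model.encode_prompts(box_tlbr_norm_list, fg_xy_norm_list, bg_xy_norm_list)
```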
2.2 Prompting Inside step_video_masking - Would the SAM2 architecture diagram differ in such case?
All of the steps/components/architecture would be the same in this case. The original model does technically get 'prompts' on every frame during tracking, it's just that it's a special double 'non-point' prompt, whereas the changes you mentioned would support any kind of prompting, so there's added flexibility.
The results differ from those using the full prompt memory flow
The model is very sensitive to the double non-point prompt while tracking, so including this (in addition to other prompts) may produce more similar results. That being said, I don't think the model is trained to have prompts during tracking (judging by its sensitivity to the non-point prompts), so the results may generally be less predictable.
Is this a valid use case for live tracking prompting?
I'd say it's valid. The original implementation doesn't have any examples of doing this, but it makes sense as the right way to implement prompts during tracking.
When should one prefer each prompting method?
Having prompts available while tracking is an unusual use case I think, since it would mean that something else is already doing some sort of tracking. That leads to a question of whether the SAM tracking is needed at all. It could be better to use SAM in its image segmentation mode for example, which runs ~2x faster than video segmentation (assuming something else is generating prompts that track the object for every frame). However, I can imagine it being used to guide/correct the SAM predictions if they deviate too far from the prompt (e.g. restarting tracking if SAM predicts a mask that doesn't match well with the prompt). So I think that would be the main reason to incorporate prompts during tracking.
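The guide/correct idea could be sketched as a simple drift check (a hypothetical helper; boxes are (x1, y1, x2, y2), and the 0.5 threshold is an arbitrary example):

```python
def box_iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def should_restart_tracking(sam_mask_box, external_prompt_box, iou_threshold=0.5):
    """Restart (re-prompt) when SAM's predicted mask box drifts too far
    from the externally tracked prompt box."""
    return box_iou(sam_mask_box, external_prompt_box) < iou_threshold
```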
2.3 Incremental Prompting and Clearing - Does this resemble any intended use case?
The original model did have support for using only the 'newest' prompt memory (the function mentioned earlier), but it's disabled by default. Otherwise the 'previous frame memory' is meant to perform this role of handling changes over time, though it's making the assumption that the model predictions are always reliable.
I think your approach can make sense, especially if the prompts are considered more reliable than the model's own predictions. It could prevent the model from following/switching to the wrong object, which is a common way it fails.
Any thoughts on best practices?
It could make sense to only clear/reset the prompt memory if the object_score is below some threshold (below 3 or 4 maybe?), since that would indicate the tracking is not confident. But this would mostly be for reducing the amount of computation being done, not for the accuracy/quality of the results.
In the original SAM2 repo, was this pattern used or recommended?
No, the original repo manages the memory bank internally, so this kind of manipulation of the memory isn't really supported there. Though I think this is more of an implementation quirk than a matter of it being recommended or not.