SAM-HQ 1 and 2
Hey,
SAM-HQ just released v2 based on SAM 2.1. Their v1 implementation was already very good compared to the original SAM v1, so this might be something to support if possible.
https://github.com/SysCV/sam-hq/tree/main/sam-hq2
Thanks for the heads up!
I wasn't originally planning to support the SAM variants (there's just a lot of them), though based on that link, it seems the HQ models are extremely similar to the originals. They even load/work in this repo (just missing some of the added processing), so I might give it a go at some point in the future. At the moment I'm just doing documentation work, so I probably won't get to it for at least a month or so...
Yes, I understand that there are a lot of variants out there and that you might have to pick and choose what you support. This one is, as you say, pretty straightforward in how it works and achieves pretty fantastic results (especially SAM1 vs HQ1). I also would like to thank you for the work you've done so far, both muggled repositories are awesome and we use them almost every day for quick testing.
> I also would like to thank you for the work you've done so far, both muggled repositories are awesome and we use them almost everyday for quick testing.
Thanks, that's super cool!
> ... achieves pretty fantastic results (especially SAM1 vs HQ1)
Even just loading the weights without proper support seems to work better than SAM1 which surprised me. The changes are so minor I might have a go at setting it up for testing (at least the v1 model) within the next week or so, probably on another branch since it'll be a bit of a hack to start. I'll post back here if it gets to a usable state.
For my own future reference (or anyone finding this, wanting to make the modifications before I get to it), the changes seem to be:
- Update the image encoder to also output the 'stage 1' tokens, similar to how SAMv2 works (except only stages 1 and 4 are needed). The stage 1 tokens don't go through the `output_projection` like stage 4 though.
- Compute the `hq_features` from the stage 1 tokens. Though given the way this is used, it should be part of the image encoder instead of being in the mask decoder.
- Include the `hf_token` (why hf, not hq?) in the `cls_tokens`. It acts exactly like a 5th `cls_mask_token`.
- Compute the `upscaled_embedding_hq` from the `hq_features` (from step 2) and the `upscaled_img_tokens` (called `upscaled_embedding_sam` in samhq).
- Compute the encoded `hf_token` result. This works exactly like the existing mask tokens, even using the same MLP.
- Compute the dot product of the `hf_token` with the `upscaled_embedding_hq` (same as how masks are already calculated); this is the HQ mask result. It might need to be added to the base masks to get the final result.
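For concreteness, here's a rough numpy sketch of the tensor math in the last three steps. All the shapes are illustrative stand-ins (random arrays instead of real encoder/decoder outputs), and the additive fusion of `hq_features` with the upscaled SAM embedding is my reading of the samhq code, not a verified detail:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in shapes: batch, channels, and the upscaled (4x) spatial grid
B, C, H, W = 1, 32, 64, 64

# hq_features, computed from the stage-1 tokens (random stand-in here)
hq_features = rng.standard_normal((B, C, H, W))

# upscaled_img_tokens from the base SAM decoder
# (called upscaled_embedding_sam in samhq)
upscaled_img_tokens = rng.standard_normal((B, C, H, W))

# Assumption: the two paths are fused additively to form the HQ embedding
upscaled_embedding_hq = upscaled_img_tokens + hq_features

# Encoded hf_token after the transformer + mask-token MLP (random stand-in)
hf_token = rng.standard_normal((B, C))

# Dot product of the token against every spatial position -> HQ mask logits,
# same pattern as how the existing mask tokens produce masks
hq_mask_logits = np.einsum("bc,bchw->bhw", hf_token, upscaled_embedding_hq)

# The final result may be hq_mask_logits added to the base mask logits
print(hq_mask_logits.shape)  # (1, 64, 64)
```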
I've just pushed a new branch with partial support for SAM-HQ (v1 only so far, since hq-v2 isn't documented yet): https://github.com/heyoeyo/muggled_sam/tree/feature/samhqv1
It doesn't support their 'vit_tiny' (since that's a different structure), but the HQ versions of the base/large/huge models should work. The run_image script is also updated to better handle the 'holes' in the masks that SAM-HQ can produce.
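I haven't looked at exactly how `run_image` handles this, but for anyone curious, "holes" in a binary mask are just enclosed background regions, and one common way to fill them is a flood fill from the image border (anything background that the border can't reach must be a hole). A minimal pure-numpy sketch, with hypothetical names:

```python
import numpy as np
from collections import deque

def fill_mask_holes(mask: np.ndarray) -> np.ndarray:
    """Fill fully-enclosed background regions (holes) in a 2D boolean mask.
    Flood-fills background from the border; unreached background is a hole."""
    h, w = mask.shape
    reachable = np.zeros((h, w), dtype=bool)
    queue = deque()

    # Seed the flood fill with all background pixels on the image border
    for r in range(h):
        for c in (0, w - 1):
            if not mask[r, c] and not reachable[r, c]:
                reachable[r, c] = True
                queue.append((r, c))
    for c in range(w):
        for r in (0, h - 1):
            if not mask[r, c] and not reachable[r, c]:
                reachable[r, c] = True
                queue.append((r, c))

    # BFS through 4-connected background pixels
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] and not reachable[nr, nc]:
                reachable[nr, nc] = True
                queue.append((nr, nc))

    # Holes = background pixels the border flood fill never reached
    return mask | ~reachable

# Example: a square mask with a single-pixel hole in the middle
ring = np.zeros((7, 7), dtype=bool)
ring[1:6, 1:6] = True
ring[3, 3] = False  # the hole
filled = fill_mask_holes(ring)
print(filled[3, 3])  # True
```

(In practice `scipy.ndimage.binary_fill_holes` does the same job; the point here is just the border-flood-fill idea.)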
Nice Christmas gift, it will come in handy for quick tuning. Seems to work well so far.
Great! The v2 code is a bit of a maze to navigate, so it's hard to directly figure out what they changed/added to adapt it here. But if there's any documentation (now or maybe in the future) for the v2 HQ models, let me know!