SAM-HQ 1 and 2
Hey,
SAM-HQ just released v2 based on SAM 2.1. Their v1 implementation was already very good compared to the original SAM v1, so this might be something to support if possible.
https://github.com/SysCV/sam-hq/tree/main/sam-hq2
Thanks for the heads up!
I wasn't originally planning to support the SAM variants (there's just a lot of them), though based on that link, it seems the HQ models are extremely similar to the originals. They even load/work in this repo (just missing some of the added processing), so I might give it a go at some point in the future. At the moment I'm just doing documentation work, so I probably won't get to it for at least a month or so...
Yes, I understand that there are a lot of variants out there and that you might have to pick and choose what you support. This one is, as you say, pretty straightforward in how it works and achieves pretty fantastic results (especially SAM1 vs HQ1). I also would like to thank you for the work you've done so far, both muggled repositories are awesome and we use them almost every day for quick testing.
> I also would like to thank you for the work you've done so far, both muggled repositories are awesome and we use them almost everyday for quick testing.
Thanks, that's super cool!
> ... achieves pretty fantastic results (especially SAM1 vs HQ1)
Even just loading the weights without proper support seems to work better than SAM1 which surprised me. The changes are so minor I might have a go at setting it up for testing (at least the v1 model) within the next week or so, probably on another branch since it'll be a bit of a hack to start. I'll post back here if it gets to a usable state.
For my own future reference (or anyone finding this, wanting to make the modifications before I get to it), the changes seem to be:
- Update the image encoder to also output the 'stage 1' tokens, similar to how SAMv2 works (except only stages 1 and 4 are needed). The stage 1 tokens don't go through the `output_projection` like stage 4 though.
- Compute the `hq_features` from the stage 1 tokens. Though given the way this is used, it should be part of the image encoder instead of being in the mask decoder.
- Include the `hf_token` (why hf, not hq?) in the `cls_tokens`. It acts exactly like a 5th `cls_mask_token`.
- Compute the `upscaled_embedding_hq` from the `hq_features` (from step 2) and the `upscaled_img_tokens` (called `upscaled_embedding_sam` in samhq).
- Compute the encoded `hf_token` result. This works exactly like the existing mask tokens, even using the same MLP.
- Compute the dot product of the `hf_token` with the `upscaled_embedding_hq` (same as how masks are already calculated); this is the HQ mask result. It might need to be added to the base masks to get the final result.
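For concreteness, here's a rough numpy sketch of the tensor math in the last three steps. All the shapes are illustrative stand-ins (random arrays instead of real encoder/decoder outputs), and the additive fusion of `hq_features` with the upscaled SAM embedding is my reading of the samhq code, not a verified detail:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in shapes: batch, channels, and the upscaled (4x) spatial grid
B, C, H, W = 1, 32, 64, 64

# hq_features, computed from the stage-1 tokens (random stand-in here)
hq_features = rng.standard_normal((B, C, H, W))

# upscaled_img_tokens from the base SAM decoder
# (called upscaled_embedding_sam in samhq)
upscaled_img_tokens = rng.standard_normal((B, C, H, W))

# Assumption: the two paths are fused additively to form the HQ embedding
upscaled_embedding_hq = upscaled_img_tokens + hq_features

# Encoded hf_token after the transformer + mask-token MLP (random stand-in)
hf_token = rng.standard_normal((B, C))

# Dot product of the token against every spatial position -> HQ mask logits,
# same pattern as how the existing mask tokens produce masks
hq_mask_logits = np.einsum("bc,bchw->bhw", hf_token, upscaled_embedding_hq)

# The final result may be hq_mask_logits added to the base mask logits
print(hq_mask_logits.shape)  # (1, 64, 64)
```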
I've just pushed a new branch with partial support for SAM-HQ (v1 only so far, since hq-v2 isn't documented yet): https://github.com/heyoeyo/muggled_sam/tree/feature/samhqv1
It doesn't support their 'vit_tiny' (since that's a different structure), but the HQ versions of the base/large/huge models should work. The run_image script is also updated to better handle the 'holes' in the masks that SAM-HQ can produce.
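I haven't looked at exactly how `run_image` handles this, but for anyone curious, "holes" in a binary mask are just enclosed background regions, and one common way to fill them is a flood fill from the image border (anything background that the border can't reach must be a hole). A minimal pure-numpy sketch, with hypothetical names:

```python
import numpy as np
from collections import deque

def fill_mask_holes(mask: np.ndarray) -> np.ndarray:
    """Fill fully-enclosed background regions (holes) in a 2D boolean mask.
    Flood-fills background from the border; unreached background is a hole."""
    h, w = mask.shape
    reachable = np.zeros((h, w), dtype=bool)
    queue = deque()

    # Seed the flood fill with all background pixels on the image border
    for r in range(h):
        for c in (0, w - 1):
            if not mask[r, c] and not reachable[r, c]:
                reachable[r, c] = True
                queue.append((r, c))
    for c in range(w):
        for r in (0, h - 1):
            if not mask[r, c] and not reachable[r, c]:
                reachable[r, c] = True
                queue.append((r, c))

    # BFS through 4-connected background pixels
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] and not reachable[nr, nc]:
                reachable[nr, nc] = True
                queue.append((nr, nc))

    # Holes = background pixels the border flood fill never reached
    return mask | ~reachable

# Example: a square mask with a single-pixel hole in the middle
ring = np.zeros((7, 7), dtype=bool)
ring[1:6, 1:6] = True
ring[3, 3] = False  # the hole
filled = fill_mask_holes(ring)
print(filled[3, 3])  # True
```

(In practice `scipy.ndimage.binary_fill_holes` does the same job; the point here is just the border-flood-fill idea.)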
Nice Christmas gift, it will come in handy for quick tuning. Seems to work well so far.
Great! The v2 code is a bit of a maze to navigate, so it's hard to directly figure out what they changed/added to adapt it here. But if there's any documentation (now or maybe in the future) for the v2 HQ models, let me know!