
Detailed architecture of SAM (Segment Anything Model)

jetsonwork opened this issue · 3 comments

Hello everyone,

I am trying to draw the architecture of SAM in full detail (not just the high-level diagram from the original paper). Could you please guide me on how to proceed? If someone has already done this, I would appreciate it if you could share it with me.

jetsonwork · Feb 10 '24 09:02

For the image encoder specifically, SAM uses the 'plain' architecture described in the paper "Exploring Plain Vision Transformer Backbones for Object Detection" (ViTDet).

However, if you're looking for all the details, then the code is definitely the best place to look. All of the model code is under the segment_anything > modeling folder, and it's very well organized and straightforward compared to most other model implementations I've seen.

heyoeyo · Feb 10 '24 16:02
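For orientation, here is a minimal sketch (assuming the official segment-anything package is installed and a ViT-B checkpoint such as `sam_vit_b_01ec64.pth` has been downloaded from the repo's releases) that instantiates the model and lists its three top-level modules, which map directly onto the files in segment_anything > modeling:

```python
# A minimal sketch: build SAM and list its three main components.
# Assumes `pip install segment-anything` and a downloaded ViT-B checkpoint.
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# The three top-level modules mirror the architecture in the paper:
#   image_encoder  -> the 'plain' ViT backbone (ViTDet-style)
#   prompt_encoder -> embeds point, box, and mask prompts
#   mask_decoder   -> lightweight transformer that predicts the masks
for name, module in sam.named_children():
    print(name, "->", module.__class__.__name__)
```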

Thanks. I have read some blog posts claiming that the encoder architecture is based on the MAE ViT, as shown below. Could you please confirm?

[image: diagram of the MAE (masked autoencoder) ViT architecture]

jetsonwork · Feb 10 '24 16:02

Yes, the paper mentions that they started from a model pre-trained as an MAE, though the actual image encoder in SAM isn't an autoencoder: its output is a 64x64x256 embedding rather than a reconstruction of the 1024x1024x3 input. I assume they just removed the MAE decoder and kept the encoder.

heyoeyo · Feb 10 '24 20:02
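Those shapes are easy to verify empirically. Below is a minimal sketch (again assuming the official segment-anything package and a local ViT-B checkpoint; the filename is illustrative) that runs a dummy 1024x1024 input through the image encoder and prints the embedding shape:

```python
# A minimal sketch: confirm the image encoder's input/output shapes.
# Assumes `pip install segment-anything` and a downloaded ViT-B checkpoint.
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.eval()

# SAM's preprocessing resizes/pads images to 1024x1024 RGB.
dummy_image = torch.randn(1, 3, 1024, 1024)

with torch.no_grad():
    embedding = sam.image_encoder(dummy_image)

# Prints torch.Size([1, 256, 64, 64]): a 64x64 grid of 256-d embeddings,
# not a reconstruction of the input, so the MAE decoder is clearly gone.
print(embedding.shape)
```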