
Detailed architecture of SAM (Segment Anything Model)

jetsonwork opened this issue · 3 comments

Hello everyone,

I am trying to draw the architecture of SAM in full detail (not just the high-level diagram from the original paper). Could you please guide me on how to proceed? If someone has already done this, I would appreciate it if you could share it with me.

jetsonwork · Feb 10 '24 09:02

For the image encoder specifically, SAM uses the 'plain' architecture described in the paper "Exploring Plain Vision Transformer Backbones for Object Detection" (ViTDet).

However, if you're looking for all the details, then the code is definitely the best place to look. All of the model code is under the segment_anything > modeling folder, and it's very well organized and straightforward compared to most other model implementations I've seen.

heyoeyo · Feb 10 '24 16:02
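For orientation, here is a minimal sketch (assuming the official segment-anything package is installed and a ViT-B checkpoint such as `sam_vit_b_01ec64.pth` has been downloaded from the repo's releases) that instantiates the model and lists its three top-level modules, which map directly onto the files in segment_anything > modeling:

```python
# A minimal sketch: build SAM and list its three main components.
# Assumes `pip install segment-anything` and a downloaded ViT-B checkpoint.
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# The three top-level modules mirror the architecture in the paper:
#   image_encoder  -> the 'plain' ViT backbone (ViTDet-style)
#   prompt_encoder -> embeds point, box, and mask prompts
#   mask_decoder   -> lightweight transformer that predicts the masks
for name, module in sam.named_children():
    print(name, "->", module.__class__.__name__)
```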

Thanks. I have read some blog posts claiming that the encoder architecture is based on the MAE ViT, as shown below. Could you please confirm?

[image: diagram of the MAE (masked autoencoder) ViT architecture]

jetsonwork · Feb 10 '24 16:02

Yes, the paper mentions that they started from a model pre-trained as an MAE, though the actual image encoder in SAM isn't an autoencoder: its output is a 64x64x256 embedding rather than a reconstruction of the 1024x1024x3 input. I assume they just removed the MAE decoder and kept the encoder.

heyoeyo · Feb 10 '24 20:02
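Those shapes are easy to verify empirically. Below is a minimal sketch (again assuming the official segment-anything package and a local ViT-B checkpoint; the filename is illustrative) that runs a dummy 1024x1024 input through the image encoder and prints the embedding shape:

```python
# A minimal sketch: confirm the image encoder's input/output shapes.
# Assumes `pip install segment-anything` and a downloaded ViT-B checkpoint.
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.eval()

# SAM's preprocessing resizes/pads images to 1024x1024 RGB.
dummy_image = torch.randn(1, 3, 1024, 1024)

with torch.no_grad():
    embedding = sam.image_encoder(dummy_image)

# Prints torch.Size([1, 256, 64, 64]): a 64x64 grid of 256-d embeddings,
# not a reconstruction of the input, so the MAE decoder is clearly gone.
print(embedding.shape)
```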