segment-anything Documentation on the differences between the different models

Currently, there are three model types available:

default or vit_h: ViT-H SAM model.

vit_l: ViT-L SAM model.

vit_b: ViT-B SAM model.

I could not find any documentation on the difference between them. Is there any available? If not, could someone elaborate on that?

Many thanks.

Apr 22 '23 15:04 eduardo4jesus

There is a paper accompanying the repository. The models are the same except for the size of the neural network: B stands for "base" and is the smallest, L is "large", and H is "huge". The paper reports that the performance difference between L and H is small, so I would recommend L if your machine supports it. B is lighter still and not far behind in performance.
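To give a rough sense of how much the network size differs, here is a back-of-the-envelope sketch. The encoder widths and depths below are assumptions based on what the repository's model-building code appears to use for each model type, so please check them against the source. The `12 * d^2` per-block figure is the usual ViT approximation (attention about `4 * d^2`, MLP with ratio 4 about `8 * d^2`); it ignores embeddings, biases, the prompt encoder, and the mask decoder.

```python
# Rough image-encoder size estimate for the three SAM model types.
# ASSUMPTION: embed_dim/depth values are taken from memory of the
# repository's build code; verify them before relying on the numbers.
ENCODER_CONFIGS = {
    "vit_b": {"embed_dim": 768, "depth": 12},
    "vit_l": {"embed_dim": 1024, "depth": 24},
    "vit_h": {"embed_dim": 1280, "depth": 32},
}


def approx_encoder_params(embed_dim: int, depth: int) -> int:
    """Approximate transformer-block parameters: depth * 12 * embed_dim^2."""
    per_block = 12 * embed_dim * embed_dim  # ~4d^2 attention + ~8d^2 MLP
    return depth * per_block


for name, cfg in ENCODER_CONFIGS.items():
    millions = approx_encoder_params(**cfg) / 1e6
    print(f"{name}: ~{millions:.0f}M encoder parameters (estimate)")
```

Under these assumptions the estimate grows by roughly 3.5x from B to L and about 2x from L to H, which lines up with B being noticeably lighter than the other two.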

Apr 23 '23 14:04 franchesoni

@franchesoni, thank you so much. I opened PR #300 on this. I would appreciate your feedback.

Apr 28 '23 02:04 eduardo4jesus

I've run extensive testing on the models using a wide variety of images. Here is part of the log output from one test run on a sample image (run locally on my RTX 3080):

vit_h
Registering model... 12:48:03
Reading image... 12:48:08
Making masks... 12:48:08
Done at: 12:48:14 | Amount: 13
Making image from mask... 12:48:14
Done...? | 12:48:17 | Time taken: 10.918561458587646

vit_l
Registering model... 12:48:17
Reading image... 12:48:19
Making masks... 12:48:19
Done at: 12:48:22 | Amount: 17
Making image from mask... 12:48:22
Done...? | 12:48:25 | Time taken: 5.4358086585998535

vit_b
Registering model... 12:48:25
Reading image... 12:48:26
Making masks... 12:48:26
Done at: 12:48:28 | Amount: 10
Making image from mask... 12:48:28
Done...? | 12:48:31 | Time taken: 2.9744369983673096

I found that, on average, vit_l offers the best speed/accuracy trade-off. vit_h is the most accurate but the slowest, and vit_b is the fastest but the least accurate.
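For anyone who wants to repeat this kind of comparison, a minimal timing harness could look like the sketch below. It is not the script that produced the log above; `time_call`, `compare`, and `fake_workload` are hypothetical names, and the workloads are CPU-only stand-ins so the snippet runs without the model checkpoints. In a real test each callable would wrap the mask generation for one model type, and on a GPU you would probably need to synchronize the device before reading the clock, since CUDA calls can return before the work finishes.

```python
import time
from statistics import mean
from typing import Callable, Dict


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the mean wall-clock seconds over `repeats` calls of fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def compare(candidates: Dict[str, Callable[[], object]],
            repeats: int = 3) -> Dict[str, float]:
    """Time every named callable and return {name: mean seconds}."""
    return {name: time_call(fn, repeats) for name, fn in candidates.items()}


def fake_workload(n: int) -> Callable[[], int]:
    """Hypothetical stand-in for a model call; larger n means more work."""
    return lambda: sum(i * i for i in range(n))


# Stand-ins only: replace each value with a callable that runs one model,
# for example a lambda that generates masks for a fixed test image.
results = compare({
    "vit_b": fake_workload(10_000),
    "vit_l": fake_workload(20_000),
    "vit_h": fake_workload(40_000),
})

for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s (mean of 3 runs)")
```

Averaging over several runs and several images matters here, because the first call usually includes one-off costs such as loading the checkpoint and warming up the GPU.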

May 05 '23 01:05 pinksloyd