Degraded results from changing base model size
I have been reading your exciting results about modifying SAM2's image_size in this thread and trying to implement them in my own workflow.
For context, I am working with microscope images that are mostly of shape (1024, 1024), (1536, 1536), (2048, 2048), or (3072, 3072).
I tried implementing your strategy of changing the image_size setting (inside the model .yaml configs) and subclassing SAM2ImagePredictor as follows:
```python
from sam2.sam2_image_predictor import SAM2ImagePredictor as _SAM2ImagePredictor

class SAM2ImagePredictor(_SAM2ImagePredictor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Recompute the backbone feature-map sizes from the (modified) image_size,
        # instead of relying on the values hard-coded for 1024px inputs
        hires_size = self.model.image_size // 4
        self._bb_feat_sizes = [[hires_size // (2**k)] * 2 for k in range(3)]
```
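For completeness, this is roughly how I'm loading and prompting the model. The config filename, checkpoint path, image, and box coordinates below are just placeholders for my local setup (and the modified yaml has to be discoverable the same way the stock configs are), so treat this as a sketch rather than anything canonical:

```python
import numpy as np
from sam2.build_sam import build_sam2

# Local copy of the large-model config with `image_size` changed to 3072
sam_model = build_sam2("sam2.1_hiera_l_3072.yaml", "checkpoints/sam2.1_hiera_large.pt")
predictor = SAM2ImagePredictor(sam_model)  # the subclass defined above

# Placeholder for the actual 3072x3072 microscope image (HxWx3 uint8)
image = np.zeros((3072, 3072, 3), dtype=np.uint8)

predictor.set_image(image)
masks, scores, _ = predictor.predict(
    box=np.array([700, 450, 2400, 2600]),  # arbitrary example box (x1, y1, x2, y2)
    multimask_output=False,
)
```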
I was unable to find your pretty turtle image through reverse image search, so you'll have to excuse me swapping in a different test image.
I cropped this Wikipedia photo of an American avocet to 3072x3072 and attempted to segment it with standard SAM2 and the modified version above.
Base SAM2
Modified SAM2 with `image_size=3072`
My general finding with these modified image_size embeddings is that the average mask quality is degraded: holes appear (which I don't really see with unmodified SAM2) and there are more "islands".
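For what it's worth, this is roughly how I've been sanity-checking the "islands" and "holes" counts on the output masks. It's just my own rough check (not part of your scripts) and assumes the mask comes back as a boolean HxW array:

```python
import cv2
import numpy as np

def count_islands_and_holes(mask: np.ndarray) -> tuple[int, int]:
    """Count disconnected foreground regions ('islands') and enclosed
    background regions ('holes') in a boolean H x W mask."""
    fg = mask.astype(np.uint8)

    # Foreground components (minus 1 for the implicit background label)
    n_fg, _ = cv2.connectedComponents(fg)
    islands = n_fg - 1

    # Background components: any background region not touching the border is a hole
    n_bg, bg_labels = cv2.connectedComponents(1 - fg)
    border_labels = set(np.unique(np.concatenate([
        bg_labels[0, :], bg_labels[-1, :], bg_labels[:, 0], bg_labels[:, -1]
    ])))
    holes = sum(1 for lbl in range(1, n_bg) if lbl not in border_labels)

    return islands, holes
```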
In the interest of narrowing down why this might be, I tried the same image and settings in your cool run_image.py tool here and got similar results.
`run_image.py -m sam2.1_hiera_large.pt`
`run_image.py -m sam2.1_hiera_large.pt -b 3072`
I was hoping you might be able to shed light on whatever I'm missing in this workflow, which is giving you such great results on that turtle image.
Thanks for your work on this exploration and for taking the time to read this!
Another part of this workflow I enjoy with base SAM2 is the quick box-prompt segmentation, and that seems particularly degraded at these larger image sizes.
`run_image.py -m sam2.1_hiera_large.pt`
`run_image.py -m sam2.1_hiera_large.pt -b 3072`
Thanks for the kind words!
Actually my experience with using higher resolutions has been mostly the same. The hi-res turtle result came from an earlier version of the run_image script which only took the largest mask, and so effectively removed lots of 'islands', giving a cleaner looking result than what the script currently shows.
More often than not, the hi-res results end up with artifacts that don't appear at the original 1024px resolution, and so the quality isn't consistently better (though it can be in certain areas, like the tail feathers vs. water example you showed). There are lots of visible artifacts in the SAMv2 raw mask outputs (which can be seen in the window size experiment script), which I think might be responsible for some of these errors, along with issues caused by the windowing within the image encoder.
That being said, there are a few things that can be done to try to improve the results.
- Cropping too closely tends to introduce more artifacts. In the original paper, they mentioned excluding overly large masks from training, which may explain why this happens (not really sure). So cropping 'further out' from the object may help. Of course this trades off resolution, so not ideal...
- SAMv2 is very sensitive to box prompt positioning. Using tight fitting boxes can make a big difference.
- This isn't true for the avocet image, but sometimes other model sizes can outperform the large SAMv2 model, especially the base+ model which uses higher-resolution position encodings internally.
- The SAMv2 models have some refinement capability by treating the image as though it's a video sequence (of the same frame, repeating). You can get a sense of this from the cross-image segmentation experiment script, by loading the same image both times on startup. Here's an exaggerated example showing the input prompt/masking on the left and the segmentation result '3 frames later' as if it's video tracking:
(I'll maybe look at adding a save button to this script to be able to get the results out; a rough sketch of the repeated-frames idea is below)
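If you want to try the repeated-frames trick outside the experiment script, here's a minimal sketch using the video predictor API. It assumes you've saved the same image several times into a frames directory; the config name, paths, and box coordinates are placeholders, not anything from this repo:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Directory containing the same image saved N times as 00000.jpg, 00001.jpg, ...
frames_dir = "repeated_frames"

predictor = build_sam2_video_predictor(
    "sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode():
    state = predictor.init_state(video_path=frames_dir)

    # Box prompt on the first 'frame' (XYXY, placeholder coordinates)
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        box=np.array([700, 450, 2400, 2600], dtype=np.float32),
    )

    # Propagate through the repeated frames; keep the last mask as the 'refined' result
    final_mask = None
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        final_mask = (mask_logits[0] > 0.0).cpu().numpy()
```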
Other than that, variants of SAM may do better. SAMv1 seems to consistently handle up-scaling better than the v2 versions. It's limited by its VRAM usage, but I think the SAMv1-large model at a 1920px resolution (~24GB VRAM) will generally produce better results than the SAMv2-large model at, say, 4096px. There are also the SAMHQ models, which can be good for fine details (the v1 HQ models are supported on the feature/samhqv1 branch of this repo). It may also be worth checking out the medical-imaging variants (e.g. MedSAM2 and Medical-SAM2), since they may be better at handling microscope images.
This is extremely helpful and I really appreciate you taking the time to write it up.