Does the SAM2 implementation need any code modifications to handle different input resolutions?
Thank you so much for your great work! I would like to know if I need to modify the SAM2 code to handle different input resolutions. Which parts of the code should I modify? Also, what is the actual speed difference for different resolutions?
Thanks for checking out the repo!
With the original SAM2 code, the video predictor already supports resolution changes without any code modifications; you just need to change the image_size config parameter. The image predictor does require some (small) changes to make the bb_feat_sizes parameter dynamic. Issue #257 describes this in more detail.
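As a rough sketch of what "dynamic" could look like for the image predictor: the helper below derives the three feature-map sizes from the input resolution instead of hard-coding them. The helper name `compute_bb_feat_sizes` is hypothetical, and the strides of 4, 8 and 16 are an assumption on my part (they reproduce 256/128/64 at 1024px), so check them against the values in your copy of the code before relying on this.

```python
from typing import List, Sequence, Tuple


def compute_bb_feat_sizes(
    image_size: int,
    strides: Sequence[int] = (4, 8, 16),  # assumed backbone strides, not confirmed by the repo
) -> List[Tuple[int, int]]:
    """Return (height, width) of each backbone feature map for a square input."""
    sizes = []
    for stride in strides:
        if image_size % stride != 0:
            raise ValueError(
                f"image_size={image_size} is not divisible by stride {stride}"
            )
        side = image_size // stride
        sizes.append((side, side))
    return sizes


if __name__ == "__main__":
    # At 1024px this gives the sizes 256, 128 and 64; at 512px each is halved.
    print(compute_bb_feat_sizes(1024))
    print(compute_bb_feat_sizes(512))
```

The idea would be to assign the result to the predictor's bb_feat_sizes value wherever it is currently fixed, using the same image_size that the model config uses.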
There's about a 4x speedup going from 1024px down to 512px. Unfortunately, the SAM v2 models don't handle resolution changes very well, so using intermediate resolutions (e.g. 768px) to get a better speed/accuracy tradeoff usually isn't an option.
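One way to sanity-check the 4x figure is to compare pixel counts, since halving the side length leaves a quarter of the pixels to process. The snippet below is only back-of-the-envelope arithmetic under the assumption that runtime scales roughly with pixel count; it is not a benchmark, and `pixel_ratio` is a hypothetical helper name.

```python
def pixel_ratio(from_size: int, to_size: int) -> float:
    """Ratio of pixel counts between two square input resolutions."""
    return (from_size / to_size) ** 2


if __name__ == "__main__":
    print(pixel_ratio(1024, 512))            # 4.0
    print(round(pixel_ratio(1024, 768), 2))  # 1.78
```

Actual speedups depend on the model size and hardware, so it's worth timing your own setup.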