Non-square (rectangular) image input?
Hey, I thought you might know the answer: Is it possible to feed non-square images into prediction? If you have videos that are non-square, what is the best way to handle that?
When feeding in rectangular images, I run into an error in sam2/modeling/backbones/hieradet.py at line 277 (_get_pos_embed):
RuntimeError: The size of tensor a (158) must match the size of tensor b (152) at non-singleton dimension 2
This is presumably because the integer rounding of non-square dimensions runs into problems there. But I guess the problem might be deeper, i.e. maybe non-square input is not allowed at all?
Specifically, this happens in SAM2VideoPredictor - should have mentioned that.
For now I am just "forcing" a resize to square, and that still works because the masks get resized back at the end of the prediction. And if the aspect ratio is not too far from 1, I guess it can still be near optimal?
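To make the "force to square, then resize the mask back" idea concrete, here is a minimal pure-Python sketch. The helper names (resize_nearest, predict_on_square) and the thresholding predict_fn are hypothetical stand-ins for illustration only; they are not part of the SAM2 code base, and nearest-neighbour sampling is used here purely for simplicity, not because the predictor is known to use it.

```python
def resize_nearest(grid, out_h, out_w):
    """Resize a 2D list-of-lists to (out_h, out_w) with nearest-neighbour sampling."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def predict_on_square(frame, side, predict_fn):
    """Squash a frame to side x side, run predict_fn, then restore the original shape.

    predict_fn is a hypothetical placeholder for whatever produces a mask
    from a square frame.
    """
    orig_h, orig_w = len(frame), len(frame[0])
    square_frame = resize_nearest(frame, side, side)
    square_mask = predict_fn(square_frame)
    # The mask is stretched back, so it lines up with the original frame again.
    return resize_nearest(square_mask, orig_h, orig_w)


# Toy 4x6 "frame" with a bright vertical stripe, and a toy threshold "model".
frame = [[0, 0, 9, 9, 0, 0] for _ in range(4)]
mask = predict_on_square(
    frame,
    side=8,
    predict_fn=lambda f: [[value > 5 for value in row] for row in f],
)
```

The returned mask has the same height and width as the input frame, which is why the squashing is invisible in the final output. The distortion only affects what the model sees internally, so it should matter less the closer the aspect ratio is to 1.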
The SAMv2 models can run on non-square images/video, though the side lengths of the frames have to be multiples of 32, so it's not always possible to exactly match the input aspect ratio. The error you're getting from _get_pos_embed is likely due to the learned pos_embed_window being 8x8 and being tiled to match the '158' sizing, which isn't cleanly divisible by 8. Using heights/widths which are multiples of 32 should help avoid this error.
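As a rough illustration of the sizing rule above, the sketch below picks frame dimensions that are multiples of 32 while staying close to the original aspect ratio. The function names are hypothetical, and the max_side=1024 default is an assumption rather than something stated in this thread. The tiled_length helper assumes the 8x8 window is repeated a whole number of times (floor division), which would be consistent with the 158 vs 152 mismatch in the error message.

```python
def round_to_multiple(value, multiple=32):
    """Round value to the nearest positive multiple of `multiple`."""
    return max(multiple, int(round(value / multiple)) * multiple)


def fit_frame_size(height, width, max_side=1024, multiple=32):
    """Scale (height, width) so the longer side is about max_side,
    then snap both sides to a multiple of `multiple`.

    max_side=1024 is an assumed default, not taken from the thread.
    """
    scale = max_side / max(height, width)
    return (
        round_to_multiple(height * scale, multiple),
        round_to_multiple(width * scale, multiple),
    )


def tiled_length(size, window=8):
    """Length covered when a window of `window` is repeated size // window times
    (assumed tiling behaviour, for illustration)."""
    return (size // window) * window


new_h, new_w = fit_frame_size(720, 1280)
covered = tiled_length(158)  # falls short of 158 because 158 is not divisible by 8
```

Because the sides are snapped independently, the resulting aspect ratio can differ slightly from the input, which is the "not always possible to exactly match" caveat mentioned above.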
That being said, the original code base is hard-coded around the use of square frames, so it would require substantial modifications to properly support non-square frames. As a starting point, you could try searching for the term image_size in the sam2_video_predictor.py and sam2_base.py scripts to see how/where the single size value is used, which would need to be updated to support independent height & width values.
The v2 models are also prone to artifacts/segmentation errors when using non-square frames, like the example below:
These sorts of errors don't always happen, but the potential for them, along with the work needed to modify the code base, is probably a good enough reason to stick with square frame sizes.
Great, thanks for the clear answer, @heyoeyo! I figured out the "divisibility by 8" issue, but didn't get much further. I could dive deeper into it like you suggest. However, it seems that the square resizing "trick" does work well and gives results that are at least usable.