muggled_sam

Non-square (rectangular) image input?

Open horsto opened this issue 10 months ago • 4 comments

Hey, I thought you might know the answer: is it possible to feed non-square images into prediction? If you have non-square videos, what is the best way to handle them?

When feeding in rectangular images, I run into an error in sam2/modeling/backbones/hieradet.py at line 277 (_get_pos_embed), with

RuntimeError: The size of tensor a (158) must match the size of tensor b (152) at non-singleton dimension 2

presumably because the integer rounding of non-square dimensions runs into problems there. But I guess the problem might be deeper, i.e. maybe non-square input is not allowed at all?

horsto, Feb 16 '25 02:02

Specifically, this happens in SAM2VideoPredictor - should have mentioned that.

horsto, Feb 16 '25 02:02

For now I am just "forcing" a resize to square... and that still works because the masks get resized at the end of the prediction. And if the aspect ratio is not too far from 1 I guess it can still be near optimal?
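A minimal sketch of that workaround, kept dependency-free so it runs anywhere: the frame is resized to a square, a mask is predicted on the square, and the mask is resized back to the original frame size. The names `resize_nearest`, `predict_on_square` and the `predict_mask` callable are hypothetical stand-ins, not part of SAM2; in practice an image library would do the resizing.

```python
def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2D list (rows of pixel or mask values)."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def predict_on_square(frame, square_side, predict_mask):
    """Resize the frame to a square, run the (hypothetical) predict_mask
    callable on it, then resize the returned mask to the original size."""
    orig_h, orig_w = len(frame), len(frame[0])
    square = resize_nearest(frame, square_side, square_side)
    square_mask = predict_mask(square)
    return resize_nearest(square_mask, orig_h, orig_w)


if __name__ == "__main__":
    # Toy 4x6 "frame"; the stand-in predictor marks every non-zero pixel.
    frame = [[0, 0, 1, 1, 0, 0] for _ in range(4)]
    threshold = lambda img: [[int(v > 0) for v in row] for row in img]
    mask = predict_on_square(frame, 8, threshold)
    print(len(mask), len(mask[0]))  # 4 6
```

The returned mask always has the original frame's shape, but the two resizes are not lossless, so mask edges can shift slightly; the further the aspect ratio is from 1, the more the square version distorts the content.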

horsto, Feb 16 '25 15:02

The SAMv2 models can run on non-square images/video, though the side lengths of the frames have to be multiples of 32, so it's not always possible to exactly match the input aspect ratio. The error you're getting from _get_pos_embed is likely due to the learned pos_embed_window being 8x8: the code tries to tile it to match the '158' sizing, which isn't cleanly divisible by 8. Using heights/widths which are multiples of 32 should help avoid this error.
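To make the arithmetic concrete, here is a small sketch. It assumes the window is repeated a whole number of times (grid length floor-divided by 8), which is consistent with the 158 vs 152 in the error message. The helper names and the default `max_side=1024` are illustrative assumptions, not values taken from the SAM2 code.

```python
WINDOW = 8     # side length of the learned pos_embed_window (from the comment above)
MULTIPLE = 32  # required side-length multiple (from the comment above)


def tiled_length(grid_len, window=WINDOW):
    """Length covered when the window is repeated grid_len // window times
    (the floor division is an assumption about how the tiling is done)."""
    return (grid_len // window) * window


def snap_to_multiple(height, width, max_side=1024, multiple=MULTIPLE):
    """Scale (height, width) so the longer side is about max_side, rounding
    each side to the nearest multiple (and never below one multiple)."""
    longest = max(height, width)

    def snap(side):
        steps = round(side * max_side / (longest * multiple))
        return max(1, steps) * multiple

    return snap(height), snap(width)


if __name__ == "__main__":
    print(tiled_length(158))             # 152, so 158 and 152 cannot be added
    print(tiled_length(160))             # 160, sizes match
    print(snap_to_multiple(1080, 1920))  # (576, 1024)
```

A 1080x1920 frame happens to map exactly onto 576x1024, but most aspect ratios only get an approximate match, which is the limitation mentioned above.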

That being said, the original code base is hard-coded around the use of square frames, so it would require substantial modifications to properly support non-square frames. As a starting point, you could try searching for the term image_size in the sam2_video_predictor.py and sam2_base.py scripts to see how/where the single size value is used, which would need to be updated to support independent height & width values.
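As a rough aid for that search, the sketch below lists every line mentioning image_size in a set of files. The function name `find_term` is hypothetical, and the file paths are passed in by the caller, since their exact location depends on how the repo is checked out.

```python
import sys
from pathlib import Path


def find_term(paths, term="image_size"):
    """Yield (path, line_number, stripped_line) for each line containing term."""
    for path in map(Path, paths):
        with path.open(encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                if term in line:
                    yield str(path), number, line.strip()


if __name__ == "__main__":
    # Paths are supplied on the command line, e.g. the two scripts named above.
    for path, number, text in find_term(sys.argv[1:]):
        print(f"{path}:{number}: {text}")
```

Run it with the two scripts as arguments (for example `python find_term.py path/to/sam2_video_predictor.py path/to/sam2_base.py`, where the script name and paths are placeholders) to get a checklist of places where a single size value is assumed.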

The v2 models are also prone to artifacts/segmentation errors when using non-square frames, like the example below:

[Image: example of segmentation artifacts on a non-square frame]

These sorts of errors don't always happen, but the potential for them, along with the work needed to modify the code base, is probably a good enough reason to stick with square frame sizes.

heyoeyo, Feb 18 '25 17:02

Great, thanks for the clear answer, @heyoeyo! I figured out the "divisibility by 8" issue, but didn't get much farther. I could dive deeper into it like you suggest. However, it seems that the square resizing "trick" does work well and gives results that are usable at least.

horsto, Feb 18 '25 18:02