SimMIM
Allow arbitrary-sized images by dynamic masking: upstream changes from Swin-Transformer-Object-Detection / SOLQ
Hi!
To combine a Swin Transformer backbone with the Deformable DETR detector, SOLQ made some changes to swin_transformer.py that compute the padding mask dynamically and allow arbitrary-sized input images (I think this is supported for relative positional encoding only).
Similar edits were made by your colleagues in https://github.com/SwinTransformer/Swin-Transformer-Object-Detection/blob/master/mmdet/models/backbones/swin_transformer.py
If this interests you, maybe you could import those edits from SOLQ / Swin-Transformer-Object-Detection or implement similar ones. This would make it simpler to experiment with SimMIM checkpoints / backbone code in an object detection context and to make sure that checkpoints load correctly.
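For reference, a minimal sketch of the two pieces such a change involves, modeled on the logic in the Swin-Transformer-Object-Detection backbone (the names pad_to_window_multiple and shifted_window_attn_mask are illustrative, not the actual names used in either repo):

```python
import torch
import torch.nn.functional as F


def window_partition(x, window_size):
    """Split a (B, H, W, C) tensor into non-overlapping windows,
    returning (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)


def pad_to_window_multiple(x, window_size):
    """Right/bottom-pad a (B, H, W, C) feature map so H and W become
    multiples of window_size, computed per input at forward time
    rather than fixed at construction time."""
    _, H, W, _ = x.shape
    pad_b = (window_size - H % window_size) % window_size
    pad_r = (window_size - W % window_size) % window_size
    x = F.pad(x, (0, 0, 0, pad_r, 0, pad_b))  # pad order: (C, W, H) dims
    return x, H + pad_b, W + pad_r


def shifted_window_attn_mask(Hp, Wp, window_size, shift_size, device):
    """Build the SW-MSA attention mask for the *current* padded
    resolution (Hp, Wp) instead of precomputing it for one fixed
    image size; together with the dynamic padding above, this is
    what lets the backbone accept arbitrary-sized inputs."""
    img_mask = torch.zeros((1, Hp, Wp, 1), device=device)
    slices = (slice(0, -window_size),
              slice(-window_size, -shift_size),
              slice(-shift_size, None))
    cnt = 0
    for h in slices:
        for w in slices:
            img_mask[:, h, w, :] = cnt
            cnt += 1
    mask_windows = window_partition(img_mask, window_size).reshape(-1, window_size * window_size)
    attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
    return attn_mask.masked_fill(attn_mask != 0, -100.0).masked_fill(attn_mask == 0, 0.0)
```

The point of both helpers is that the pad sizes and the attention mask are derived from the input's actual H and W at forward time, so nothing in the backbone bakes in a fixed image resolution (only the relative position bias tables, which depend on window_size alone, stay precomputed).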
Thanks for your suggestion. We will try to add this support.
Hi there, have you ever tried to use ViTs in object detection tasks? I'd like to ask whether the pipeline for using ViTs as the backbone of object detection algorithms, e.g. Mask R-CNN, is the same as for CNNs: if the images in a batch have different resolutions, do we pad them to an equal size? Should we resize the images so that the width and height are divisible by the patch size, like 16 or 14, or force them to be squares like 1024*1024? I am curious about how to preprocess the input images before feeding them to the ViT encoder when it is used in object detection, since we can't simply random-resize-and-crop them to a predefined size as in image classification. Appreciate your reply, thanks!
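For context, the preprocessing described in the question, padding a variable-resolution batch to a common size divisible by the patch size or backbone stride, looks roughly like this in a typical detection pipeline. A hedged sketch modeled on detectron2's ImageList.from_tensors / mmdetection's Pad transform, with batch_images and size_divisibility as illustrative names rather than a specific library's API:

```python
import torch


def batch_images(images, size_divisibility=32):
    """Pad a list of (C, H, W) tensors with different resolutions into a
    single (B, C, H_max, W_max) batch, rounding H_max / W_max up to a
    multiple of size_divisibility (the patch size or the backbone's
    total stride). Returns the batch plus a boolean mask marking
    padded pixels, as Deformable DETR-style models expect."""
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    max_h = (max_h + size_divisibility - 1) // size_divisibility * size_divisibility
    max_w = (max_w + size_divisibility - 1) // size_divisibility * size_divisibility

    batch = images[0].new_zeros(len(images), images[0].shape[0], max_h, max_w)
    mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)  # True = padding
    for i, img in enumerate(images):
        _, h, w = img.shape
        batch[i, :, :h, :w].copy_(img)
        mask[i, :h, :w] = False
    return batch, mask
```

Images keep their aspect ratio after the usual resize augmentation and are only zero-padded on the bottom/right, so no forced square resize is needed.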