Enable dynamic resolution input for Swin Transformer
What does this PR do?
This PR adds the `interpolate_pos_encoding` feature to the Swin Transformer model, allowing it to handle input images of different resolutions while leveraging existing pre-trained checkpoints.
Addresses #30579.
Changes
The following changes have been made to enable dynamic resolution input for the Swin Transformer model:
- Added the `interpolate_pos_encoding` method to the `SwinModel` class. This method takes the pre-trained position embeddings and the target height and width as inputs and returns the interpolated position embeddings (a minimal sketch of the idea is shown after this list).
- Modified the `forward` method of the `SwinModel` class to accept an `interpolate_pos_encoding` argument. When set to `True`, the model will interpolate the position embeddings based on the input image size.
- Added a test case in `test_modeling_swin.py` to verify that the model can correctly interpolate position embeddings for input images of different sizes.
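For reference, the interpolation follows the same pattern as the existing ViT implementation: reshape the pre-trained patch position embeddings to their original 2D grid, resample them to the new grid with bicubic interpolation, and flatten back. A minimal sketch of the idea (names and shapes here are illustrative, not the exact `SwinModel` code):

```python
import torch
import torch.nn as nn


def interpolate_pos_encoding(
    position_embeddings: torch.Tensor,  # (1, num_patches, dim), learned at the pre-training resolution
    height: int,
    width: int,
    patch_size: int,
) -> torch.Tensor:
    """Sketch: resize pre-trained patch position embeddings to a new patch grid."""
    num_positions, dim = position_embeddings.shape[1], position_embeddings.shape[2]

    # Grid the embeddings were trained for, and the grid the new image needs.
    orig_grid = int(num_positions**0.5)
    new_height, new_width = height // patch_size, width // patch_size
    if (new_height, new_width) == (orig_grid, orig_grid):
        return position_embeddings

    # Reshape to a 2D grid, resample with bicubic interpolation, then flatten back.
    pos = position_embeddings.reshape(1, orig_grid, orig_grid, dim).permute(0, 3, 1, 2)
    pos = nn.functional.interpolate(pos, size=(new_height, new_width), mode="bicubic", align_corners=False)
    return pos.permute(0, 2, 3, 1).reshape(1, new_height * new_width, dim)
```

In the actual change this logic lives in the embeddings module and is only used when `interpolate_pos_encoding=True` is passed to `forward`.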
Who can review?
@amyeroberts
Any suggestions are more than welcome! This is my first open-source contribution!
Hi @amyeroberts, thank you, I've committed your suggestions! I will add the feature to the copies of Swin by running `make fix-copies`. I will have the update ready by the weekend.
Hi,
I think Swin Transformer already works on any resolution.
Hi @NielsRogge, yes, it looks like dynamic resolution is already supported in Swin via the `maybe_pad` method. Correct me if I am wrong, but it looks like the pixel values are padded to achieve dynamic resolution, whereas ViT interpolates the position encodings to achieve the same. What should be the next step, @amyeroberts? Do we still want to add `interpolate_pos_encoding` to Swin? I think I can still write the tests for dynamic resolution input.
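To make the distinction concrete, here is a rough sketch of the padding idea (not the exact `maybe_pad` helpers from the modeling file, which also handle window-size padding): the pixel values are right/bottom-padded so the spatial dimensions become multiples of the patch size, while the position handling itself is left untouched.

```python
import torch.nn.functional as F


def maybe_pad(pixel_values, patch_size):
    # pixel_values: (batch, channels, height, width)
    _, _, height, width = pixel_values.shape
    pad_height = (patch_size - height % patch_size) % patch_size
    pad_width = (patch_size - width % patch_size) % patch_size
    # F.pad padding order for the last two dims: (left, right, top, bottom)
    return F.pad(pixel_values, (0, pad_width, 0, pad_height))
```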
@the-neural-networker I checked, and currently any height and width are supported as long as they are divisible by 32. I assume you can continue with supporting interpolation of position embeddings, since that would allow any resolution (this needs to be tested, of course).
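For example, something along these lines already runs on the current code (the checkpoint here is just an illustration):

```python
import torch
from transformers import SwinModel

model = SwinModel.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# 384 and 256 are both divisible by 32, so this works without any interpolation.
pixel_values = torch.randn(1, 3, 384, 256)
with torch.no_grad():
    outputs = model(pixel_values)

print(outputs.last_hidden_state.shape)
```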
@amyeroberts, @NielsRogge I've been working on implementing this for the Swin Transformer variants and writing their corresponding tests. I noticed that for `MaskFormerSwin`, there doesn't seem to be a pretrained checkpoint available, and it appears that an integration test hasn't been written for it yet (similar to `DonutSwin`).
Since `MaskFormerSwin` is primarily used as a backbone, I'm wondering about the best approach for writing its integration test. Currently, the integration tests for ViT and other models use pretrained checkpoints and their associated image processors to test the interpolation functionality.
Given the lack of a pretrained checkpoint for `MaskFormerSwin`, could you provide some guidance on how to write its integration test properly?
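For context, the test pattern I'm referring to looks roughly like this (a sketch; the checkpoint, dummy image, and expected values are illustrative rather than taken from the actual test files):

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, SwinModel


def test_inference_interpolate_pos_encoding():
    checkpoint = "microsoft/swin-tiny-patch4-window7-224"
    image_processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = SwinModel.from_pretrained(checkpoint)

    # Process a dummy image at a resolution larger than the 224x224 pre-training size.
    image = np.random.randint(0, 256, (480, 480, 3), dtype=np.uint8)
    inputs = image_processor(images=image, size={"height": 480, "width": 480}, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs, interpolate_pos_encoding=True)

    # 480 / 4 = 120 patches per side, halved at each of the 3 merging stages -> a 15x15 token grid.
    assert outputs.last_hidden_state.shape[1] == 15 * 15
```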
@the-neural-networker In this case, when there aren't existing checkpoints or integration tests, it's OK for you to skip adding tests for the interpolate method.
Hi @amyeroberts,
To clarify, should I:
1. Add the interpolate method to `DonutSwin` and `MaskFormerSwin`, but omit the tests? Or...
2. Leave `DonutSwin` and `MaskFormerSwin` unchanged?
@the-neural-networker You can do either. I'd suggest 2, as the feature is unlikely to be used in the case of backbones. We can add it later if it's requested. If you find that you need to add it because of `# Copied from` comments, then you can add it there without adding integration tests for them.
Thank you for the clarification!
It looks like interpolation needs to be added to `DonutSwin` and `MaskFormerSwin` because of `# Copied from` comments. So, I will be adding that, but omitting their tests.
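For reference, the mechanism that forces this: the variant modules carry comments like the one below, and `make fix-copies` keeps their bodies in sync with the Swin originals, so the new method has to exist in the copies too (the target path shown here is just an example of the pattern):

```python
# Copied from transformers.models.swin.modeling_swin.SwinEmbeddings.interpolate_pos_encoding
def interpolate_pos_encoding(self, embeddings, height, width):
    ...
```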
Thank you for the thorough review and kind words! I think that is all the changes, so feel free to merge whenever you are ready!