transformers icon indicating copy to clipboard operation
transformers copied to clipboard

Enable dynamic resolution input for Swin Transformer

Open the-neural-networker opened this issue 9 months ago • 5 comments

What does this PR do?

This PR adds the interpolate_pos_encoding feature to the Swin Transformer model, allowing it to handle input images of different resolutions while leveraging existing pre-trained checkpoints.

Addresses #30579.

Changes

The following changes have been made to enable dynamic resolution input for the Swin Transformer model:

  1. Added the interpolate_pos_encoding method to the SwinModel class. This method takes the pre-trained position embeddings and the target height and width as inputs and returns the interpolated position embeddings.

  2. Modified the forward method of the SwinModel class to accept an interpolate_pos_encoding argument. When set to True, the model will interpolate the position embeddings based on the input image size.

  3. Added a test case in test_modeling_swin.py to verify that the model can correctly interpolate position embeddings for input images of different sizes.

Who can review?

@amyeroberts

the-neural-networker avatar May 05 '24 04:05 the-neural-networker

Any suggestions are more than welcome! This is my first open-source contribution!

the-neural-networker avatar May 05 '24 04:05 the-neural-networker

Hi @amyeroberts, thank you and I committed your suggestions! I will add the feature to the copies of Swin by running make fix-copies. I will have the update by the weekend.

the-neural-networker avatar May 09 '24 21:05 the-neural-networker

Hi,

I think Swin Transformer already works on any resolution.

NielsRogge avatar May 10 '24 10:05 NielsRogge

Hi @NielsRogge, yes it looks like dynamic resolution is supported by Swin using the maybe_pad method. Correct me if I am wrong, it looks like the pixel values are padded to achieve dynamic resolution where as with vit the position encodings are interpolated to achieve the same. What should be the next step @amyeroberts? Do we want to still add interpolate_pos_encoding in swin? I think I can still write the tests for dynamic resolution input.

the-neural-networker avatar May 10 '24 20:05 the-neural-networker

@the-neural-networker I checked it, so currently any height and width are supported if divisible by 32. I assume you can continue with supporting interpolation of position embeddings, since that would allow any resolution (needs to be tested of course).

NielsRogge avatar May 11 '24 12:05 NielsRogge

@amyeroberts, @NielsRogge I've been working on implementing the Swin transformer variants and writing their corresponding tests. I noticed that for MaskFormerSwin, there doesn't seem to be a pretrained checkpoint available, and it appears that an integration test hasn't been written for it yet (similar to DonutSwin).

Since MaskFormerSwin is primarily used as a backbone, I'm wondering about the best approach for writing its integration test. Currently, the integration tests for ViT and other models utilize pretrained checkpoints and their associated image processors to test the interpolation functionality.

Given the lack of a pretrained checkpoint for MaskFormerSwin, could you provide some guidance on how to properly test its integration?

See test_modeling_maskformer_swin.py.

the-neural-networker avatar May 13 '24 19:05 the-neural-networker

@the-neural-networker In this case, when there aren't existing checkpoints or integration tests, it's OK for you to skip adding tests for the interpolate method

amyeroberts avatar May 15 '24 15:05 amyeroberts

Hi @amyeroberts,

To clarify, should I:

  1. Add the interpolate method to DonutSwin and MaskFormerSwin, but omit the tests? Or...
  2. Leave DonutSwin and MaskFormerSwin unchanged?

the-neural-networker avatar May 16 '24 18:05 the-neural-networker

@the-neural-networker You can do either. I'd suggest 2, as the feature is unlikely to be used in the case of backbones. We can add it later if it's requested. If you find that you need to add it because of # Copied from comments, then you can add and not add integration tests for them

amyeroberts avatar May 16 '24 18:05 amyeroberts

Thank you for the clarification!

the-neural-networker avatar May 16 '24 18:05 the-neural-networker

It looks like interpolation needs to be added to DonutSwin and MaskFormerSwin because of #Copied from comments. So, I will be adding that, but omitting their tests.

the-neural-networker avatar May 16 '24 19:05 the-neural-networker

Thank you for the thorough review and kind words! I think that is all the changes, so feel free to merge whenever you are ready!

the-neural-networker avatar May 17 '24 16:05 the-neural-networker