Swin-Transformer
How to fine-tune on a larger input image size?
If I pre-train Swin-T at a 224 input image size, how can I fine-tune it to obtain a Swin-T for a 320 input image size? In your paper, you state that the 384^2 input models are obtained by fine-tuning:
For other resolutions such as 384^2, we fine-tune the models trained at 224^2 resolution, instead of training from scratch, to reduce GPU consumption.
However, in this implementation, fine-tuning is hard to do because of the parameters relative_position_bias_table and attn_mask: if I change the input size, these two tensors change as well.
So how can I modify the code to support fine-tuning on a larger input image size? Thanks!!
Have you been able to fine-tune the model using image sizes other than 224?
If I understand correctly, relative_position_bias_table only depends on the window size (7 for 224 in Swin-T), so relative_position_bias_table does not need to change. attn_mask is not a learnable parameter; it can be computed in the forward pass instead of at initialization.
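To illustrate that point, here is a minimal sketch of rebuilding the shifted-window attention mask on the fly from the current feature-map resolution rather than caching it at initialization. The helper name and arguments are illustrative; the masking logic follows the shifted-window scheme described in the paper and assumes H and W are divisible by the window size.

```python
import torch

def build_attn_mask(H, W, window_size, shift_size, device):
    # Hypothetical helper: rebuild the shifted-window attention mask for the
    # current feature-map resolution (H, W) instead of caching it at init.
    if shift_size == 0:
        return None
    # Label each region created by the cyclic shift with a distinct index.
    img_mask = torch.zeros((1, H, W, 1), device=device)
    h_slices = (slice(0, -window_size),
                slice(-window_size, -shift_size),
                slice(-shift_size, None))
    w_slices = (slice(0, -window_size),
                slice(-window_size, -shift_size),
                slice(-shift_size, None))
    cnt = 0
    for h in h_slices:
        for w in w_slices:
            img_mask[:, h, w, :] = cnt
            cnt += 1
    # Partition into non-overlapping windows: (num_windows, window_size * window_size).
    img_mask = img_mask.view(1, H // window_size, window_size,
                             W // window_size, window_size, 1)
    mask_windows = img_mask.permute(0, 1, 3, 2, 4, 5).reshape(
        -1, window_size * window_size)
    # Tokens coming from different regions must not attend to each other.
    attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
    attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0))
    attn_mask = attn_mask.masked_fill(attn_mask == 0, float(0.0))
    return attn_mask
```

Calling this in forward with the actual H and W lets the same block handle any input resolution without storing a fixed-size buffer.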
We use bi-cubic interpolation on relative_position_bias_table to deal with larger window size. Will provide related code soon.
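A minimal sketch of that interpolation, assuming the table is stored as in the released checkpoints with shape ((2*S1-1)^2, num_heads) for source window size S1 and is resized for target window size S2; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def resize_rel_pos_bias_table(table, src_window, dst_window):
    # table: ((2 * src_window - 1) ** 2, num_heads) from the pretrained checkpoint.
    src_size = 2 * src_window - 1
    dst_size = 2 * dst_window - 1
    num_heads = table.shape[1]
    # Reshape to an image-like layout: (1, num_heads, src_size, src_size).
    table_2d = table.permute(1, 0).reshape(1, num_heads, src_size, src_size)
    # Bi-cubic interpolation to the relative-coordinate grid of the new window size.
    table_2d = F.interpolate(table_2d, size=(dst_size, dst_size),
                             mode='bicubic', align_corners=False)
    # Back to the checkpoint layout: (dst_size * dst_size, num_heads).
    return table_2d.reshape(num_heads, dst_size * dst_size).permute(1, 0)
```

For example, going from window 7 (a 13x13 grid per head) to window 12 (23x23 per head), the resized table can then be loaded into the fine-tuning model in place of the original parameter.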
Swin V2 has a better approach for dealing with different window sizes.
Instructions and configs for fine-tuning on higher resolution can be found here: https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md#fine-tuning-on-higher-resolution
Help me! Why do I get this? KeyError: 'backbone.stages.0.blocks.0.attn.w_msa.relative_position_bias_table'