Swin-Transformer
How to fine-tune on a larger input image size?
If I pre-train Swin-T at a 224 input image size, how can I fine-tune it to obtain a Swin-T for a 320 input image size? In your paper, you state that the 384^2 input models are obtained by fine-tuning:
For other resolutions such as 384^2, we fine-tune the models trained at 224^2 resolution, instead of training from scratch, to reduce GPU consumption.
However, in this implementation, fine-tuning is hard to do because of the parameters relative_position_bias_table and attn_mask: if I change the input size, these two tensors change as well.
So how can I modify the code to support fine-tuning on a larger input image size? Thanks!!
Have you been able to fine-tune the model using image sizes other than 224?
If I understand correctly, relative_position_bias_table only depends on the window size (7 for 224 in Swin-T), so relative_position_bias_table does not need to change. attn_mask is not a learnable parameter; it can be computed in the forward pass instead of at initialization.
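To illustrate that point, here is a minimal sketch of rebuilding the shifted-window attention mask on the fly from the current feature-map resolution rather than caching it at initialization. The helper name and arguments are illustrative; the masking logic follows the shifted-window scheme described in the paper and assumes H and W are divisible by the window size.

```python
import torch

def build_attn_mask(H, W, window_size, shift_size, device):
    # Hypothetical helper: rebuild the shifted-window attention mask for the
    # current feature-map resolution (H, W) instead of caching it at init.
    if shift_size == 0:
        return None
    # Label each region created by the cyclic shift with a distinct index.
    img_mask = torch.zeros((1, H, W, 1), device=device)
    h_slices = (slice(0, -window_size),
                slice(-window_size, -shift_size),
                slice(-shift_size, None))
    w_slices = (slice(0, -window_size),
                slice(-window_size, -shift_size),
                slice(-shift_size, None))
    cnt = 0
    for h in h_slices:
        for w in w_slices:
            img_mask[:, h, w, :] = cnt
            cnt += 1
    # Partition into non-overlapping windows: (num_windows, window_size * window_size).
    img_mask = img_mask.view(1, H // window_size, window_size,
                             W // window_size, window_size, 1)
    mask_windows = img_mask.permute(0, 1, 3, 2, 4, 5).reshape(
        -1, window_size * window_size)
    # Tokens coming from different regions must not attend to each other.
    attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
    attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0))
    attn_mask = attn_mask.masked_fill(attn_mask == 0, float(0.0))
    return attn_mask
```

Calling this in forward with the actual H and W lets the same block handle any input resolution without storing a fixed-size buffer.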
We use bi-cubic interpolation on relative_position_bias_table to deal with larger window size. Will provide related code soon.
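A minimal sketch of that interpolation, assuming the table is stored as in the released checkpoints with shape ((2*S1-1)^2, num_heads) for source window size S1 and is resized for target window size S2; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def resize_rel_pos_bias_table(table, src_window, dst_window):
    # table: ((2 * src_window - 1) ** 2, num_heads) from the pretrained checkpoint.
    src_size = 2 * src_window - 1
    dst_size = 2 * dst_window - 1
    num_heads = table.shape[1]
    # Reshape to an image-like layout: (1, num_heads, src_size, src_size).
    table_2d = table.permute(1, 0).reshape(1, num_heads, src_size, src_size)
    # Bi-cubic interpolation to the relative-coordinate grid of the new window size.
    table_2d = F.interpolate(table_2d, size=(dst_size, dst_size),
                             mode='bicubic', align_corners=False)
    # Back to the checkpoint layout: (dst_size * dst_size, num_heads).
    return table_2d.reshape(num_heads, dst_size * dst_size).permute(1, 0)
```

For example, going from window 7 (a 13x13 grid per head) to window 12 (23x23 per head), the resized table can then be loaded into the fine-tuning model in place of the original parameter.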
Swin V2 has a better approach for dealing with different window sizes.
Instructions and configs for fine-tuning on higher resolution can be found here: https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md#fine-tuning-on-higher-resolution
Help me! Why do I get this? KeyError: 'backbone.stages.0.blocks.0.attn.w_msa.relative_position_bias_table'