Add SwinV2
Fixes #6242
It still doesn't seem to pass the ufmt check, even though I've formatted locally.
https://app.circleci.com/pipelines/github/pytorch/vision/18731/workflows/38887801-2a22-478d-9528-6fdf95b89d4a/jobs/1515423?invite=true#step-104-12
Does anyone know how to solve it?
I also see an issue in the Microsoft Swin-Transformer repo: https://github.com/microsoft/Swin-Transformer/issues/194
It argues that it's not necessary to divide the mask into 9 parts because 4 parts are already enough.
I kind of agree with that. Does anyone have an opinion on it?
According to the unittest logs, the state_dict contains certain int values rather than torch.Tensor? ~~I currently have no clue about this actually.~~
https://app.circleci.com/pipelines/github/pytorch/vision/18731/workflows/91dedbbe-264b-4665-a146-91117a6c8f63/jobs/1515479?invite=true#step-108-2975
It seems these mod operations are treated as nodes of the fx graph, and their output is an int rather than a torch.Tensor, while this test file requires every node output to be a torch.Tensor.
The previous SwinTransformer V1 doesn't fail these tests only because those mod operations are not sampled with that random seed.
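For reference, here is a minimal standalone sketch (not the torchvision code) of what I mean: tracing a forward that computes a padding amount from a tensor size makes the mod show up as a graph node whose runtime output is a plain int.

import torch
import torch.fx
import torch.nn.functional as F

class Toy(torch.nn.Module):
    def forward(self, x):
        # Under symbolic_trace, x.shape[-1] is a Proxy, so this mod becomes a
        # call_function node in the graph; at runtime its output is an int.
        pad = x.shape[-1] % 4
        return F.pad(x, (0, pad))

gm = torch.fx.symbolic_trace(Toy())
print(gm.graph)  # contains getattr/getitem/mod nodes that don't produce tensors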
@ain-soph Thanks a lot for your contribution!
Concerning the linter, if you check the CI tab Required lint modifications, it will show you what the problem is:
Concerning your point about PatchMerging, I believe you are right. FX doesn't let you use the input tensor to control the flow of the program. This needs to move outside of the main function and be declared a non fx-traceable operator. I would patch this ASAP outside of this PR. cc @YosuaMichael
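One way to do that, roughly (the helper name below is only illustrative): move the size-dependent padding into a free function and tell FX to keep it as a single opaque node instead of tracing into it.

import torch
import torch.fx
import torch.nn.functional as F

def _pad_to_window(x, window_size):
    # x is (B, H, W, C); pad H and W up to a multiple of the window size.
    _, H, W, _ = x.shape
    pad_h = (window_size - H % window_size) % window_size
    pad_w = (window_size - W % window_size) % window_size
    if pad_h > 0 or pad_w > 0:  # data-dependent control flow stays out of the traced forward
        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))
    return x

# Registered as a leaf: symbolic_trace records one call instead of tracing the body.
torch.fx.wrap("_pad_to_window")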
Hi @ain-soph, thanks a lot for the PR! As of now, I am still reading the SwinTransformer V2 paper and the original code, and I will try to review afterwards.
Meanwhile, let me address some of the issues you raised:
- On the ufmt issue, can you make sure you install the following versions: pip install ufmt==1.3.2 black==21.9b0 usort==0.6.4 (reference)
- For the fx issue, I created a small patch: https://github.com/pytorch/vision/pull/6252
I'm currently using v2_logit_scale and use_v2 as two variables to check versions. Please comment if you have a better solution.
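Roughly what I have in mind (a simplified sketch, not the actual code in the PR; names mirror the two variables above): the shared attention module branches on use_v2, and v2_logit_scale only exists for the V2 case.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, use_v2: bool = False):
        super().__init__()
        self.num_heads = num_heads
        self.use_v2 = use_v2
        self.qkv = nn.Linear(dim, dim * 3)
        if use_v2:
            # V2 uses cosine attention with a learnable per-head temperature.
            self.v2_logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))

    def attention_logits(self, q, k):
        if self.use_v2:
            attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
            return attn * torch.clamp(self.v2_logit_scale, max=math.log(100.0)).exp()
        # V1: plain scaled dot-product attention.
        return (q * q.shape[-1] ** -0.5) @ k.transpose(-2, -1)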
@ain-soph Thanks for your good PR. I have two suggestions:
- We should fix the problem in https://github.com/pytorch/vision/pull/6222 (for input size=256 and window size=8 it works well, but for input size=384 and window size=12 there is a bug at the last block: it should be 6, but we do padding and use 8).
- I think we should also add pretrained_window_size to the SwinTransformer class. I think it should be List[List[int]] or List[int] (if all blocks use the same pretrained window size).
@xiaohu2015 I'll take a look at the problem you mentioned here, and add pretrained_window_size as an argument (it's zero for all variants though). And I think it should be List[int]? (according to the original code in https://github.com/microsoft/Swin-Transformer/blob/b720b4191588c19222ccf129860e905fb02373a7/models/swin_transformer_v2.py#L526)
In torchvision, we define the window size as List[int]; it means the window size along each dimension. The Tuple[int] in the official code means the window size of each block.
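To make the distinction concrete (the values below are only examples, not actual defaults):

# torchvision convention: one window size shared by all blocks,
# one entry per spatial dimension (H, W)
window_size = [8, 8]

# official-repo convention: one entry per layer/block, e.g. the
# pretrained_window_sizes argument of SwinTransformerV2
pretrained_window_sizes = [0, 0, 0, 0]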
@xiaohu2015 I just read the window_size problem you mentioned and agree that the current implementation has some potential risk in certain edge cases.
But I think it's a little out of scope and we should open another issue to get that solved.
I'm following the current V1 architecture.
If we plan to pass image_size as an argument of the Transformer model in the future to limit the window size, we can definitely do that in a new PR as a patch fix.
The evaluation log of Swin_V2_T using pretrained weights from the official repo:
torchrun --nproc_per_node=4 train.py --model swin_v2_t --test-only --weights Swin_V2_T_Weights.IMAGENET1K_V1
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 0): env://
| distributed init (rank 2): env://
Namespace(data_path='/data/user/data/image/imagenet', model='swin_v2_t', device='cuda', batch_size=32, epochs=90, workers=16, opt='sgd', lr=0.1, momentum=0.9, weight_decay=0.0001, norm_weight_decay=None, bias_weight_decay=None, transformer_embedding_decay=None, label_smoothing=0.0, mixup_alpha=0.0, cutmix_alpha=0.0, lr_scheduler='steplr', lr_warmup_epochs=0, lr_warmup_method='constant', lr_warmup_decay=0.01, lr_step_size=30, lr_gamma=0.1, lr_min=0.0, print_freq=10, output_dir='.', resume='', start_epoch=0, cache_dataset=False, sync_bn=False, test_only=True, auto_augment=None, random_erase=0.0, amp=False, world_size=4, dist_url='env://', model_ema=False, model_ema_steps=32, model_ema_decay=0.99998, use_deterministic_algorithms=False, interpolation='bilinear', val_resize_size=256, val_crop_size=224, train_crop_size=224, clip_grad_norm=None, ra_sampler=False, ra_reps=3, weights='Swin_V2_T_Weights.IMAGENET1K_V1', rank=0, gpu=0, distributed=True, dist_backend='nccl')
Loading data
Loading training data
Took 3.2938268184661865
Loading validation data
Creating data loaders
Creating model
/home/user/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
/home/user/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
/home/user/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
/home/user/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
Test: [ 0/391] eta: 0:28:16 loss: 0.2255 (0.2255) acc1: 96.8750 (96.8750) acc5: 100.0000 (100.0000) time: 4.3390 data: 2.0224 max mem: 817
Test: [100/391] eta: 0:01:28 loss: 0.6355 (0.6686) acc1: 87.5000 (86.2314) acc5: 96.8750 (97.4938) time: 0.2590 data: 0.1852 max mem: 817
Test: [200/391] eta: 0:00:57 loss: 1.1148 (0.7348) acc1: 71.8750 (84.2973) acc5: 93.7500 (97.0771) time: 0.2569 data: 0.1827 max mem: 817
Test: [300/391] eta: 0:00:27 loss: 0.9935 (0.8109) acc1: 78.1250 (82.6100) acc5: 93.7500 (96.2417) time: 0.3584 data: 0.2839 max mem: 817
Test: Total time: 0:02:00
Test: Acc@1 81.644 Acc@5 95.858
Here is the transform:
transforms=partial(
ImageClassification, crop_size=256, resize_size=256, interpolation=InterpolationMode.BILINEAR
),
Here is the Python script to convert the official pretrained weights into the format expected by our implementation:
from collections import OrderedDict

import torch

# The official checkpoint stores the weights under the "model" key.
_dict: OrderedDict[str, torch.Tensor] = torch.load("swinv2_tiny_patch4_window8_256.pth")["model"]
new_dict = OrderedDict()
for key, value in _dict.items():
    k = key
    v = value
    if 'attn_mask' in k:
        # attention masks are buffers that our implementation recomputes
        continue
    if 'relative_position_index' in k:
        # our implementation stores the index flattened
        v = v.flatten()
    if k.startswith('patch_embed'):
        k = k.replace('patch_embed.proj', 'features.0.0')
        k = k.replace('patch_embed.norm', 'features.0.2')
    elif k.startswith('layers'):
        k = k.removeprefix('layers.')
        if 'blocks' in k:
            # blocks of stage i live at features.{2*i + 1}
            k = f'features.{2*int(k[0]) + 1:d}.' + k[2:].removeprefix('blocks.')
            # k = k.replace('logit_scale', 'v2_logit_scale')
            if k.endswith('q_bias'):
                # the official code keeps separate q/v biases (the k bias is zero);
                # fuse them into a single qkv.bias
                v_bias = _dict[key[:-6] + 'v_bias']
                k = k.replace('q_bias', 'qkv.bias')
                assert value.numel() == v_bias.numel()
                v = torch.cat([value, torch.zeros_like(value), v_bias])
            elif k.endswith('v_bias'):
                # already folded into qkv.bias above
                continue
            if '.mlp.' in k:
                k = k.replace('fc1', '0')
                k = k.replace('fc2', '3')
        else:
            # the downsampling (PatchMerging) of stage i lives at features.{2*(i + 1)}
            assert 'downsample' in k, k
            k = f'features.{2*(int(k[0]) + 1):d}.' + k[2:].removeprefix('downsample.')
    new_dict[k] = v
torch.save(new_dict, './test.pth')
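As a sanity check after the conversion, loading the converted file into the model from this PR should succeed with the default strict=True if the key mapping is complete (the builder name below assumes the one added in this PR):

import torch
from torchvision.models import swin_v2_t

model = swin_v2_t()
model.load_state_dict(torch.load('./test.pth'))  # raises if any key is missing or unexpected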
@YosuaMichael and I have some concerns about the argument window_size.
All variants of the previous Swin Transformer V1 share the same window_size=[7, 7]. Therefore, the current code hardcodes it in the swin_t, swin_s and swin_b constructor functions.
However, for Swin V2 there are 3 window sizes available: 8, 16 and 24. (We may not need to consider 24 though.)
In this case, shall we make window_size a mutable argument of the swin_v2_t function?
If so, this raises further concerns about the pretrained weights:
- How do we store pretrained weights in Swin_V2_T_Weights(WeightsEnum)?
- Shall we store weights for different window sizes in different WeightsEnums?

@YosuaMichael also points out that, considering the legacy usage where pretrained=True should always work, swin_v2_t(window_size, pretrained=True, weights: Optional[Swin_V2_T_Weights]=None) should also be able to load weights for a different window_size (rough sketch below).
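A rough sketch of the question (this signature is hypothetical, not a proposal for the final API):

from typing import List, Optional

from torchvision.models import Swin_V2_T_Weights  # the enum being added in this PR

def swin_v2_t(
    *,
    window_size: Optional[List[int]] = None,
    weights: Optional[Swin_V2_T_Weights] = None,
    progress: bool = True,
    **kwargs,
):
    window_size = window_size if window_size is not None else [8, 8]
    weights = Swin_V2_T_Weights.verify(weights)
    # Each checkpoint is tied to one window size, so either the enum entry has to
    # record which window_size it was trained with, or we need one entry per size.
    ...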
I'm currently training a swin_v2_t using the following command.
torchrun --nproc_per_node=4 train.py --model swin_v2_t --epochs 300 --batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 --bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr --lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 --amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 --random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler --ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
@xiaohu2015 could you elaborate more on the bug you mentioned:

we should fix the problem in https://github.com/pytorch/vision/pull/6222 (for input size=256 window size=8, it works well, but for input size=384 window size=12, it has a bug at the last block (it should be 6, but we do padding and use 8))

To be specific, can you explain, for input_size=384 and window_size=12, what the bug is on the last block? And what do you mean by "it should be 6, but we do padding and use 8" (what do these numbers refer to)?
Do you think the implementation of swin_transformer here: https://github.com/SwinTransformer/Swin-Transformer-Object-Detection/blob/461e003166a8083d0b620beacd4662a2df306bd6/mmdet/models/backbones/swin_transformer.py#L405 resolves this issue? (I got this repo from https://github.com/pytorch/vision/issues/6227 and may have misunderstood what it means by dynamic resolution.)
@YosuaMichael and I are considering deprecating the argument pretrained_window_size, because it doesn't seem to be used in the standard variants. It's an argument for some fine-tuning-related ablation studies in the original paper.
But this needs some discussion. Any opinion on this? @xiaohu2015
Hi @ain-soph, as discussed offline I am happy to take the review from here, and support you in the verification of the model weights that will be shipped.
This is my training result for swin_v2_t, using the command from https://github.com/pytorch/vision/tree/main/references/classification#swintransformer
Epoch: [299] Total time: 0:20:24
Test: [ 0/98] eta: 0:06:43 loss: 1.2362 (1.2362) acc1: 93.7500 (93.7500) acc5: 98.4375 (98.4375) time: 4.1221 data: 3.8947 max mem: 14880
Test: Total time: 0:00:26
Test: Acc@1 81.560 Acc@5 95.892
Test: EMA [ 0/98] eta: 0:06:01 loss: 1.2378 (1.2378) acc1: 93.7500 (93.7500) acc5: 98.4375 (98.4375) time: 3.6888 data: 3.4649 max mem: 14880
Test: EMA Total time: 0:00:25
Test: EMA Acc@1 81.614 Acc@5 95.930
Training time 4 days, 13:34:44
Compared with the official result of 81.8%, I think our result (81.56%, or 81.61% with EMA) is comparable.
I think the main reason for the difference may be the resolution: their input size is 256 while ours is 224.
I'm considering adding two training arguments: --train-crop-size=256 --val-resize-size=256
@jdsgomes I've added the small and base variants in the previous commit. Please train those two variants as well.
Btw, do you think the performance of the current tiny variant is acceptable?
Actually, maybe we could port the weights directly instead of training.
I've done that work and validated it at https://github.com/pytorch/vision/pull/6246#issuecomment-1179650923
I am going to launch the other training variants today and hopefully they will be ready by Monday. Regarding the difference in resolution size, we can do a post-training optimisation similar to what we did in Swin v1. There is no need for retraining; we can just try to find the optimal val-resize-size.
As discussed offline, my training runs failed to converge, so we will work together with @ain-soph to find the source of the problem and update the PR once we manage to train all variants successfully.
I've launched another training run of the tiny architecture based on the most up-to-date code.
So far I don't observe any strange trend indicating a failure to converge.
Epoch: [142] [2500/2502] eta: 0:00:00 lr: 0.0006042917608127198 img/s: 266.4080904318139 loss: 3.6911 (3.6364) acc1: 48.4375 (48.2548) acc5: 73.4375 (71.9340) time: 0.4798 data: 0.0001 max mem: 14880
Epoch: [142] Total time: 0:20:10
Test: [ 0/98] eta: 0:06:25 loss: 1.5864 (1.5864) acc1: 82.0312 (82.0312) acc5: 96.0938 (96.0938) time: 3.9314 data: 3.7073 max mem: 14880
Test: Total time: 0:00:25
Test: Acc@1 71.494 Acc@5 91.496
Test: EMA [ 0/98] eta: 0:05:43 loss: 1.4705 (1.4705) acc1: 88.2812 (88.2812) acc5: 97.6562 (97.6562) time: 3.5061 data: 3.2585 max mem: 14880
Test: EMA Total time: 0:00:25
Test: EMA Acc@1 76.420 Acc@5 93.660
@ain-soph Could you please share the exact command you used to train it, to ensure we are not missing anything on our side?
Oops, there are also memory issues on GPU, on both Linux and Windows. I know that a different test is actually failing, but this can happen: sometimes adding a big new model leads to issues on other models because the memory is not cleared properly. Usually the way around this is to reduce the memory footprint of the test by passing smaller input sizes or disabling the particular test on the GPU.
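For reference, a rough sketch of the first option, assuming the per-model override dict in test/test_models.py (the exact dict name, keys, and a shape that stays compatible with the patch/window configuration should be checked against the actual test file):

# test/test_models.py (hypothetical entries)
_model_params = {
    # ... existing entries ...
    # smaller inputs for the SwinV2 variants to cut GPU memory in the common tests;
    # the spatial size must still be valid for patch_size=4 and window_size=[8, 8]
    "swin_v2_t": {"input_shape": (1, 3, 128, 128)},
    "swin_v2_s": {"input_shape": (1, 3, 128, 128)},
    "swin_v2_b": {"input_shape": (1, 3, 128, 128)},
}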
torchrun --nproc_per_node=4 train.py --model swin_v2_t --epochs 300 --batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 --bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr --lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 --amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 --random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler --ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
Epoch: [221] Total time: 0:25:25
Test: [ 0/98] eta: 0:06:23 loss: 1.3831 (1.3831) acc1: 91.4062 (91.4062) acc5: 98.4375 (98.4375) time: 3.9118 data: 3.6845 max mem: 14880
Test: Total time: 0:00:25
Test: Acc@1 77.738 Acc@5 94.460
Test: EMA [ 0/98] eta: 0:05:15 loss: 1.3172 (1.3172) acc1: 90.6250 (90.6250) acc5: 99.2188 (99.2188) time: 3.2230 data: 2.9882 max mem: 14880
Test: EMA Total time: 0:00:25
Test: EMA Acc@1 80.012 Acc@5 95.286
Thank you for sharing the commands. Although you mentioned it previously, I missed that in v2 they used a different resolution size. I launched new jobs to train all the variants with the correct resolution size and the results seem to be looking better now.
@ain-soph thank you for your patience and great work! I managed to train all three variants and the results look good.
My plan is to update this PR in the next couple of hours with the model weights.
Training commands:
# swin_v2_t
python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 1 --model swin_v2_t --epochs 300 \
--batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 \
--bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr \
--lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 \
--amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 \
--random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler \
--ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
# swin_v2_s
python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 1 --model swin_v2_s --epochs 300 \
--batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 \
--bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr \
--lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 \
--amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 \
--random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler \
--ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
# swin_v2_b
python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 1 --model swin_v2_b --epochs 300 \
--batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 \
--bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr \
--lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 \
--amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 \
--random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler \
--ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
Test commands and accuracies
# swin_v2_t
srun -p dev --cpus-per-task=96 -t 24:00:00 --gpus-per-node=1 torchrun --nproc_per_node=1 train.py \
--model swin_v2_t --test-only --resume $EXPERIMENTS_PATH/44757/model_299.pth --interpolation bicubic \
--val-resize-size 260 --val-crop-size 256
# Test: Acc@1 82.072 Acc@5 96.132
# swin_v2_s
srun -p dev --cpus-per-task=96 -t 24:00:00 --gpus-per-node=1 torchrun --nproc_per_node=1 train.py \
--model swin_v2_s --test-only --resume $EXPERIMENTS_PATH/44758/model_299.pth --interpolation bicubic \
--val-resize-size 260 --val-crop-size 256
# Test: Acc@1 83.712 Acc@5 96.816
# swin_v2_b
srun -p dev --cpus-per-task=96 -t 24:00:00 --gpus-per-node=1 torchrun --nproc_per_node=1 train.py \
--model swin_v2_b --test-only --resume $EXPERIMENTS_PATH/44759/model_299.pth --interpolation bicubic \
--val-resize-size 272 --val-crop-size 256
# Test: Acc@1 84.112 Acc@5 96.864
I guess the only final thing on our plate is to fix the memory issue. I’ll work on it in the following week.
Btw, should we provide ported model weights from the official repo as an alternative?
@ain-soph @jdsgomes Awesome work! Having SwinV2 in TorchVision is great. Looking forward to using it.
Btw, should we provide ported model weights from the official repo as an alternative?
I think, given that we were able to reproduce the accuracy of the paper, there is no point offering both. What would be interesting in the future is to offer higher-accuracy weights by using newer training recipes.