Add SwinV2
Fixes #6242
It still doesn't seem to pass the ufmt check, even though I've formatted locally.
https://app.circleci.com/pipelines/github/pytorch/vision/18731/workflows/38887801-2a22-478d-9528-6fdf95b89d4a/jobs/1515423?invite=true#step-104-12
Does anyone know how to solve it?
I also see an issue in the Microsoft Swin-Transformer repo: https://github.com/microsoft/Swin-Transformer/issues/194
It argues that it's not necessary to divide the mask into 9 parts because 4 parts are already enough.
I kind of agree with that. Does anyone have an opinion on it?
According to the unittest logs, the state_dict contains certain int values rather than torch.Tensor? ~~I currently have no clue about this actually.~~
https://app.circleci.com/pipelines/github/pytorch/vision/18731/workflows/91dedbbe-264b-4665-a146-91117a6c8f63/jobs/1515479?invite=true#step-108-2975
It seems these mod operations are treated as nodes of the fx graph, and their output is an int rather than a torch.Tensor, while this test file requires every node output to be a torch.Tensor.
The previous SwinTransformer V1 doesn't fail these tests only because those mod operations are not sampled with that random seed.
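For reference, here is a minimal standalone sketch (not the torchvision code) of what I mean: tracing a forward that computes a padding amount from a tensor size makes the mod show up as a graph node whose runtime output is a plain int.

import torch
import torch.fx
import torch.nn.functional as F

class Toy(torch.nn.Module):
    def forward(self, x):
        # Under symbolic_trace, x.shape[-1] is a Proxy, so this mod becomes a
        # call_function node in the graph; at runtime its output is an int.
        pad = x.shape[-1] % 4
        return F.pad(x, (0, pad))

gm = torch.fx.symbolic_trace(Toy())
print(gm.graph)  # contains getattr/getitem/mod nodes that don't produce tensors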
@ain-soph Thanks a lot for your contribution!
Concerning the linter, if you check the CI tab Required lint modifications, it will show you what the problem is:
Concerning your point about PatchMerging, I believe you are right. FX doesn't let you use the input tensor to control the flow of the program. This needs to move outside of the main function and be declared a non fx-traceable operator. I would patch this ASAP outside of this PR. cc @YosuaMichael
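One way to do that, roughly (the helper name below is only illustrative): move the size-dependent padding into a free function and tell FX to keep it as a single opaque node instead of tracing into it.

import torch
import torch.fx
import torch.nn.functional as F

def _pad_to_window(x, window_size):
    # x is (B, H, W, C); pad H and W up to a multiple of the window size.
    _, H, W, _ = x.shape
    pad_h = (window_size - H % window_size) % window_size
    pad_w = (window_size - W % window_size) % window_size
    if pad_h > 0 or pad_w > 0:  # data-dependent control flow stays out of the traced forward
        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))
    return x

# Registered as a leaf: symbolic_trace records one call instead of tracing the body.
torch.fx.wrap("_pad_to_window")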
Hi @ain-soph, thanks a lot for the PR! As of now, I am still reading the SwinTransformer V2 paper and the original code, and I will try to review afterwards.
Meanwhile, let me address some of the issues you raised:
- On the ufmt issue, can you make sure you install the following versions: pip install ufmt==1.3.2 black==21.9b0 usort==0.6.4 (reference)
- For the fx issue, I created a small patch: https://github.com/pytorch/vision/pull/6252
I'm currently using v2_logit_scale and use_v2 as two variables to check versions. Please comment if you have a better solution.
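Roughly what I have in mind (a simplified sketch, not the actual code in the PR; names mirror the two variables above): the shared attention module branches on use_v2, and v2_logit_scale only exists for the V2 case.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, use_v2: bool = False):
        super().__init__()
        self.num_heads = num_heads
        self.use_v2 = use_v2
        self.qkv = nn.Linear(dim, dim * 3)
        if use_v2:
            # V2 uses cosine attention with a learnable per-head temperature.
            self.v2_logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))

    def attention_logits(self, q, k):
        if self.use_v2:
            attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
            return attn * torch.clamp(self.v2_logit_scale, max=math.log(100.0)).exp()
        # V1: plain scaled dot-product attention.
        return (q * q.shape[-1] ** -0.5) @ k.transpose(-2, -1)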
@ain-soph Thanks for your good PR. I have two suggestions:
- We should fix the problem in https://github.com/pytorch/vision/pull/6222 (for input size=256 and window size=8 it works well, but for input size=384 and window size=12 there is a bug at the last block: it should be 6, but we do padding and use 8).
- I think we should also add pretrained_window_size to the SwinTransformer class. I think it should be List[List[int]] or List[int] (if all blocks use the same pretrained window size).
@xiaohu2015 I'll take a look at the problem you mentioned here, and add pretrained_window_size as an argument (it's zero for all variants though). And I think it should be List[int]? (according to the original code in https://github.com/microsoft/Swin-Transformer/blob/b720b4191588c19222ccf129860e905fb02373a7/models/swin_transformer_v2.py#L526)
In torchvision, we define the window size as List[int]; it means the window size along each dimension. The Tuple[int] in the official code means the window size of each block.
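To make the distinction concrete (the values below are only examples, not actual defaults):

# torchvision convention: one window size shared by all blocks,
# one entry per spatial dimension (H, W)
window_size = [8, 8]

# official-repo convention: one entry per layer/block, e.g. the
# pretrained_window_sizes argument of SwinTransformerV2
pretrained_window_sizes = [0, 0, 0, 0]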
@xiaohu2015 I just read the window_size problem you mentioned and agree that the current implementation has some potential risk in certain edge cases.
But I think it's a little out of scope and we should open another issue to get that solved.
I'm following the current V1 architecture.
If we plan to pass image_size as an argument of the Transformer model in the future to limit the window size, we can definitely do that in a new PR as a patch fix.
The evaluation log of Swin_V2_T using pretrained weights from the official repo:
torchrun --nproc_per_node=4 train.py --model swin_v2_t --test-only --weights Swin_V2_T_Weights.IMAGENET1K_V1
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 0): env://
| distributed init (rank 2): env://
Namespace(data_path='/data/user/data/image/imagenet', model='swin_v2_t', device='cuda', batch_size=32, epochs=90, workers=16, opt='sgd', lr=0.1, momentum=0.9, weight_decay=0.0001, norm_weight_decay=None, bias_weight_decay=None, transformer_embedding_decay=None, label_smoothing=0.0, mixup_alpha=0.0, cutmix_alpha=0.0, lr_scheduler='steplr', lr_warmup_epochs=0, lr_warmup_method='constant', lr_warmup_decay=0.01, lr_step_size=30, lr_gamma=0.1, lr_min=0.0, print_freq=10, output_dir='.', resume='', start_epoch=0, cache_dataset=False, sync_bn=False, test_only=True, auto_augment=None, random_erase=0.0, amp=False, world_size=4, dist_url='env://', model_ema=False, model_ema_steps=32, model_ema_decay=0.99998, use_deterministic_algorithms=False, interpolation='bilinear', val_resize_size=256, val_crop_size=224, train_crop_size=224, clip_grad_norm=None, ra_sampler=False, ra_reps=3, weights='Swin_V2_T_Weights.IMAGENET1K_V1', rank=0, gpu=0, distributed=True, dist_backend='nccl')
Loading data
Loading training data
Took 3.2938268184661865
Loading validation data
Creating data loaders
Creating model
/home/user/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
/home/user/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
/home/user/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
/home/user/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
Test: [ 0/391] eta: 0:28:16 loss: 0.2255 (0.2255) acc1: 96.8750 (96.8750) acc5: 100.0000 (100.0000) time: 4.3390 data: 2.0224 max mem: 817
Test: [100/391] eta: 0:01:28 loss: 0.6355 (0.6686) acc1: 87.5000 (86.2314) acc5: 96.8750 (97.4938) time: 0.2590 data: 0.1852 max mem: 817
Test: [200/391] eta: 0:00:57 loss: 1.1148 (0.7348) acc1: 71.8750 (84.2973) acc5: 93.7500 (97.0771) time: 0.2569 data: 0.1827 max mem: 817
Test: [300/391] eta: 0:00:27 loss: 0.9935 (0.8109) acc1: 78.1250 (82.6100) acc5: 93.7500 (96.2417) time: 0.3584 data: 0.2839 max mem: 817
Test: Total time: 0:02:00
Test: Acc@1 81.644 Acc@5 95.858
Here is the transform:
transforms=partial(
ImageClassification, crop_size=256, resize_size=256, interpolation=InterpolationMode.BILINEAR
),
Here is the Python script to convert the official pretrained weights into the format expected by our implementation:
from collections import OrderedDict

import torch

# The official checkpoint stores the weights under the "model" key.
_dict: OrderedDict[str, torch.Tensor] = torch.load("swinv2_tiny_patch4_window8_256.pth")["model"]
new_dict = OrderedDict()
for key, value in _dict.items():
    k = key
    v = value
    if 'attn_mask' in k:
        # attention masks are buffers that our implementation recomputes
        continue
    if 'relative_position_index' in k:
        # our implementation stores the index flattened
        v = v.flatten()
    if k.startswith('patch_embed'):
        k = k.replace('patch_embed.proj', 'features.0.0')
        k = k.replace('patch_embed.norm', 'features.0.2')
    elif k.startswith('layers'):
        k = k.removeprefix('layers.')
        if 'blocks' in k:
            # blocks of stage i live at features.{2*i + 1}
            k = f'features.{2*int(k[0]) + 1:d}.' + k[2:].removeprefix('blocks.')
            # k = k.replace('logit_scale', 'v2_logit_scale')
            if k.endswith('q_bias'):
                # the official code keeps separate q/v biases (the k bias is zero);
                # fuse them into a single qkv.bias
                v_bias = _dict[key[:-6] + 'v_bias']
                k = k.replace('q_bias', 'qkv.bias')
                assert value.numel() == v_bias.numel()
                v = torch.cat([value, torch.zeros_like(value), v_bias])
            elif k.endswith('v_bias'):
                # already folded into qkv.bias above
                continue
            if '.mlp.' in k:
                k = k.replace('fc1', '0')
                k = k.replace('fc2', '3')
        else:
            # the downsampling (PatchMerging) of stage i lives at features.{2*(i + 1)}
            assert 'downsample' in k, k
            k = f'features.{2*(int(k[0]) + 1):d}.' + k[2:].removeprefix('downsample.')
    new_dict[k] = v
torch.save(new_dict, './test.pth')
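As a sanity check after the conversion, loading the converted file into the model from this PR should succeed with the default strict=True if the key mapping is complete (the builder name below assumes the one added in this PR):

import torch
from torchvision.models import swin_v2_t

model = swin_v2_t()
model.load_state_dict(torch.load('./test.pth'))  # raises if any key is missing or unexpected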
@YosuaMichael and I have some concerns about the argument window_size.
All variants of the previous Swin Transformer V1 share the same window_size=[7, 7]. Therefore, the current code hardcodes it in the swin_t, swin_s and swin_b constructor functions.
However, for Swin V2 there are 3 window sizes available: 8, 16 and 24. (We may not need to consider 24 though.)
In this case, shall we make window_size a mutable argument of the swin_v2_t function?
If so, this raises further concerns about the pretrained weights:
- How do we store pretrained weights in Swin_V2_T_Weights(WeightsEnum)?
- Shall we store weights for different window sizes in different WeightsEnums?

@YosuaMichael also points out that, considering the legacy usage where pretrained=True should always work, swin_v2_t(window_size, pretrained=True, weights: Optional[Swin_V2_T_Weights]=None) should also be able to load weights for a different window_size (rough sketch below).
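A rough sketch of the question (this signature is hypothetical, not a proposal for the final API):

from typing import List, Optional

from torchvision.models import Swin_V2_T_Weights  # the enum being added in this PR

def swin_v2_t(
    *,
    window_size: Optional[List[int]] = None,
    weights: Optional[Swin_V2_T_Weights] = None,
    progress: bool = True,
    **kwargs,
):
    window_size = window_size if window_size is not None else [8, 8]
    weights = Swin_V2_T_Weights.verify(weights)
    # Each checkpoint is tied to one window size, so either the enum entry has to
    # record which window_size it was trained with, or we need one entry per size.
    ...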
I'm currently training a swin_v2_t using the following command.
torchrun --nproc_per_node=4 train.py --model swin_v2_t --epochs 300 --batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 --bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr --lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 --amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 --random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler --ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
@xiaohu2015 could you elaborate more on the bug you mentioned:

we should fix the problem in https://github.com/pytorch/vision/pull/6222 (for input size=256 window size=8, it works well, but for input size=384 window size=12, it has a bug at the last block (it should be 6, but we do padding and use 8))

To be specific, can you explain, for input_size=384 and window_size=12, what the bug is on the last block? And what do you mean by "it should be 6, but we do padding and use 8" (what do these numbers refer to)?
Do you think the implementation of swin_transformer here: https://github.com/SwinTransformer/Swin-Transformer-Object-Detection/blob/461e003166a8083d0b620beacd4662a2df306bd6/mmdet/models/backbones/swin_transformer.py#L405 resolves this issue? (I got this repo from https://github.com/pytorch/vision/issues/6227 and may have misunderstood what it means by dynamic resolution.)
@YosuaMichael and I are considering deprecating the argument pretrained_window_size, because it doesn't seem to be used in the standard variants. It's an argument for some fine-tuning-related ablation studies in the original paper.
But this needs some discussion. Any opinion on this? @xiaohu2015
Hi @ain-soph, as discussed offline I am happy to take the review from here, and support you in the verification of the model weights that will be shipped.
This is my training result for swin_v2_t, using the command from https://github.com/pytorch/vision/tree/main/references/classification#swintransformer
Epoch: [299] Total time: 0:20:24
Test: [ 0/98] eta: 0:06:43 loss: 1.2362 (1.2362) acc1: 93.7500 (93.7500) acc5: 98.4375 (98.4375) time: 4.1221 data: 3.8947 max mem: 14880
Test: Total time: 0:00:26
Test: Acc@1 81.560 Acc@5 95.892
Test: EMA [ 0/98] eta: 0:06:01 loss: 1.2378 (1.2378) acc1: 93.7500 (93.7500) acc5: 98.4375 (98.4375) time: 3.6888 data: 3.4649 max mem: 14880
Test: EMA Total time: 0:00:25
Test: EMA Acc@1 81.614 Acc@5 95.930
Training time 4 days, 13:34:44
Compared with the official result of 81.8%, I think our result (81.56%, or 81.61% with EMA) is comparable.
I think the main reason for the difference may be the resolution: their input size is 256 while ours is 224.
I'm considering adding two training arguments: --train-crop-size=256 --val-resize-size=256
@jdsgomes I've added the small and base variants in the previous commit. Please train those two variants as well.
Btw, do you think the performance of the current tiny variant is acceptable?
Actually, maybe we could port the weights directly instead of training.
I've done that work and validated it at https://github.com/pytorch/vision/pull/6246#issuecomment-1179650923
I am going to launch the other training variants today and hopefully they will be ready by Monday. Regarding the difference in resolution size, we can do a post-training optimisation similar to what we did in Swin v1. There is no need for retraining; we can just try to find the optimal val-resize-size.
As discussed offline, my training runs failed to converge, so we will work together with @ain-soph to find the source of the problem and update the PR once we manage to train all variants successfully.
I've launched another training run of the tiny architecture based on the most up-to-date code.
So far I don't observe any strange trend indicating a failure to converge.
Epoch: [142] [2500/2502] eta: 0:00:00 lr: 0.0006042917608127198 img/s: 266.4080904318139 loss: 3.6911 (3.6364) acc1: 48.4375 (48.2548) acc5: 73.4375 (71.9340) time: 0.4798 data: 0.0001 max mem: 14880
Epoch: [142] Total time: 0:20:10
Test: [ 0/98] eta: 0:06:25 loss: 1.5864 (1.5864) acc1: 82.0312 (82.0312) acc5: 96.0938 (96.0938) time: 3.9314 data: 3.7073 max mem: 14880
Test: Total time: 0:00:25
Test: Acc@1 71.494 Acc@5 91.496
Test: EMA [ 0/98] eta: 0:05:43 loss: 1.4705 (1.4705) acc1: 88.2812 (88.2812) acc5: 97.6562 (97.6562) time: 3.5061 data: 3.2585 max mem: 14880
Test: EMA Total time: 0:00:25
Test: EMA Acc@1 76.420 Acc@5 93.660
@ain-soph Could you please share the exact command you used to train it, to ensure we are not missing anything on our side?
Oops, there are also memory issues on GPU, on both Linux and Windows. I know that a different test is actually failing, but this can happen: sometimes adding a big new model leads to issues on other models because the memory is not cleared properly. Usually the way around this is to reduce the memory footprint of the test by passing smaller input sizes or disabling the particular test on the GPU.
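For reference, a rough sketch of the first option, assuming the per-model override dict in test/test_models.py (the exact dict name, keys, and a shape that stays compatible with the patch/window configuration should be checked against the actual test file):

# test/test_models.py (hypothetical entries)
_model_params = {
    # ... existing entries ...
    # smaller inputs for the SwinV2 variants to cut GPU memory in the common tests;
    # the spatial size must still be valid for patch_size=4 and window_size=[8, 8]
    "swin_v2_t": {"input_shape": (1, 3, 128, 128)},
    "swin_v2_s": {"input_shape": (1, 3, 128, 128)},
    "swin_v2_b": {"input_shape": (1, 3, 128, 128)},
}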
torchrun --nproc_per_node=4 train.py --model swin_v2_t --epochs 300 --batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 --bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr --lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 --amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 --random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler --ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
Epoch: [221] Total time: 0:25:25
Test: [ 0/98] eta: 0:06:23 loss: 1.3831 (1.3831) acc1: 91.4062 (91.4062) acc5: 98.4375 (98.4375) time: 3.9118 data: 3.6845 max mem: 14880
Test: Total time: 0:00:25
Test: Acc@1 77.738 Acc@5 94.460
Test: EMA [ 0/98] eta: 0:05:15 loss: 1.3172 (1.3172) acc1: 90.6250 (90.6250) acc5: 99.2188 (99.2188) time: 3.2230 data: 2.9882 max mem: 14880
Test: EMA Total time: 0:00:25
Test: EMA Acc@1 80.012 Acc@5 95.286
Thank you for sharing the commands. Although you mentioned it previously, I missed that in v2 they used a different resolution size. I launched new jobs to train all the variants with the correct resolution size and the results seem to be looking better now.
@ain-soph thank you for your patience and great work! I managed to train all three variants and the results look good.
My plan is to update this PR in the next couple of hours with the model weights.
Training commands:
# swin_v2_t
python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 1 --model swin_v2_t --epochs 300 \
--batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 \
--bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr \
--lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 \
--amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 \
--random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler \
--ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
# swin_v2_s
python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 1 --model swin_v2_s --epochs 300 \
--batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 \
--bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr \
--lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 \
--amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 \
--random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler \
--ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
# swin_v2_b
python -u run_with_submitit.py --timeout 3000 --ngpus 8 --nodes 1 --model swin_v2_b --epochs 300 \
--batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 --norm-weight-decay 0.0 \
--bias-weight-decay 0.0 --transformer-embedding-decay 0.0 --lr-scheduler cosineannealinglr \
--lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 --lr-warmup-decay 0.01 \
--amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 --cutmix-alpha 1.0 \
--random-erase 0.25 --interpolation bicubic --auto-augment ta_wide --model-ema --ra-sampler \
--ra-reps 4 --val-resize-size 256 --val-crop-size 256 --train-crop-size 256
Test commands and accuracies
# swin_v2_t
srun -p dev --cpus-per-task=96 -t 24:00:00 --gpus-per-node=1 torchrun --nproc_per_node=1 train.py \
--model swin_v2_t --test-only --resume $EXPERIMENTS_PATH/44757/model_299.pth --interpolation bicubic \
--val-resize-size 260 --val-crop-size 256
# Test: Acc@1 82.072 Acc@5 96.132
# swin_v2_s
srun -p dev --cpus-per-task=96 -t 24:00:00 --gpus-per-node=1 torchrun --nproc_per_node=1 train.py \
--model swin_v2_s --test-only --resume $EXPERIMENTS_PATH/44758/model_299.pth --interpolation bicubic \
--val-resize-size 260 --val-crop-size 256
# Test: Acc@1 83.712 Acc@5 96.816
# swin_v2_b
srun -p dev --cpus-per-task=96 -t 24:00:00 --gpus-per-node=1 torchrun --nproc_per_node=1 train.py \
--model swin_v2_b --test-only --resume $EXPERIMENTS_PATH/44759/model_299.pth --interpolation bicubic \
--val-resize-size 272 --val-crop-size 256
# Test: Acc@1 84.112 Acc@5 96.864
I guess the only final thing on our plate is to fix the memory issue. I’ll work on it in the following week.
Btw, should we provide ported model weights from the official repo as an alternative?
@ain-soph @jdsgomes Awesome work! Having SwinV2 in TorchVision is great. Looking forward to using it.
Btw, should we provide ported model weights from the official repo as an alternative?
I think, given that we were able to reproduce the accuracy of the paper, there is no point offering both. What would be interesting in the future is to offer higher-accuracy weights by using newer training recipes.