
Normalization in the PatchEmbed

Phuoc-Hoan-Le opened this issue 2 years ago

Hi,

For a standard ViT, there is no normalization layer after the nn.Conv2d in the patch embedding. However, for Swin Transformer, there is a normalization layer after the nn.Conv2d in the patch embedding.

Why did you decide to add normalization after that nn.Conv2d? Have you tried training Swin without a normalization layer after the nn.Conv2d in the patch embedding, to see whether it performs better?
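For context, here is a minimal sketch of the patch embedding being discussed (simplified from the repo's PatchEmbed; if I read the code correctly, the extra norm is enabled by the patch_norm=True default, whereas a plain ViT would effectively pass norm_layer=None):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them to embed_dim."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96, norm_layer=nn.LayerNorm):
        super().__init__()
        # Non-overlapping patches: kernel_size == stride == patch_size
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Swin normalizes the patch tokens right here; a vanilla ViT skips this step.
        self.norm = norm_layer(embed_dim) if norm_layer is not None else nn.Identity()

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return self.norm(x)

# A 224x224 image becomes 56*56 = 3136 tokens of dimension 96.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])
```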

Phuoc-Hoan-Le avatar Feb 10 '23 16:02 Phuoc-Hoan-Le

Hi,

I believe the paper answers this. Large vision models have instability problems during training, and the authors found that discrepancies in activation amplitudes across layers become significant. That is one of the main problems of older vision transformers addressed by this paper.

asparsa avatar Feb 19 '23 15:02 asparsa

Hi,

> I believe the paper answers this. Large vision models have instability problems during training, and the authors found that discrepancies in activation amplitudes across layers become significant. That is one of the main problems of older vision transformers addressed by this paper.

I think you're talking about SwinV2, where they discussed why they used res-post-norm rather than pre-norm. Here, I am asking more specifically why they decided to add normalization after that nn.Conv2d in the patch embedding of the first layer, which is present even in SwinV1.
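To make the distinction concrete, here is a rough sketch of the two residual orderings being contrasted; plain nn.MultiheadAttention stands in for Swin's window attention, and the dimensions are illustrative rather than taken from the repo:

```python
import torch
import torch.nn as nn

dim = 96
norm = nn.LayerNorm(dim)
attn = nn.MultiheadAttention(dim, num_heads=3, batch_first=True)

def pre_norm(x):
    # SwinV1 / ViT: normalize the branch input, add the residual unnormalized.
    h = norm(x)
    h, _ = attn(h, h, h)
    return x + h

def res_post_norm(x):
    # SwinV2: normalize the branch output before the residual add,
    # which keeps activation amplitudes comparable across deep layers.
    h, _ = attn(x, x, x)
    return x + norm(h)

x = torch.randn(1, 3136, dim)
print(pre_norm(x).shape, res_post_norm(x).shape)
```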

Phuoc-Hoan-Le avatar Feb 19 '23 23:02 Phuoc-Hoan-Le