Swin-Transformer
Normalization in the PatchEmbed
Hi,
In a standard ViT, there is no normalization layer after the nn.Conv2d in the patch embedding. In Swin Transformer, however, there is a normalization layer after the nn.Conv2d in the patch embedding.
Why did you decide to add normalization after that nn.Conv2d? Have you tried training Swin without a normalization layer after the nn.Conv2d in the patch embedding to see whether it performs better?
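For concreteness, here is a minimal sketch of the difference I mean (simplified and illustrative, not the exact code from either repository; the parameter defaults are my own assumptions):

```python
import torch
import torch.nn as nn

# ViT-style patch embedding: a Conv2d projection, tokens returned directly.
class ViTPatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                       # B, C, H/ps, W/ps
        return x.flatten(2).transpose(1, 2)    # B, N, C

# Swin-style patch embedding: the same Conv2d projection, but followed by a
# LayerNorm over the channel dimension of the resulting tokens.
class SwinPatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4,
                 norm_layer=nn.LayerNorm):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = norm_layer(embed_dim) if norm_layer is not None else None

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)  # B, N, C
        if self.norm is not None:
            x = self.norm(x)                         # the extra normalization in question
        return x
```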
Hi,
I believe the paper answers this. Large vision models have stability problems during training, and the authors found that discrepancies in activation amplitudes across layers become significant. That is one of the main problems of older vision transformers addressed by this paper.
I think you're referring to SwinV2, where they discuss why they used res-post-norm rather than pre-norm. Here I am asking specifically about why they decided to add normalization after that nn.Conv2d in the patch embedding of the first layer, which is present even in SwinV1.
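To make the distinction concrete, here is a rough sketch of pre-norm versus res-post-norm ordering in a transformer block (illustrative only; the layer choices and defaults are my own assumptions, not the actual Swin block code):

```python
import torch
import torch.nn as nn

# Pre-norm (SwinV1-style block): normalize the input before the sublayer.
#   x = x + attn(norm(x))
class PreNormBlock(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        return x + h

# Res-post-norm (SwinV2): normalize the sublayer output before the residual add.
#   x = x + norm(attn(x))
class ResPostNormBlock(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        h, _ = self.attn(x, x, x)
        return x + self.norm(h)
```

That SwinV2 change is about the blocks inside each stage, whereas my question is about the LayerNorm applied right after the patch-embedding Conv2d, which exists in SwinV1 as well.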