
transformations in MiniViT paper

Open · gudrb opened this issue 1 year ago · 9 comments

Hello, I have a question about the transformations in the MiniViT paper.

I could find the first transformation (implemented in the MiniAttention class) in the code: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L104

However, I couldn't find the second transformation in the code (which should be before or inside the MLP in the MiniBlock class): https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L137

Could you please let me know where the second transformation is?

gudrb · Feb 22 '24 16:02

Hi @gudrb, thanks for your attention to our work!

In Mini-DeiT, the transformation for MLP is the relative position encoding https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L117

In Mini-Swin, the transformation for MLP is the depth-wise convolution layer https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-Swin/models/swin_transformer_minivit.py#L275
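
For intuition, here is a minimal sketch of that depth-wise convolution transformation, assuming a (B, N, C) token sequence reshaped to an H×W grid before the shared MLP; the class name DWConvTransform and the exact shapes are illustrative, not taken from the repo:

```python
import torch.nn as nn

class DWConvTransform(nn.Module):
    """Sketch of the Mini-Swin MLP transformation: an unshared depth-wise
    convolution over the token grid, applied before the shared MLP."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # groups=dim makes the convolution depth-wise: one filter per channel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N == H * W
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # tokens -> feature map
        x = self.dwconv(x)                         # depth-wise conv
        return x.flatten(2).transpose(1, 2)        # back to (B, N, C)
```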

wkcn · Feb 23 '24 02:02

In the MiniViT paper:

> We make several modifications on DeiT: First, we remove the [class] token. The model is attached with a global average pooling layer and a fully-connected layer for image classification. We also utilize relative position encoding to introduce inductive bias to boost the model convergence [52,59]. Finally, based on our observation that transformation for FFN only brings limited performance gains in DeiT, we remove the block to speed up both training and inference.

-> Does this mean that in the Mini-DeiT model, iRPE is utilized (for the value), and the MLP transformation is removed, leaving only the attention transformation?

gudrb · Feb 23 '24 02:02

> In the MiniViT paper:
>
> We make several modifications on DeiT: First, we remove the [class] token. The model is attached with a global average pooling layer and a fully-connected layer for image classification. We also utilize relative position encoding to introduce inductive bias to boost the model convergence [52,59]. Finally, based on our observation that transformation for FFN only brings limited performance gains in DeiT, we remove the block to speed up both training and inference.
>
> -> Does this mean that in the Mini-DeiT model, iRPE is utilized (for the value), and the MLP transformation is removed, leaving only the attention transformation?

Yes. Let me correct my earlier statement: there is no transformation for the FFN in Mini-DeiT. iRPE is applied only to the key. https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L97

https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_deit_models.py#L17
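
To illustrate where a key-side relative position term enters the attention computation, here is a deliberately simplified sketch using a plain trainable bias table over clipped 1-D distances; the real Mini-DeiT code uses the full iRPE (piecewise bucketing, product method, contextual mode), so treat the names and shapes here as assumptions:

```python
import torch
import torch.nn as nn

class SimpleKeyRPE(nn.Module):
    """Simplified key-side relative position encoding (bias mode),
    standing in for the much richer iRPE used in Mini-DeiT."""
    def __init__(self, num_heads, max_rel_dist):
        super().__init__()
        self.max_rel_dist = max_rel_dist
        # one trainable bias per head and per clipped relative distance
        self.table = nn.Parameter(torch.zeros(num_heads, 2 * max_rel_dist + 1))

    def forward(self, attn_logits):
        # attn_logits: (B, heads, N, N) pre-softmax scores q @ k^T / sqrt(d)
        N = attn_logits.shape[-1]
        idx = torch.arange(N, device=attn_logits.device)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_rel_dist,
                                                  self.max_rel_dist)
        bias = self.table[:, rel + self.max_rel_dist]  # (heads, N, N)
        return attn_logits + bias.unsqueeze(0)
```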

wkcn · Feb 23 '24 05:02

Hello,

I have a question regarding the implementation of layer normalization in the MiniViT paper and the corresponding code. Specifically, I am referring to how layer normalization is applied between transformer blocks.

In the MiniViT paper, it is mentioned that layer normalization between transformer blocks is not shared, and I believe the code reflects this. However, I am confused about how the RepeatedModuleList applies layer normalization multiple times and how it ensures that the normalizations are not shared.

Here is the relevant code snippet for the MiniBlock class: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L144

Thank you.

gudrb · Jul 01 '24 07:07

Hi @gudrb,

The following code creates a list of LayerNorm modules, where the number of copies is repeated_times. https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L145-L146

RepeatedModuleList selects the self._repeated_id-th LayerNorm in its forward pass.

https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L28-L29

In RepeatedMiniBlock, _repeated_id is updated between repeats. Therefore, each LayerNorm, conv, and RPE is executed only once, while the other (weight-shared) modules are executed multiple times.

https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L174-L180
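
To make the dispatch mechanism concrete, here is a hedged, simplified sketch (not the repo's exact code) of how a RepeatedModuleList-style container can keep the LayerNorms unshared while the surrounding block weights are shared:

```python
import torch.nn as nn

class RepeatedModuleList(nn.ModuleList):
    """Holds repeated_times independent copies of a module (e.g. LayerNorm)
    and forwards through the copy selected by _repeated_id."""
    def __init__(self, repeated_times, module_cls, *args, **kwargs):
        super().__init__(module_cls(*args, **kwargs)
                         for _ in range(repeated_times))
        self._repeated_id = 0  # which copy to use; set by the outer block

    def forward(self, *args, **kwargs):
        return self[self._repeated_id](*args, **kwargs)

# A RepeatedMiniBlock-style wrapper would then run the shared block body
# repeated_times times, setting norm._repeated_id = i before pass i, so
# each (unshared) LayerNorm copy fires exactly once while the attention
# and MLP weights are reused on every pass.
```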

wkcn · Jul 02 '24 01:07

Hello,

Thank you for your kind reply.

I noticed that relative position encoding (RPE) is applied only to the key. In the MiniViT paper, I couldn't see it applied explicitly in the equations.

[Screenshot of the attention equations from the MiniViT paper.]

Does this mean that K_m^T already represents the keys with the relative position encoding applied (using the piecewise function, product method, contextual mode, and unshared parameters)?

Thank you!

gudrb · Jul 02 '24 07:07

Hi @gudrb, here is where the weight transformation is applied.

https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L103-L109
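
As a rough sketch of what such a weight transformation can look like (the exact operation and placement are in the linked MiniAttention code; the names here are illustrative), a small unshared linear layer mixes the shared attention maps across heads:

```python
import torch.nn as nn

class AttnTransform(nn.Module):
    """Sketch of an attention transformation: an unshared linear layer
    that mixes attention maps across heads, so each repeated layer gets
    its own variation of the shared attention."""
    def __init__(self, num_heads):
        super().__init__()
        self.proj = nn.Linear(num_heads, num_heads)

    def forward(self, attn):
        # attn: (B, heads, N, N) attention weights from the shared layer
        attn = attn.permute(0, 2, 3, 1)  # heads to last dim for nn.Linear
        attn = self.proj(attn)           # per-repeat head mixing
        return attn.permute(0, 3, 1, 2)  # restore (B, heads, N, N)
```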

wkcn · Jul 03 '24 02:07

[Screenshot of the attention equations from the MiniViT paper.]

In the equations provided in the MiniViT paper, does K_m^T actually represent (K'_m + r_m)^T, where the r_m are trainable relative position encodings? In the code, iRPE is used, but this notation is not explicitly shown in the paper's equations. Could you confirm whether this interpretation is correct?

gudrb · Jul 03 '24 03:07

In Equation 7, we omit the relative position encoding. iRPE is applied only in Mini-DeiT.

wkcn · Jul 03 '24 04:07