JackieWu
Hi @gudrb , thanks for your interest in our work! In Mini-DeiT, the transformation for the MLP is the relative position encoding: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L117 In Mini-Swin, the transformation for the MLP is the...
> On the MiniViT paper,
>
> We make several modifications on DeiT: First, we remove the [class] token. The model is attached with a global average pooling layer and...
Hi @gudrb , The following code creates a list of `repeated_times` LayerNorm layers. https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L145-L146 `RepeatedModuleList` selects the `self._repeated_id`-th LayerNorm for the forward pass. https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L28-L29 In `RepeatedMiniBlock`,...
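To make the idea concrete, here is a minimal pure-Python sketch (not the repo's actual PyTorch implementation — the toy "modules" and factory below are illustrative stand-ins for per-repeat LayerNorms):

```python
class RepeatedModuleList:
    """Sketch of the idea: hold one sub-module per repeat of a
    weight-shared block, and forward through the sub-module selected
    by the current repeat index (`_repeated_id`)."""

    def __init__(self, modules):
        self.modules = list(modules)
        self._repeated_id = 0  # set externally before each forward pass

    def __call__(self, x):
        # dispatch to the sub-module belonging to the current repeat
        return self.modules[self._repeated_id](x)


# toy sub-modules standing in for per-repeat LayerNorms:
# each one adds a distinct per-repeat offset k
norms = [lambda x, k=k: [v + k for v in x] for k in range(3)]
rml = RepeatedModuleList(norms)

rml._repeated_id = 0
print(rml([1, 2]))  # → [1, 2]
rml._repeated_id = 2
print(rml([1, 2]))  # → [3, 4]
```

The point is that the heavy weights of the block are shared across repeats, while each repeat keeps its own lightweight normalization parameters.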
Hi @gudrb , here is where the weight transformation is applied: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L103-L109
In Equation 7, we ignore the relative position encoding. iRPE is only applied in Mini-DeiT.
Sorry for the late reply. I have fixed it in https://github.com/wkcn/LookaheadOptimizer-mx/commit/d36ac1d9b4c37e28e7c48120c0c67c8a2b220ddd Thank you for reporting it : )
Thank you for pointing it out! This implementation doesn't reset the momentum in the outer loop. I will try to fix it.
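For context, a minimal pure-Python sketch of Lookahead's outer-loop update (slow weights pulled toward fast weights every k inner steps); the function name and the momentum-reset note are illustrative, not the repo's API:

```python
def lookahead_outer_step(slow, fast, alpha=0.5):
    """One Lookahead outer-loop update (sketch).

    slow <- slow + alpha * (fast - slow), then the fast weights are
    reset to the updated slow weights.
    """
    new_slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    # fast weights restart from the updated slow weights
    new_fast = list(new_slow)
    # NOTE: the inner optimizer's momentum buffers may also need to be
    # reset at this point -- the detail discussed in the issue above.
    return new_slow, new_fast


slow, fast = [0.0, 0.0], [1.0, 2.0]
slow, fast = lookahead_outer_step(slow, fast)
print(slow)  # → [0.5, 1.0]
```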
Hi @WeiSQ-zju , thanks for your interest in our work! 1. The master weight is an FP16 tensor with a scaling factor. It will be converted to the weight, which...
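A minimal stdlib-only sketch of the "FP16 data + scaling factor" idea, using `struct`'s half-precision format `'e'` to emulate FP16 storage (the `ScaledFP16` class and its method names are hypothetical, not MS-AMP's actual API):

```python
import struct


def to_fp16(x):
    # round-trip through IEEE 754 half precision via struct's 'e' format
    return struct.unpack('e', struct.pack('e', x))[0]


class ScaledFP16:
    """Sketch: store value / scale quantized to FP16, plus the scale.

    The scale keeps the stored FP16 payload inside half-precision range
    (plain FP16 overflows above ~65504).
    """

    def __init__(self, value, scale):
        self.scale = scale
        self.data = to_fp16(value / scale)

    def to_float(self):
        # convert the scaled FP16 payload back to a full-precision weight
        return self.data * self.scale


# 70000 would overflow plain FP16, but fits once divided by the scale
w = ScaledFP16(70000.0, scale=16.0)
print(w.to_float())
```

The conversion back is lossy only up to FP16's precision at the scaled magnitude, which is the trade-off the scaling factor manages.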
@WeiSQ-zju Sorry for the late reply. > That's to say the overflow ratio of g'i is less than 0.001%, but when N is large, will the overflow ratio of g is...
Thanks for your interest in our work! We will add support for MS-AMP in FSDP : )