ml-cvnets
Why does MobileViT lose neither the patch order nor the spatial order of pixels within each patch?
Hi, Thank you for the great work.
I am sorry, but I don't understand why MobileViT loses neither the patch order nor the spatial order of pixels within each patch.
In Figure 4, I think the blue blocks are still permutation equivariant.
Would you mind giving an explanation?
By reading the author's paper, I have the same question.
MobileViT unfolds an input with shape [Batch, Channels, Height, Width] into [Batch, Number of pixels per patch, Number of patches, Channels] and learns global representations using transformers. The output of the transformer has the same shape, [Batch, Number of pixels per patch, Number of patches, Channels]. The folding operation applied to this resulting tensor produces an output of shape [Batch, Channels, Height, Width].
Unlike standard transformers, MobileViT does not reduce the number of pixels per patch. Therefore, after the folding operation, we can produce an output of the same spatial size as the input.
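The shape round-trip above can be sketched with plain numpy reshapes and transposes. This is a minimal illustration of the unfold/fold idea, not the repo's actual implementation (the function names, patch sizes, and axis conventions here are our own assumptions): because unfold and fold are fixed, deterministic index permutations that are exact inverses of each other, folding the (un-modified) unfolded tensor recovers the input exactly, so neither patch order nor within-patch pixel order is lost.

```python
import numpy as np

# Illustrative sketch only; names and patch sizes are our own assumptions.
B, C, H, W = 1, 3, 4, 4    # input tensor dimensions
ph, pw = 2, 2              # patch height and width

def unfold(x):
    # [B, C, H, W] -> [B, pixels-per-patch, num-patches, C]
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // ph, ph, W // pw, pw)
    x = x.transpose(0, 3, 5, 2, 4, 1)      # [B, ph, pw, H/ph, W/pw, C]
    return x.reshape(B, ph * pw, (H // ph) * (W // pw), C)

def fold(x, H, W):
    # [B, pixels-per-patch, num-patches, C] -> [B, C, H, W]
    B, P, N, C = x.shape
    x = x.reshape(B, ph, pw, H // ph, W // pw, C)
    x = x.transpose(0, 5, 3, 1, 4, 2)      # [B, C, H/ph, ph, W/pw, pw]
    return x.reshape(B, C, H, W)

x = np.arange(B * C * H * W, dtype=float).reshape(B, C, H, W)
y = fold(unfold(x), H, W)  # identity "transformer": fold exactly inverts unfold
assert np.array_equal(x, y)
```

Since both operations are pure index permutations, any position-wise transformer output placed between them is folded back into its original spatial location.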