ml-cvnets
Why does MobileViT lose neither the patch order nor the spatial order of pixels within each patch?
Hi, Thank you for the great work.
I am sorry, but I don't understand why MobileViT loses neither the patch order nor the spatial order of pixels within each patch.
In Figure 4, I think the blue blocks are still permutation equivariant.
Would you mind giving an explanation?
By reading the author's paper, I have the same question.
MobileViT unfolds an input with shape [Batch, Channels, Height, Width] into [Batch, Number of pixels per patch, Number of patches, Channels] and learns global representations using transformers. The output of the transformer has the same shape, [Batch, Number of pixels per patch, Number of patches, Channels]. The folding operation applied to this resulting tensor produces an output of shape [Batch, Channels, Height, Width].
Unlike standard transformers, MobileViT does not reduce the number of pixels per patch. Therefore, after the folding operation, we can produce an output of the same spatial size as the input.
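The shape round-trip above can be sketched with plain numpy reshapes and transposes. This is a minimal illustration of the unfold/fold idea, not the repo's actual implementation (the function names, patch sizes, and axis conventions here are our own assumptions): because unfold and fold are fixed, deterministic index permutations that are exact inverses of each other, folding the (un-modified) unfolded tensor recovers the input exactly, so neither patch order nor within-patch pixel order is lost.

```python
import numpy as np

# Illustrative sketch only; names and patch sizes are our own assumptions.
B, C, H, W = 1, 3, 4, 4    # input tensor dimensions
ph, pw = 2, 2              # patch height and width

def unfold(x):
    # [B, C, H, W] -> [B, pixels-per-patch, num-patches, C]
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // ph, ph, W // pw, pw)
    x = x.transpose(0, 3, 5, 2, 4, 1)      # [B, ph, pw, H/ph, W/pw, C]
    return x.reshape(B, ph * pw, (H // ph) * (W // pw), C)

def fold(x, H, W):
    # [B, pixels-per-patch, num-patches, C] -> [B, C, H, W]
    B, P, N, C = x.shape
    x = x.reshape(B, ph, pw, H // ph, W // pw, C)
    x = x.transpose(0, 5, 3, 1, 4, 2)      # [B, C, H/ph, ph, W/pw, pw]
    return x.reshape(B, C, H, W)

x = np.arange(B * C * H * W, dtype=float).reshape(B, C, H, W)
y = fold(unfold(x), H, W)  # identity "transformer": fold exactly inverts unfold
assert np.array_equal(x, y)
```

Since both operations are pure index permutations, any position-wise transformer output placed between them is folded back into its original spatial location.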