Why does MobileViT lose neither the patch order nor the spatial order of pixels within each patch?

Open ShoufaChen opened this issue 2 years ago • 2 comments

Hi, thank you for the great work.

I am sorry, but I don't understand why MobileViT loses neither the patch order nor the spatial order of pixels within each patch.

In Figure 4, I think blue blocks are still permutation equivariant.

Would you mind giving an explanation?

ShoufaChen avatar Mar 20 '22 08:03 ShoufaChen

By reading the author's paper, I have the same question.

WYHZQ avatar Oct 13 '22 13:10 WYHZQ

MobileViT unfolds an input of shape [Batch, Channels, Height, Width] into [Batch, Number of pixels per patch, Number of patches, Channels] and learns global representations using transformers. The output of the transformer has the same shape, [Batch, Number of pixels per patch, Number of patches, Channels]. The folding operation applied to this tensor produces an output of shape [Batch, Channels, Height, Width].

Unlike transformers, MobileViT does not compress the number of pixels per patch. Therefore, after the folding operation, we are able to produce an output of the same size as the input.
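
To make the shapes concrete, here is a minimal NumPy sketch of the unfold/fold round trip. It is not the ml-cvnets implementation: the function names `unfold` and `fold` and the patch size arguments `ph`, `pw` are hypothetical, it assumes Height and Width are divisible by the patch size, and it leaves out the transformer (which, as noted above, keeps the tensor shape unchanged).

```python
import numpy as np

def unfold(x, ph, pw):
    """[B, C, H, W] -> [B, P, N, C] with P = ph*pw pixels per patch, N patches.

    Hypothetical helper for illustration; assumes H % ph == 0 and W % pw == 0.
    """
    B, C, H, W = x.shape
    nh, nw = H // ph, W // pw
    x = x.reshape(B, C, nh, ph, nw, pw)   # split H into (nh, ph), W into (nw, pw)
    x = x.transpose(0, 3, 5, 2, 4, 1)     # [B, ph, pw, nh, nw, C]
    return x.reshape(B, ph * pw, nh * nw, C)

def fold(y, ph, pw, H, W):
    """[B, P, N, C] -> [B, C, H, W]; exact inverse of unfold above."""
    B, P, N, C = y.shape
    nh, nw = H // ph, W // pw
    y = y.reshape(B, ph, pw, nh, nw, C)
    y = y.transpose(0, 5, 3, 1, 4, 2)     # [B, C, nh, ph, nw, pw]
    return y.reshape(B, C, H, W)

# Small example: batch 2, 3 channels, 4x6 image, 2x2 patches.
x = np.arange(2 * 3 * 4 * 6).reshape(2, 3, 4, 6)
u = unfold(x, 2, 2)
print(u.shape)                            # (batch, pixels per patch, patches, channels)
print(np.array_equal(fold(u, 2, 2, 4, 6), x))
```

In this sketch, the pixel at image position (h, w) always lands at pixel index `(h % ph) * pw + (w % pw)` and patch index `(h // ph) * nw + (w // pw)`, so every pixel keeps its own slot along both axes and folding can put it back where it came from.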

sacmehta avatar Oct 30 '22 02:10 sacmehta