ViT architecture and dynamic resolution

Open JohnnyRacer opened this issue 2 years ago • 0 comments

Hello, I was wondering why NaViT, or a similar architecture, was not used as the vision transformer backbone. Multi-resolution training at native resolution is one of NaViT's defining features (hence the name), and a similar architecture was used for OpenAI's Sora to keep good visual fidelity across differing resolutions. Section 4.1 of the Latte paper states that the model was trained only on square images/videos, so non-square images/videos would have to be resized before they can be processed.
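To make the idea concrete, here is a rough NumPy sketch of the kind of patch packing I mean. It is only an illustration under my own assumptions, not code from NaViT, Sora, or Latte: the function names `patchify` and `pack_patches` are hypothetical, and it assumes each image's height and width are already multiples of the patch size. Images of different shapes are cut into patches at their own resolution, concatenated into one token sequence, and a mask keeps attention from crossing image boundaries.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into tokens of shape (H/patch * W/patch, patch*patch*C).

    Also returns the (row, col) grid index of every patch, which a model could
    feed into factorized positional embeddings. Hypothetical helper.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> one row per patch
    tokens = (
        image.reshape(gh, patch, gw, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, patch * patch * c)
    )
    rows, cols = np.meshgrid(np.arange(gh), np.arange(gw), indexing="ij")
    positions = np.stack([rows.ravel(), cols.ravel()], axis=1)
    return tokens, positions


def pack_patches(images, patch):
    """Pack patches from images of different resolutions into one sequence.

    Returns the tokens, their grid positions, the index of the image each token
    came from, and a boolean attention mask that is True only for token pairs
    belonging to the same image. Hypothetical helper.
    """
    tokens, positions, image_ids = [], [], []
    for i, image in enumerate(images):
        t, p = patchify(image, patch)
        tokens.append(t)
        positions.append(p)
        image_ids.append(np.full(len(t), i))
    tokens = np.concatenate(tokens)
    positions = np.concatenate(positions)
    image_ids = np.concatenate(image_ids)
    attention_mask = image_ids[:, None] == image_ids[None, :]
    return tokens, positions, image_ids, attention_mask
```

With a patch size of 16, a 32x48 image and a 64x32 image contribute 6 and 8 tokens respectively, so neither has to be resized or cropped to a square before entering the transformer.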

Mar 11 '24 22:03 JohnnyRacer