Some questions about encoder-free VLMs
What wonderful work the EVE series is! I have a few questions I would like to ask, and I would be very grateful for your answers.
Firstly, regarding the convolutional architecture: can the dynamic-resolution strategy used in the InternVL series [1] enhance performance? I have observed that directly increasing the input resolution without dynamic resolution, as in ConvLLaVA, keeps the token count growing slowly while performance remains normal. However, other encoder-free models such as Mono-InternVL [2] and HoVLE [3] do employ dynamic resolution. In your opinion, should encoder-free models use dynamic resolution? (A sketch of the tiling scheme I have in mind follows below.)
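For concreteness, here is a minimal sketch of the dynamic-resolution tiling I mean, roughly following the InternVL recipe: pick the tile grid whose aspect ratio best matches the image, resize, crop into tiles, and optionally append a thumbnail as a global view. The tile size (448), tile budget, and thumbnail option are illustrative assumptions, not the exact values used by any of these models.

```python
# Hedged sketch of InternVL-style dynamic-resolution tiling (my understanding;
# tile size and tile budget are illustrative, not the exact model settings).
from PIL import Image

def dynamic_tile(image: Image.Image, tile=448, max_tiles=6, use_thumbnail=True):
    w, h = image.size
    aspect = w / h
    # Enumerate all (cols, rows) grids whose total tile count fits the budget.
    grids = [(c, r) for c in range(1, max_tiles + 1)
                    for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    # Pick the grid whose aspect ratio best matches the image's.
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))
    resized = image.resize((cols * tile, rows * tile))
    tiles = [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
             for r in range(rows) for c in range(cols)]
    if use_thumbnail and len(tiles) > 1:
        tiles.append(image.resize((tile, tile)))  # low-res global view
    return tiles
```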
Secondly, for both encoder-free and encoder-based models, the attention maps in the first few layers typically show relatively weak interaction between user-prompt tokens and vision tokens [2]. Do you think this is a key factor limiting the performance of encoder-free models? (A sketch of how such interaction could be quantified follows below.)
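For reference, here is roughly how I quantify that interaction. It assumes a Hugging Face-style model run with `output_attentions=True`; `prompt_idx` and `vision_idx` are placeholder token positions that depend on the chat template, not something any EVE API provides.

```python
def prompt_to_vision_attention(attentions, prompt_idx, vision_idx):
    """Per-layer attention mass that prompt-token queries place on vision-token keys.

    attentions: tuple of [batch, heads, seq, seq] tensors (one per layer),
    e.g. from `model(**inputs, output_attentions=True).attentions`.
    prompt_idx / vision_idx: lists of token positions for each modality (assumed
    known from the chat template).
    """
    per_layer = []
    for layer_attn in attentions:
        a = layer_attn.mean(dim=1)                             # average heads -> [batch, seq, seq]
        mass = a[:, prompt_idx][:, :, vision_idx].sum(dim=-1)  # [batch, n_prompt]
        per_layer.append(mass.mean().item())
    return per_layer  # low early-layer values = weak prompt-vision interaction
```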
[1] InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
[2] Mono-InternVL: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training
[3] HoVLE: Unleashing the power of monolithic vision-language models with holistic vision-language embedding
Sorry for the late reply.
(1) Dynamic resolution matters for real-world use, so we believe it is necessary. In EVEv2, our results show that “AnyRatio HD” gives the best performance overall. “AnyResolution” starts off worse, likely due to limited and imbalanced training data, but improves as data scale increases. We think that with enough well-balanced data, AnyResolution can offer better efficiency and flexibility for handling real-world images of different sizes.
(2) We observed the same phenomenon in our experiments. This may suggest that visual and textual tokens are encoded independently in the early layers of encoder-free VLMs, highlighting the need for modality-specific structural disentanglement as implemented in EVEv2.
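To make the idea concrete, here is a minimal sketch of what such modality-specific disentanglement could look like: each modality gets its own normalization, attention projections, and feed-forward weights, while attention still runs over the joint token sequence. This is only an illustration of the principle; module names, routing, and dimensions are placeholders, not EVEv2's actual implementation.

```python
# Illustrative sketch of modality-specific disentanglement; not EVEv2's real layout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDisentangledBlock(nn.Module):
    """One transformer block with per-modality weights but joint attention."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.heads = heads
        # Index 0 = text weights, index 1 = vision weights.
        self.norm1 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2))
        self.qkv = nn.ModuleList(nn.Linear(dim, 3 * dim) for _ in range(2))
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(2))
        self.norm2 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2))

    def _route(self, layers, x, is_vision):
        # Text positions use layers[0], vision positions use layers[1].
        # (Computing both branches for all tokens is wasteful but keeps the sketch simple.)
        return torch.where(is_vision.unsqueeze(-1), layers[1](x), layers[0](x))

    def forward(self, x, is_vision):
        # x: [batch, seq, dim]; is_vision: [batch, seq] boolean modality mask.
        b, s, d = x.shape
        qkv = self._route(self.qkv, self._route(self.norm1, x, is_vision), is_vision)
        q, k, v = (t.view(b, s, self.heads, -1).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        out = F.scaled_dot_product_attention(q, k, v)  # joint attention (causal mask omitted)
        out = out.transpose(1, 2).reshape(b, s, d)
        x = x + self._route(self.proj, out, is_vision)
        x = x + self._route(self.ffn, self._route(self.norm2, x, is_vision), is_vision)
        return x
```

The design point is that early layers can specialize per modality, which is where we observed the weak prompt-vision interaction, without giving up cross-modal attention in the joint sequence.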
Thanks for your reply!