
Question about ViCO loss

Open · youngxie0702 opened this issue · 2 comments

Thank you for releasing such amazing work. I would like to ask a few questions regarding the ViCO loss described in the Visual Consistency Learning section.

  1. The compression rate is uniformly sampled from {1/4, 1/16}. However, the input dimension of the MLP after the pixel-shuffle operation depends on the compression rate. How is the input dimension of the MLP set during the training process?

  2. The ViCO loss computes the KL divergence between the reference model and the policy model. Is the KL loss computed per token and then averaged?

Looking forward to your reply ~ @czczup

youngxie0702 · Sep 04 '25

Thank you for your interest in our work!

Regarding the compression rate: we randomly compress each ViT feature at the tile level to either 1/4 or 1/16 of its original token count. Each compressed feature is then fed into the MLP that corresponds to its compression rate for projection, so each projector always sees a fixed input dimension matching its pixel-shuffle output. The resulting tokens, with mixed compression rates, are interleaved and passed to the LLM. For example, if the ViT outputs [1024, 1024, 1024, 1024] tokens for four tiles, these might be randomly compressed to [256, 64, 256, 64] tokens before being input to the LLM.
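
As a rough sketch of that flow (not the actual InternVL code; the tile/feature dimensions, the two-projector layout, and all names such as `pixel_shuffle_compress`, `mlp_1_4`, and `mlp_1_16` are illustrative assumptions):

```python
import torch
import torch.nn as nn

def pixel_shuffle_compress(x, scale):
    """Merge scale x scale neighborhoods of ViT tokens into one token with scale^2 x channels."""
    n, h, w, c = x.shape                                    # [tiles, H, W, C]
    x = x.reshape(n, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(n, h // scale, w // scale, c * scale * scale)
    return x.flatten(1, 2)                                  # [tiles, H*W/scale^2, C*scale^2]

vit_dim, llm_dim = 1024, 4096                               # hypothetical dimensions
mlp_1_4  = nn.Linear(vit_dim * 4,  llm_dim)                 # projector for 1/4-compressed tiles
mlp_1_16 = nn.Linear(vit_dim * 16, llm_dim)                 # projector for 1/16-compressed tiles

tiles = torch.randn(4, 32, 32, vit_dim)                     # 4 tiles, 32x32 = 1024 tokens each
rates = torch.randint(0, 2, (4,))                           # per-tile choice: 0 -> 1/4, 1 -> 1/16

visual_tokens = []
for tile, r in zip(tiles, rates):
    if r == 0:
        visual_tokens.append(mlp_1_4(pixel_shuffle_compress(tile[None], 2)))   # [1, 256, llm_dim]
    else:
        visual_tokens.append(mlp_1_16(pixel_shuffle_compress(tile[None], 4)))  # [1, 64, llm_dim]

llm_visual_input = torch.cat(visual_tokens, dim=1)          # mixed-rate tokens, e.g. 256 + 64 + ...
```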

For the ViCO loss: yes, the KL divergence is computed per token and then averaged, following the standard procedure used in most LLM distillation setups.
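
A minimal sketch of such a per-token KL average, assuming a reference-to-policy KL direction and a 0/1 supervision mask (the function name `vico_kl_loss` and these exact details are illustrative assumptions, not the repo's implementation):

```python
import torch
import torch.nn.functional as F

def vico_kl_loss(policy_logits, ref_logits, loss_mask):
    """Per-token KL(ref || policy) over the vocabulary, averaged over supervised tokens."""
    # policy_logits / ref_logits: [batch, seq_len, vocab]; loss_mask: [batch, seq_len] in {0, 1}
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    pol_logp = F.log_softmax(policy_logits, dim=-1)
    kl_per_token = (ref_logp.exp() * (ref_logp - pol_logp)).sum(dim=-1)   # [batch, seq_len]
    return (kl_per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1)

# usage with dummy tensors
policy = torch.randn(2, 8, 32000)
reference = torch.randn(2, 8, 32000)
mask = torch.ones(2, 8)
loss = vico_kl_loss(policy, reference, mask)   # scalar tensor
```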

By the way, we will soon release the InternVL-Flash model corresponding to ViCO.

Sorr7maker · Sep 08 '25

Thanks for your reply~ I have another question: Do you have plans to support Flash Attention 3?

youngxie0702 · Sep 10 '25