Why does InternVL3 use class_embedding in the code but discard it later?
I noticed that InternVL3's code includes a class_embedding variable, but it seems to be discarded or unused in later stages. Could you clarify:
What was the original purpose of this class_embedding?
Why was it removed or left unused in the final implementation?
Are there plans to repurpose it in future updates?
Where is the code you are referring to? Is it this one? https://github.com/OpenGVLab/InternVL/blob/51ac0b1daf0589c00c760681470006768b396290/clip_benchmark/clip_benchmark/models/internvl_huggingface/modeling_internvl.py#L60
This class_embedding is introduced during the vision-only pre-training stage, where it is used to compute the CLIP loss. We remove it afterwards to ensure compatibility with pixel shuffle: pixel shuffle compresses the visual tokens over the 2D patch grid, and the class_embedding token has no position in that grid, so it becomes redundant in this case.
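To make the reply above concrete, here is a minimal NumPy sketch of the idea, not the actual InternVL code. The function name `drop_cls_and_pixel_shuffle`, the `scale` parameter, and the channel ordering of the merged tokens are assumptions made for illustration; the real implementation may order the merged channels differently. It assumes the class token sits at index 0 of the ViT output and that the remaining patch tokens form a square grid.

```python
import numpy as np


def drop_cls_and_pixel_shuffle(tokens: np.ndarray, scale: int = 2) -> np.ndarray:
    """Drop the leading class token, then merge each scale x scale patch block.

    tokens: array of shape (batch, 1 + side * side, dim), class token first.
    Returns an array of shape (batch, (side // scale) ** 2, scale * scale * dim).
    Hypothetical helper for illustration only.
    """
    # The class token has no 2D position, so it is removed before reshaping.
    patches = tokens[:, 1:, :]
    batch, num_patches, dim = patches.shape

    side = int(round(num_patches ** 0.5))
    if side * side != num_patches:
        raise ValueError("patch tokens do not form a square grid")
    if side % scale != 0:
        raise ValueError("grid side must be divisible by scale")

    # Lay the patch tokens out on their 2D grid, split into scale x scale blocks.
    grid = patches.reshape(batch, side // scale, scale, side // scale, scale, dim)
    # Bring the two in-block axes next to the channel axis.
    grid = grid.transpose(0, 1, 3, 2, 4, 5)
    # Fold each block into one token with scale * scale times more channels.
    return grid.reshape(batch, (side // scale) ** 2, scale * scale * dim)


if __name__ == "__main__":
    # 1 class token + 4 x 4 patch tokens, 8 channels each.
    vit_output = np.zeros((1, 1 + 16, 8), dtype=np.float32)
    merged = drop_cls_and_pixel_shuffle(vit_output, scale=2)
    print(merged.shape)
```

With the class token left in, the sequence length would be 17, which cannot be reshaped into a square grid, so the token has to be dropped (or handled separately) before this step.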