DeCo
About the raw token length
Thanks for your great work! I'd like to know how you compute the raw token length, such as the 729 shown in the image.
Hi, you can find the raw vision token length in two ways:
- Print the output shape of the visual features from the ViT in an MLLM, which looks like (batch size, number of visual tokens, token dimension);
- Calculate it directly from the input image resolution (e.g., 384px) and the ViT patch size (e.g., 14x14). For example, with a 384px input image and the openai/clip-vit-large-patch14 model, where "patch14" means the image is divided into 14x14-pixel patches, the visual token length would be (384//14)**2 = 27**2 = 729.
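For reference, the second approach can be sketched in a few lines of Python. This is only an illustration, not code from the DeCo repo: the helper name `num_visual_tokens` is hypothetical, and it assumes a square input image, non-overlapping patches, and that any extra class token is not counted.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Count patch tokens for a square image (assumption: no class token counted)."""
    # Floor division: pixels that do not fill a whole patch are dropped.
    patches_per_side = image_size // patch_size
    # The patches form a square grid, so the token count is the side squared.
    return patches_per_side ** 2


# 384 // 14 = 27 patches per side, so 27 * 27 tokens in total.
print(num_visual_tokens(384, 14))  # 729
```

If the number printed from the ViT output shape (the first approach) is larger by one, the model is likely prepending a class token to the patch tokens.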
Thank you for your patience.