About the raw token length

Open srymaker opened this issue 1 year ago • 2 comments

[Image: Screenshot 2024-08-23 11 46 49]

Thanks for your great work! I'd like to know how you compute the raw token length, such as the 729 shown in the image.

Aug 23 '24 03:08 srymaker

Hi, you can find the raw vision token length in two ways:

1. Print the output shape of the visual features from the ViT in an MLLM. The shape looks like (batch size, number of visual tokens, token dimension), and the second entry is the raw token length.
2. Calculate it directly from the input image resolution (e.g., 384px) and the ViT patch size (e.g., 14x14). For example, with a 384px input image and the openai/clip-vit-large-patch14 model, where "patch14" means the image is divided into patches of 14x14 pixels, each side holds 384//14 = 27 patches, so the visual token length is (384//14)**2 = 27**2 = 729.
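The second method can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `raw_vision_token_length` and the `has_cls_token` flag are hypothetical and do not come from any repository code. The flag is there because some ViT encoders prepend one extra class token to the patch tokens, which would change the count you see when printing the feature shape.

```python
def raw_vision_token_length(image_size: int, patch_size: int,
                            has_cls_token: bool = False) -> int:
    """Estimate the number of visual tokens a ViT produces for a square image.

    Hypothetical helper for illustration; assumes non-overlapping square
    patches and that leftover pixels at the border are dropped.
    """
    # Number of whole patches that fit along one side of the image.
    patches_per_side = image_size // patch_size
    # One token per patch on a square grid.
    num_tokens = patches_per_side ** 2
    # Optionally count one extra class token (assumption: encoder-dependent).
    return num_tokens + 1 if has_cls_token else num_tokens


# The example from this thread: 384px input, 14x14 patches.
print(raw_vision_token_length(384, 14))  # 729
```

If the number printed from the ViT output shape is one larger than this estimate, a class token is the likely reason.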

Aug 27 '24 03:08 yaolinli

Thank you for your patience.

Aug 27 '24 06:08 srymaker