LAVIS
Question about the meaning of BLIP2 embeddings
Hi, I want to know whether the image features and the multimodal features carry positional meaning with respect to the original image.
For example, blip2_feature_extractor produces a tensor of shape (1, 32, 768) for both the image features and the multimodal features. Does each of the 32 entries correspond to the same image patch in both outputs? And do indices 0 to 31 follow the image order, i.e. the order of the patches from the vision encoder's split?
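For reference, here is a minimal sketch of how the two tensors in question can be obtained. It assumes the usual LAVIS feature-extraction calls (`load_model_and_preprocess` and `extract_features`) and the `pretrain` model type. The helper name `extract_blip2_features` and the image path and caption in the usage comment are hypothetical, chosen only for illustration.

```python
# Shape reported in the question for both outputs: (batch, 32, 768).
EXPECTED_SHAPE = (1, 32, 768)


def extract_blip2_features(image_path, caption, device="cpu"):
    """Hypothetical helper: return (image_embeds, multimodal_embeds)."""
    # Heavy dependencies are imported lazily, so defining the helper is cheap.
    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    # Assumption: the "pretrain" model type is the checkpoint in use.
    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip2_feature_extractor",
        model_type="pretrain",
        is_eval=True,
        device=device,
    )

    raw_image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text = txt_processors["eval"](caption)
    sample = {"image": image, "text_input": [text]}

    with torch.no_grad():
        image_out = model.extract_features(sample, mode="image")
        multimodal_out = model.extract_features(sample, mode="multimodal")

    return image_out.image_embeds, multimodal_out.multimodal_embeds


# Usage (hypothetical file and caption; downloads the model weights):
# img, mm = extract_blip2_features("example.jpg", "a photo of a dog")
# print(img.shape, mm.shape)
```

The question is about the second dimension (size 32) of both returned tensors.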