InternVL
Can I extract image and text features separately in the InternVL-G model?
Hi, can I extract image and text features separately with the InternVL-G model? When reading the code, I found that the cross-attention layers in QLLaMA are shared parameters between the image and text feature branches, but there seems to be some kind of interaction between the two in Figure 4 of the paper, similar to the Q-Former in BLIP-2. So can I use model.encode_image() or model.encode_text() individually?
Hello, of course. You can use model.encode_image() and model.encode_text() individually.
The following code is the forward function of InternVL, which works like a standard CLIP model. You can call model.encode_image(image, mode) and model.encode_text(text) individually.
def forward(self, image, text, mode='InternVL-C'):
    assert mode in ['InternVL-C', 'InternVL-G'], 'mode must be InternVL-C or InternVL-G'
    image_features = self.encode_image(image, mode)
    text_features = self.encode_text(text)
    # normalized features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    return logits_per_image, logits_per_text
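For example, a minimal sketch of calling the two encoders on their own (here model, pixel_values, and input_ids are placeholders for an already-loaded InternVL model and its preprocessed image/text inputs):

import torch

with torch.no_grad():
    # image features only; the mode argument selects the InternVL-C or InternVL-G image branch
    image_features = model.encode_image(pixel_values, mode='InternVL-G')
    # text features only; no image information is involved in this call
    text_features = model.encode_text(input_ids)

# optionally L2-normalize, exactly as in forward()
image_features = image_features / image_features.norm(dim=1, keepdim=True)
text_features = text_features / text_features.norm(dim=1, keepdim=True)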
How about getting a multi-modal embedding? Something like the output of the Q-Former in BLIP, which I think corresponds to the output of QLLaMA in your proposed work.
You can use this function to get the output embedding of QLLaMA:
https://github.com/OpenGVLab/InternVL/blob/main/clip_benchmark/clip_benchmark/models/internvl_huggingface/modeling_internvl.py#L337
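In case a concrete call helps, here is a rough sketch. Note that get_qllama_embedding is only a placeholder name for the function at the linked line; its real name and arguments should be taken from modeling_internvl.py, and model, pixel_values, and input_ids again stand for a loaded InternVL model and preprocessed inputs:

import torch

with torch.no_grad():
    # placeholder call: replace get_qllama_embedding with the function defined
    # at modeling_internvl.py#L337 linked above
    qllama_embeds = get_qllama_embedding(model, pixel_values, input_ids)

# qllama_embeds would then be the QLLaMA output embedding you describe,
# analogous to the Q-Former output in BLIP-2.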
During testing, it seems that only image information is used; is there no text information acting as a guide?