
Can I extract image and text features separately in the InternVL-G model?

zhangleiedu opened this issue on Dec 29, 2023 · 4 comments

Hi, can I extract image and text features separately in the InternVL-G model? When reading the code, I found that the cross-attention layers in QLLaMA are parameters shared between the image and text feature branches, but Figure 4 of the paper seems to show some kind of interaction between the two, like the Q-Former in BLIP-2. So can I use model.encode_image() or model.encode_text() individually?

zhangleiedu commented on Dec 29, 2023

Hello, of course. You can use model.encode_image() and model.encode_text() individually.

The following code is the forward function of InternVL. It behaves like a standard CLIP model, so you can call model.encode_image(image, mode) and model.encode_text(text) separately.

  def forward(self, image, text, mode='InternVL-C'):
      assert mode in ['InternVL-C', 'InternVL-G'], 'mode must be InternVL-C or InternVL-G'
      image_features = self.encode_image(image, mode)
      text_features = self.encode_text(text)

      # normalized features
      image_features = image_features / image_features.norm(dim=1, keepdim=True)
      text_features = text_features / text_features.norm(dim=1, keepdim=True)

      # cosine similarity as logits
      logit_scale = self.logit_scale.exp()
      logits_per_image = logit_scale * image_features @ text_features.t()
      logits_per_text = logits_per_image.t()

      return logits_per_image, logits_per_text
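
For completeness, here is a minimal end-to-end sketch of calling the two encoders separately. The checkpoint name, the preprocessing, and the 'summarize:' text prefix are assumptions based on the released InternVL-14B-224px model card, so adjust them to your own setup.

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

    path = 'OpenGVLab/InternVL-14B-224px'  # assumed checkpoint name
    model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                      trust_remote_code=True).cuda().eval()
    image_processor = CLIPImageProcessor.from_pretrained(path)
    tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False, add_eos_token=True)
    tokenizer.pad_token_id = 0  # pad with zeros

    # image branch only
    image = Image.open('example.jpg').convert('RGB')
    pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    # text branch only ('summarize:' prefix assumed from the retrieval examples)
    text = 'summarize: a photo of a dog'
    input_ids = tokenizer([text], return_tensors='pt', max_length=80,
                          truncation=True, padding='max_length').input_ids.cuda()

    with torch.no_grad():
        # 'InternVL-C' returns the plain InternViT features;
        # 'InternVL-G' additionally passes them through QLLaMA's queries
        image_features = model.encode_image(pixel_values, mode='InternVL-G')
        text_features = model.encode_text(input_ids)

    # same normalization and cosine similarity as in forward()
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    similarity = image_features @ text_features.t()

As in the forward function above, the two calls are independent of each other, so either one can be used on its own for unimodal feature extraction.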

czczup commented on Dec 29, 2023

How about getting a multi-modal embedding? Something like the output of the Q-Former in BLIP, which I think corresponds to the output of QLLaMA in your proposed work.

hmd78 commented on Dec 30, 2023

You can use this function to get the output embedding of QLLaMA:

https://github.com/OpenGVLab/InternVL/blob/main/clip_benchmark/clip_benchmark/models/internvl_huggingface/modeling_internvl.py#L337
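
For orientation only, and continuing with the model and pixel_values set up in the sketch earlier in the thread, a call might look roughly like the snippet below. get_qllama_embedding is a hypothetical placeholder for the linked function (take the real name and signature from modeling_internvl.py), and, judging from the follow-up below, it appears to consume only the image.

    import torch

    with torch.no_grad():
        # hypothetical placeholder name for the function linked above
        # (modeling_internvl.py#L337); it returns the QLLaMA output
        # embedding, analogous to BLIP-2's Q-Former output
        qllama_embeds = model.get_qllama_embedding(pixel_values)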

czczup commented on Jan 2, 2024

During testing, it seemed that only image information was used, with no text information as a guide. Is that right?

zhangleiedu commented on Jan 2, 2024