CLIP: How can I get the feature of every word in the text instead of the text feature?

`class CLIPTransformer(nn.Module):
    def __init__(self, config: Config):
        super(CLIPTransformer, self).__init__()
        self.config = config

        if self.config.huggingface:
            from transformers import CLIPModel
            self.clip = CLIPModel.from_pretrained(self.config.clip_type)
            # downloads pytorch_model.bin (577M)
        else:
            from model.clip_model import load_clip
            self.clip = load_clip(config.clip_arch)

        config.pooling_type = 'transformer'
        self.pool_frames = Transformer(config)

    def forward(self, data, return_all_frames=False):
        batch_size = data['video'].shape[0]
        text_data = data['text']
        video_data = data['video']
        video_data = video_data.reshape(-1, 3, self.config.input_res, self.config.input_res)

        if self.config.huggingface:
            text_features = self.clip.get_text_features(**text_data)  # 512-dim, shape [8, 512]
            video_features = self.clip.get_image_features(video_data)  # shape [96, 512]
        else:
            text_features = self.clip.encode_text(text_data)
            video_features = self.clip.encode_image(video_data)`

I want to get the features of all the words in the text, instead of the single text feature returned by text_features = self.clip.get_text_features(**text_data). How can I do that?

Sep 01 '22 13:09 Cola-any

Hi,

I have the same question. Have you solved it?

Nov 19 '22 16:11 yangzhao1230

Sure. Locate the library source behind "from transformers import CLIPModel". In get_text_features you will find the following code:

text_outputs = self.text_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)

pooled_output = text_outputs[1]
text_features = self.text_projection(pooled_output)
return text_features

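Here text_outputs[1] is the pooled output, i.e. the sentence-level feature before projection, while text_outputs[0] is the last hidden state, which holds one vector per token. Modify the code as follows: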

text_outputs = self.text_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)

pooled_output = text_outputs[1]
# last hidden state: one vector per token, shape [batch, seq_len, hidden]
word_features = text_outputs[0]
text_features = self.text_projection(pooled_output)
# project every token into the same space as text_features
word_features = self.text_projection(word_features)

return text_features, word_features

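word_features now holds one projected feature vector per token. Note that CLIP tokenizes text into sub-word units, so a long or rare word may span several tokens.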
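If you would rather not edit the installed library file, the same two steps can probably be done from the outside by calling the text_model and text_projection submodules of CLIPModel directly. Below is a minimal sketch of that idea. The tiny config, the token ids and the names token_features / text_features are made up for illustration so that it runs without downloading weights; in practice you would load your checkpoint with CLIPModel.from_pretrained and build input_ids with the matching tokenizer.

```python
import torch
from transformers import CLIPConfig, CLIPModel

# Hypothetical tiny config so the sketch runs without a download.
# In real use: model = CLIPModel.from_pretrained(<your checkpoint>)
small = {"hidden_size": 32, "intermediate_size": 64,
         "num_hidden_layers": 2, "num_attention_heads": 2}
config = CLIPConfig(
    text_config={**small, "vocab_size": 100, "max_position_embeddings": 16,
                 "pad_token_id": 0, "bos_token_id": 98, "eos_token_id": 99},
    vision_config={**small, "image_size": 32, "patch_size": 16},
    projection_dim=24,
)
model = CLIPModel(config).eval()

# Made-up token ids: 98 = start, 99 = end, 0 = padding.
input_ids = torch.tensor([[98, 5, 6, 7, 99, 0],
                          [98, 8, 9, 99, 0, 0]])
attention_mask = (input_ids != 0).long()

with torch.no_grad():
    out = model.text_model(input_ids=input_ids, attention_mask=attention_mask)
    # One projected vector per token: [batch, seq_len, projection_dim]
    token_features = model.text_projection(out.last_hidden_state)
    # Sentence-level feature, as in get_text_features: [batch, projection_dim]
    text_features = model.text_projection(out.pooler_output)

print(token_features.shape, text_features.shape)
```

Use attention_mask to ignore the padded positions when you consume token_features. In the Hugging Face implementation the pooled output is taken at the end-of-text token, so text_features should match the row of token_features at that position.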

Nov 21 '22 01:11 Cola-any
