CLIP
How can I get the feature of every word in the text instead of a single text feature?
```python
import torch.nn as nn


class CLIPTransformer(nn.Module):
    def __init__(self, config: Config):
        super(CLIPTransformer, self).__init__()
        self.config = config

        if self.config.huggingface:
            from transformers import CLIPModel
            self.clip = CLIPModel.from_pretrained(self.config.clip_type)
            # downloads pytorch_model.bin (577M)
        else:
            from model.clip_model import load_clip
            self.clip = load_clip(config.clip_arch)

        config.pooling_type = 'transformer'
        self.pool_frames = Transformer(config)

    def forward(self, data, return_all_frames=False):
        batch_size = data['video'].shape[0]
        text_data = data['text']
        video_data = data['video']
        video_data = video_data.reshape(-1, 3, self.config.input_res, self.config.input_res)

        if self.config.huggingface:
            text_features = self.clip.get_text_features(**text_data)  # [8, 512]
            video_features = self.clip.get_image_features(video_data)  # [96, 512]
        else:
            text_features = self.clip.encode_text(text_data)
            video_features = self.clip.encode_image(video_data)
```
I want to get the features of all the words in the text, instead of the single text feature returned by `text_features = self.clip.get_text_features(**text_data)`. How can I do that?
Hi,
I have the same question as well. Have you solved it?
Sure. Locate the library file behind `from transformers import CLIPModel`. In `get_text_features` you will find the following code:
```python
text_outputs = self.text_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
pooled_output = text_outputs[1]
text_features = self.text_projection(pooled_output)
return text_features
```
Here `text_outputs[1]` is the pooled output, which is projected to give `text_features`, while `text_outputs[0]` holds the hidden state of every token. Modify the code as follows:
```python
text_outputs = self.text_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
pooled_output = text_outputs[1]   # one vector per sentence
word_features = text_outputs[0]   # one vector per token
text_features = self.text_projection(pooled_output)
word_features = self.text_projection(word_features)
return text_features, word_features
```
This gives you `word_features`, with one projected feature vector per token, alongside the usual `text_features`.
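If you would rather not edit the installed library, the same thing can probably be done from the outside by calling the model's `text_model` and `text_projection` submodules directly. The following is only a sketch under that assumption: `get_text_and_word_features` and `build_tiny_clip` are hypothetical helper names, and the tiny randomly initialised config exists only so the example runs without downloading weights. In practice you would pass the model loaded with `CLIPModel.from_pretrained(...)`.

```python
import torch
from transformers import CLIPConfig, CLIPModel


def get_text_and_word_features(clip, input_ids, attention_mask=None):
    """Return (sentence-level, token-level) projected text features.

    Hypothetical helper: mirrors the modified library code above,
    but leaves the installed transformers package untouched.
    """
    text_outputs = clip.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
    )
    word_hidden = text_outputs.last_hidden_state  # [batch, seq_len, hidden]
    pooled = text_outputs.pooler_output           # [batch, hidden]
    text_features = clip.text_projection(pooled)       # [batch, proj_dim]
    word_features = clip.text_projection(word_hidden)  # [batch, seq_len, proj_dim]
    return text_features, word_features


def build_tiny_clip():
    """Hypothetical stand-in model with random weights (no download)."""
    config = CLIPConfig(
        text_config=dict(
            vocab_size=100, hidden_size=32, intermediate_size=64,
            num_hidden_layers=1, num_attention_heads=2,
            max_position_embeddings=16,
            pad_token_id=1, bos_token_id=0, eos_token_id=99,
        ),
        vision_config=dict(
            hidden_size=32, intermediate_size=64, num_hidden_layers=1,
            num_attention_heads=2, image_size=32, patch_size=16,
        ),
        projection_dim=16,
    )
    return CLIPModel(config).eval()


clip = build_tiny_clip()
# Two fake sentences of 7 tokens each, ending with the EOS id (99).
body = torch.randint(2, 98, (2, 6))
input_ids = torch.cat([body, torch.full((2, 1), 99)], dim=1)

with torch.no_grad():
    text_features, word_features = get_text_and_word_features(clip, input_ids)
```

Note that `word_features` includes the start, end, and any padding positions, so you will likely want to mask padded tokens with `attention_mask` before pooling or matching against video frames.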