Feature Extraction (EVA)
Hello author,
In the code documentation it is mentioned that you used the EVA model as the Global Encoder:
To extract rich and comprehensive emotion features, we use the HuBERT model as the Audio Encoder, the EVA model as the Global Encoder, the MAE model as the Local Encoder, and the VideoMAE model as the Temporal Encoder. In practice, to save GPU memory, we do not load all Encoders directly onto the GPU but instead load the extracted features. You can download the processed feature files through the following Google Drive link and save them to the dataset folder.
https://drive.google.com/drive/folders/1DqGSBgpRo7TuGNqMJo9BYg6smJE20MG4?usp=drive_link
But in this Drive folder there are no features related to the EVA model. Was it not used in the end?
Also, in the feature-extraction Drive folder
https://drive.google.com/drive/folders/1DqGSBgpRo7TuGNqMJo9BYg6smJE20MG4
there is nothing related to EVA (only MAE and VideoMAE). If Emotion-LLaMA used EVA, could you give more details on how you obtained these features? Could you clarify this?
Thank you in advance
Hello,
Thank you for your question!
You're right to notice that there are no pre-extracted EVA features in the shared Google Drive folder. This is because the EVA model, used as the Global Encoder, is integrated directly into the Emotion-LLaMA framework—you don't need to manually extract EVA features in advance. During inference or training, the model automatically processes the first frame of the input video and feeds it into the EVA model.
You can refer to the relevant part of the code here for implementation details: https://github.com/ZebangCheng/Emotion-LLaMA/blob/35b09357075cd5ee4c804d686680288ff23f55db/minigpt4/models/eva_vit.py#L415-L442
https://github.com/ZebangCheng/Emotion-LLaMA/blob/35b09357075cd5ee4c804d686680288ff23f55db/minigpt4/models/minigpt_v2.py#L92-L116
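Roughly, the flow is: read the first frame of the video, preprocess it, and pass it through the EVA visual encoder inside the model's forward pass. Below is a minimal sketch of that idea; the frame reading, the 448x448 resize, and the omitted mean/std normalization are illustrative assumptions on my part, and the real implementation is in the two files linked above.

import cv2
import torch

# Illustrative sketch (not the exact repo code): grab the first frame of a video
# and run it through the EVA visual encoder, the same way encode_img() does.
cap = cv2.VideoCapture("example_video.mp4")
ok, frame = cap.read()                               # first frame, BGR uint8, H x W x 3
cap.release()
assert ok, "failed to read a frame"

frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
frame = cv2.resize(frame, (448, 448))                # resolution implied by the [1, 1025, 1408] shape
image = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
image = image.unsqueeze(0)                           # [1, 3, 448, 448]; normalization omitted for brevity

with torch.no_grad():
    image_feats = model.visual_encoder(image)        # [1, 1025, 1408] EVA patch + CLS tokens
    image_embeds = model.ln_vision(image_feats)      # later projected by llama_proj / cls_tk_llama_proj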
Let me know if you need help locating the specific code or modifying it for your use.
May I ask why there is no code for MAE and VideoMAE? How did you use MAE and VideoMAE?
The specific feature-extraction code is available from the following Google Drive folder: https://drive.google.com/drive/folders/1WpQBV7XQsGnLr6B7bv4kKn4suW-o8fWO?usp=sharing
Because I forgot to set a random seed when extracting the features, features extracted later by following the same steps will differ slightly from the ones we actually used. If you want to exactly reproduce the experimental results in the paper, please use our pre-extracted features: https://drive.google.com/drive/folders/1Atm7x_J4OQsBQ32vvi-c2oM3m3P07WTF?usp=sharing
If you want to apply this to other datasets, you can follow the code to extract the corresponding features.
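For reference, if you re-run the extraction code on a new dataset, fixing the random seeds before extraction keeps the features reproducible. A minimal sketch (where exactly to call it within the extraction scripts is up to you):

import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Fix all relevant seeds so repeated feature extraction is deterministic.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # may slightly slow extraction
    torch.backends.cudnn.benchmark = False

set_seed(42)   # call once before running the MAE / VideoMAE feature extraction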
Thank you very much for your reply.
For MAE and VideoMAE, did you use the face images cropped by OpenFace, while EVA used the first frame of the uncropped video?
I also have a question about the following code:
def encode_img(self, image, video_features):
    # device = 'cuda:0'
    device = image.device

    if len(image.shape) > 4:
        image = image.reshape(-1, *image.shape[-3:])

    with self.maybe_autocast():
        image_feats = self.visual_encoder(image)                # [1, 1025, 1408]
        image_embeds = self.ln_vision(image_feats).to(device)   # [1, 1025, 1408]

        image_cls_tk = image_embeds[:, :1, :]                   # [1, 1, 1408]
        cls_tk_feats = self.cls_tk_llama_proj(image_cls_tk)     # [1, 1, 4096]

        image_embeds = image_embeds[:, 1:, :]                   # [1, 1024, 1408]
        bs, pn, hs = image_embeds.shape
        image_embeds = image_embeds.view(bs, int(pn / 4), int(hs * 4))  # [1, 256, 5632]
        image_inputs_llama = self.llama_proj(image_embeds)      # [1, 256, 4096]

        video_features = video_features.to(device)              # [1, 3, 1024]
        video_features_split = torch.split(video_features, 1, dim=1)
        output1 = self.feats_llama_proj1(video_features_split[0].squeeze(1))
        output2 = self.feats_llama_proj2(video_features_split[1].squeeze(1))
        output3 = self.feats_llama_proj3(video_features_split[2].squeeze(1))
        video_feats = torch.stack([output1, output2, output3], dim=1)

        inputs_llama = torch.cat((image_inputs_llama, video_feats, cls_tk_feats), dim=1)  # cls_tk_feats
        # inputs_llama = torch.cat((image_inputs_llama, video_feats), dim=1)

        atts_llama = torch.ones(inputs_llama.size()[:-1], dtype=torch.long).to(image.device)
    return inputs_llama, atts_llama
The image_inputs_llama features are much larger than the others (256 tokens, compared with 3 video feature tokens and 1 class token). Wouldn't this cause the model to learn mostly the information from the first frame of the video?
We initially also thought the EVA features were too large and would interfere with learning the other features, so at first we fused only EVA's class token (image_cls_tk = image_embeds[:, :1, :]  # [1, 1, 1408]) with the other features as input to the LLM, but the results were poor. We then kept all of the EVA features (image_inputs_llama = self.llama_proj(image_embeds)  # [1, 256, 4096]) and fed them into the LLM together with the other features; this worked very well, exceeding our expectations. Our interpretation is that the original EVA features correspond to the LLM's world knowledge, while the newly added features provide professional (emotion-specific) knowledge.

Finally, in an early, simple ablation we trained the model with only the EVA features and set the other features to zero vectors; the test results were poor. So this combination gave us the best experimental results.
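For reference, the zero-vector ablation mentioned above corresponds to a small change inside encode_img. The sketch below is only an illustration of that setting, not our exact training script:

# Ablation sketch: keep only the EVA features and zero out the projected
# multimodal feature stream before concatenation ("EVA only" setting).
video_feats = torch.zeros_like(video_feats)   # [1, 3, 4096] -> zero vectors
inputs_llama = torch.cat((image_inputs_llama, video_feats, cls_tk_feats), dim=1)  # [1, 260, 4096]

# The earlier class-token-only variant instead dropped the 256 patch tokens,
# keeping roughly: torch.cat((cls_tk_feats, video_feats), dim=1)  # [1, 4, 4096]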