
Feature Extraction (EVA)

Open shinodalab-isaac opened this issue 6 months ago • 5 comments

Hello author,

In the code documentation, it is mentioned that you used the EVA model as the Global Encoder.

To extract rich and comprehensive emotion features, we use the HuBERT model as the Audio Encoder, the EVA model as the Global Encoder, the MAE model as the Local Encoder, and the VideoMAE model as the Temporal Encoder. In practice, to save GPU memory, we do not load all Encoders directly onto the GPU but instead load the extracted features. You can download the processed feature files through the following Google Drive link and save them to the dataset folder.

https://drive.google.com/drive/folders/1DqGSBgpRo7TuGNqMJo9BYg6smJE20MG4?usp=drive_link

But in this drive, there are no features related to the EVA model. Was it not used in the end?

Also, in the feature extraction Drive folder

https://drive.google.com/drive/folders/1DqGSBgpRo7TuGNqMJo9BYg6smJE20MG4

there is nothing related to EVA (only MAE and VideoMAE). If Emotion-LLaMA used EVA, could you give me more details on how you obtained these features? Could you clarify this?

Thank you in advance

shinodalab-isaac avatar May 15 '25 08:05 shinodalab-isaac

Hello,

Thank you for your question!

You're right to notice that there are no pre-extracted EVA features in the shared Google Drive folder. This is because the EVA model, used as the Global Encoder, is integrated directly into the Emotion-LLaMA framework—you don't need to manually extract EVA features in advance. During inference or training, the model automatically processes the first frame of the input video and feeds it into the EVA model.

You can refer to the relevant part of the code here for implementation details: https://github.com/ZebangCheng/Emotion-LLaMA/blob/35b09357075cd5ee4c804d686680288ff23f55db/minigpt4/models/eva_vit.py#L415-L442

https://github.com/ZebangCheng/Emotion-LLaMA/blob/35b09357075cd5ee4c804d686680288ff23f55db/minigpt4/models/minigpt_v2.py#L92-L116
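
For intuition, here is a minimal sketch of that flow, written by hand rather than copied from the repo: read the first frame of the raw video, preprocess it, and let the EVA visual encoder consume it inside encode_img. The helper name load_first_frame, the image size, and the normalization constants are assumptions (typical CLIP/EVA values).

```python
# Minimal sketch, not the repository's exact code: read the first frame of the
# raw video and preprocess it so it can be pushed through the EVA visual
# encoder inside encode_img(). Image size and normalization are assumptions.
import cv2
from torchvision import transforms

def load_first_frame(video_path: str, image_size: int = 448):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()                      # first frame of the uncropped video
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    preprocess = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((image_size, image_size)),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711)),
    ])
    return preprocess(frame).unsqueeze(0)       # [1, 3, image_size, image_size]

# The resulting tensor is what reaches self.visual_encoder (the EVA ViT), so no
# pre-extracted EVA features are required.
```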

Let me know if you need help locating the specific code or modifying it for your use.

ZebangCheng avatar May 15 '25 14:05 ZebangCheng

May I ask why there is no code for MAE and VideoMAE? How did you use MAE and VideoMAE?

jianghaisi avatar Jun 03 '25 02:06 jianghaisi

The specific feature extraction code is available from the following Google Drive folder: https://drive.google.com/drive/folders/1WpQBV7XQsGnLr6B7bv4kKn4suW-o8fWO?usp=sharing

Because I forgot to set a random seed when extracting the features, features extracted later by following the same steps will differ slightly from the ones we actually used. If you want to exactly reproduce the experimental results in our paper, please use the features we have already extracted: https://drive.google.com/drive/folders/1Atm7x_J4OQsBQ32vvi-c2oM3m3P07WTF?usp=sharing

If you want to use other datasets, you can follow the code to extract the corresponding features (see the seeding sketch below).
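
If you do extract features yourself, a small sketch of fixing the random seeds up front, which is exactly the step that was missed in the original extraction; the helper name set_seed and the seed value 42 are arbitrary choices for illustration.

```python
# Sketch: pin every relevant RNG before running the extraction scripts, so that
# repeated runs produce identical features.
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trade speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(42)
# ...then run the MAE / VideoMAE / HuBERT extraction code from the Drive folder.
```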

ZebangCheng avatar Jun 04 '25 08:06 ZebangCheng

Thank you very much for your answer. For MAE and VideoMAE, did you use the face images cropped by OpenFace, and for EVA did you use the first frame of the (uncropped) video?

I also have a question about encode_img:

```python
def encode_img(self, image, video_features):
    # device = 'cuda:0'
    device = image.device
    if len(image.shape) > 4:
        image = image.reshape(-1, *image.shape[-3:])
    with self.maybe_autocast():
        image_feats = self.visual_encoder(image)                        # [1, 1025, 1408]
        image_embeds = self.ln_vision(image_feats).to(device)           # [1, 1025, 1408]
        image_cls_tk = image_embeds[:, :1, :]                           # [1, 1, 1408]
        cls_tk_feats = self.cls_tk_llama_proj(image_cls_tk)             # [1, 1, 4096]
        image_embeds = image_embeds[:, 1:, :]                           # [1, 1024, 1408]
        bs, pn, hs = image_embeds.shape
        image_embeds = image_embeds.view(bs, int(pn / 4), int(hs * 4))  # [1, 256, 5632]
        image_inputs_llama = self.llama_proj(image_embeds)              # [1, 256, 4096]

        video_features = video_features.to(device)                      # [1, 3, 1024]
        video_features_split = torch.split(video_features, 1, dim=1)
        output1 = self.feats_llama_proj1(video_features_split[0].squeeze(1))
        output2 = self.feats_llama_proj2(video_features_split[1].squeeze(1))
        output3 = self.feats_llama_proj3(video_features_split[2].squeeze(1))

        video_feats = torch.stack([output1, output2, output3], dim=1)
        inputs_llama = torch.cat((image_inputs_llama, video_feats, cls_tk_feats), dim=1)  # cls_tk_feats
        # inputs_llama = torch.cat((image_inputs_llama, video_feats), dim=1)

        atts_llama = torch.ones(inputs_llama.size()[:-1], dtype=torch.long).to(image.device)
    return inputs_llama, atts_llama
```

Since the image_inputs_llama features are so large, won't the model end up learning mostly the information from the first frame of the video?
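
To make the imbalance concrete, here is the token count implied by the shapes in the code above; reading the three video_features slots as the MAE, VideoMAE, and HuBERT outputs is my assumption.

```python
# Token budget implied by the shapes in encode_img(); the interpretation of the
# three video_features slots as MAE / VideoMAE / HuBERT outputs is an assumption.
eva_patch_tokens = 256   # image_inputs_llama : [1, 256, 4096]
feature_tokens   = 3     # video_feats        : [1,   3, 4096]
cls_tokens       = 1     # cls_tk_feats       : [1,   1, 4096]

total = eva_patch_tokens + feature_tokens + cls_tokens                           # 260
print(f"EVA (first frame) share of visual tokens: {eva_patch_tokens / total:.1%}")  # ~98.5%
```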

jianghaisi avatar Jun 04 '25 08:06 jianghaisi

At first we also thought the EVA features were too large and would hurt the learning of the other features, so we fused only EVA's class token (image_cls_tk = image_embeds[:, :1, :] # [1, 1, 1408]) with the other features as input to the LLM, but the results were rather poor.

Later we kept all of the EVA features (image_inputs_llama = self.llama_proj(image_embeds) # [1, 256, 4096]) and fed them into the LLM together with the other features; this worked very well, beyond our expectations. Our view is that the original EVA features correspond to the LLM's world knowledge, while the newly added features supply professional knowledge.

Finally, in an early, simple ablation in which we kept only the EVA features and trained the model with the other features set to zero vectors, the test results were poor. So this combination gave the best results in our experiments.
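
For readers comparing these settings, here is a small sketch of the three variants as alternative concatenations over the tensors built in encode_img; the helper name fuse, the token ordering in variant (1), and whether the zero-feature ablation kept the class token are assumptions on my part.

```python
# Sketch of the three fusion variants discussed above. Only variant (2) reflects
# the released code; (1) and (3) describe the ablations, with assumed details.
import torch

def fuse(image_inputs_llama, video_feats, cls_tk_feats, variant: int = 2):
    if variant == 1:
        # (1) EVA class token only, fused with the projected features -> weaker.
        return torch.cat((cls_tk_feats, video_feats), dim=1)                       # [1, 4, 4096]
    if variant == 2:
        # (2) All EVA patch tokens + projected features + class token (used).
        return torch.cat((image_inputs_llama, video_feats, cls_tk_feats), dim=1)   # [1, 260, 4096]
    # (3) EVA only: the other features replaced by zero vectors -> weaker.
    return torch.cat((image_inputs_llama,
                      torch.zeros_like(video_feats),
                      cls_tk_feats), dim=1)                                        # [1, 260, 4096]
```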

ZebangCheng avatar Jun 06 '25 06:06 ZebangCheng