Emotion-LLaMA
Is it necessary to pre-extract features for evaluation?
Thank you for your great work on Emotion-LLaMA. I’ve been using your released codebase to fine-tune and evaluate the model, and I appreciate the clarity of your instructions.
I have a question regarding the evaluation process.
Currently, I have my own dataset consisting of .mp4 video files and a corresponding .txt file with emotion labels (e.g., video_name frame_count emotion -10). I would like to evaluate Emotion-LLaMA on this dataset.
However, when I use the provided FeatureFaceDataset, the script attempts to load pre-extracted .npy feature files (e.g., from mae_340_UTT, hubert, etc.), and throws a FileNotFoundError since my dataset only has raw videos and no extracted features.
I understand that the original evaluation uses precomputed features for consistency. But I would like to confirm:
Is it absolutely necessary to extract FaceMAE / VideoMAE / Audio features in advance in order to evaluate Emotion-LLaMA? Or is there a recommended way to evaluate the model directly on raw .mp4 videos, letting the model extract the features on the fly during evaluation?
If evaluation on raw videos is not supported in the current repo, I would greatly appreciate your suggestions or guidance on how to adapt the codebase for this purpose.
Thank you again for your excellent work and support.
Thank you for your kind words and for using Emotion-LLaMA!
To answer your question:
Yes, we strongly recommend extracting .npy features in advance for evaluation, since this setup exercises all of Emotion-LLaMA's encoders: the Global Encoder (EVA), the Local Encoder (MAE), the Temporal Encoder (VideoMAE), and the Audio Encoder (HuBERT).
However, to make things more convenient, we’ve also provided an API-based evaluation method that supports direct input of raw .mp4 video files. You can simply provide a video path and a question, and the model will return a response without needing pre-extracted features.
You can find the instructions here:
👉 https://github.com/ZebangCheng/Emotion-LLaMA/blob/main/api_en.md
Note: This API-based method uses only the Local Encoder and Audio Encoder, so its performance may be lower than the full-feature evaluation.
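For a rough idea of how such a call might look, here is a minimal sketch of querying a locally running API server with a raw video and a question. The host, port, endpoint path, and JSON field names below are placeholders for illustration only, not the repo's actual interface; please follow api_en.md for the real setup.

```python
# Hypothetical sketch of calling an Emotion-LLaMA API server with a raw .mp4 file.
# URL, port, and payload field names are assumptions -- see api_en.md for the
# actual interface exposed by the repository.
import requests

payload = {
    "video_path": "/data/my_dataset/sample_001.mp4",   # raw video, no pre-extracted .npy features
    "question": "What emotion does the person in this video express?",
}

response = requests.post("http://127.0.0.1:7889/api/predict", json=payload, timeout=120)
response.raise_for_status()
print(response.json())  # textual answer from the model, e.g. a predicted emotion with reasoning
```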
My suggestion is to use the second (API-based) method for quick testing in the early stages of your experiments. Once you're ready for a more rigorous evaluation, you can switch to extracting .npy features for the full setup.
Side note: Interestingly, in another GitHub issue, a researcher reported that the API-based method actually outperformed the full pipeline on their custom dataset—so you might find it worthwhile to try both!
Let me know if you need help adapting the code.
Thanks for the detailed explanation! That makes sense.
As a follow-up: I noticed that eval_emotion.py doesn’t seem to use audio features like HuBERT during evaluation, even though audio is supported in the API-based inference and was used during training.
Is there a reason HuBERT (or audio features in general) were omitted from eval_emotion.py? I’d like to include audio-based evaluation as well for my experiments using .npy features.
Would it be okay to modify the current script to add audio features manually into the evaluation loop, or is there a recommended way to integrate audio properly for batch evaluation?
Thanks again
Thank you for your follow-up question!
In fact, the eval_emotion.py script does include HuBERT audio features—they are part of the video_features variable. For example, in the evaluation loop:
https://github.com/ZebangCheng/Emotion-LLaMA/blob/35b09357075cd5ee4c804d686680288ff23f55db/eval_emotion.py#L73-L78
Here, video_features contains the concatenated .npy features from multiple encoders, including HuBERT for audio, as well as MAE and VideoMAE for visual information.
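For reference, the sketch below shows roughly what that concatenation amounts to for a single utterance. The mae_340_UTT and hubert directory names come from this thread; the VideoMAE folder name, the feature root, and the per-file shapes are placeholders, and the authoritative logic is the repo's FeatureFaceDataset, so treat this only as an illustration.

```python
# Illustrative sketch: loading and concatenating per-utterance .npy features.
# Only mae_340_UTT and hubert are named in this thread; the other folder name,
# paths, and shapes are assumptions -- FeatureFaceDataset is the real source.
import os
import numpy as np
import torch

FEATURE_ROOT = "/data/features"                            # assumed root of pre-extracted features
ENCODER_DIRS = ["mae_340_UTT", "videomae_UTT", "hubert"]   # FaceMAE, VideoMAE (assumed name), HuBERT

def load_video_features(video_name: str) -> torch.Tensor:
    parts = []
    for enc_dir in ENCODER_DIRS:
        path = os.path.join(FEATURE_ROOT, enc_dir, f"{video_name}.npy")
        feat = np.load(path).astype(np.float32)            # assumed utterance-level feature vector
        parts.append(torch.from_numpy(feat).flatten())
    # Flat concatenation of all encoder features for this utterance; the repo's
    # dataset/model code determines how these are split back into per-encoder tokens.
    return torch.cat(parts, dim=0)

# Example usage inside an evaluation loop:
# video_features = load_video_features("sample_00001")
```

So for batch evaluation with audio, you should not need to modify the evaluation loop itself, only make sure the HuBERT .npy files are present alongside the visual features where the dataset expects them.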