
Feature Extraction for Training with a Personal Dataset

Open 00dbgpdnjs opened this issue 8 months ago • 31 comments

#48

You can follow the tutorial to pre-extract the features and then use the test script to run inference with the full Emotion-LLaMA model.

Our next goal is to develop a more advanced end-to-end model that can be applied directly. If we make progress, we will create a new repository to share the updates.


I would like to apply my own personal dataset for training (fine-tuning).

Where can I find the tutorial mentioned in "You can follow the tutorial to pre-extract the features"?

"Our next goal is to develop a more advanced end-to-end model that can be applied directly. If we make progress, we will create a new repository to share the updates." -> How is this part going?

00dbgpdnjs avatar Apr 11 '25 02:04 00dbgpdnjs

Apologies, progress has been slower than anticipated. Recently, our research efforts have focused on integrating with related projects within our research group. We experimented with several approaches, but the results were suboptimal.

However, we've initiated some new strategies and anticipate releasing the updated work by the end of May.

ZebangCheng avatar Apr 11 '25 07:04 ZebangCheng

Thank you for your reply.

When you said, "You can follow the tutorial to pre-extract the features," (#48 ) did you mean that there is no tutorial available?

Also, I am trying to evaluate my dataset and then run train.py to fine-tune the already fine-tuned model on my own dataset. I'm quite concerned because I need to extract the features first. It seems like the explanation about feature extraction is scattered across different places. I'm planning to proceed step by step for now, but could you possibly provide any additional advice on this part?

00dbgpdnjs avatar Apr 11 '25 13:04 00dbgpdnjs

I will refer to issue #32.

  1. I’m trying to create my dataset in the same format as MERR for evaluation and subsequent fine-tuning.
  2. In this process, how and at which stage should OpenFace be applied?
  3. Is the OpenFace code used in the project released?
  4. Should AUs be extracted from every frame?
  5. And after accumulating the AU values from all frames, is it correct that only a single peak frame is selected and the rest of the frames become unnecessary?
  6. Also, should I choose the top two AUs from the peak frame to create annotations similar to those in MERR?
  7. I have a question regarding the demo as well: would using a middle frame instead of the first frame result in higher accuracy?

00dbgpdnjs avatar Apr 12 '25 06:04 00dbgpdnjs

Thank you for your reply.

When you said, "You can follow the tutorial to pre-extract the features," (#48 ) did you mean that there is no tutorial available?

Also, I am trying to evaluate my dataset and then run train.py to fine-tune the already fine-tuned model on my own dataset. I'm quite concerned because I need to extract the features first. It seems like the explanation about feature extraction is scattered across different places. I'm planning to proceed step by step for now, but could you possibly provide any additional advice on this part?

You can refer to the link to extract features:

https://drive.google.com/drive/folders/1WpQBV7XQsGnLr6B7bv4kKn4suW-o8fWO?usp=sharing

ZebangCheng avatar Apr 14 '25 06:04 ZebangCheng

I will refer to issue #32.

  1. I’m trying to create my dataset in the same format as MERR for evaluation and subsequent fine-tuning.
  2. In this process, how and at which stage should OpenFace be applied?
  3. Is the OpenFace code used in the project released?
  4. Should AUs be extracted from every frame?
  5. And after accumulating the AU values from all frames, is it correct that only a single peak frame is selected and the rest of the frames become unnecessary?
  6. Also, should I choose the top two AUs from the peak frame to create annotations similar to those in MERR?
  7. I have a question regarding the demo as well: would using a middle frame instead of the first frame result in higher accuracy?

OpenFace should be used during the data preprocessing stage. Specifically, we first apply OpenFace to extract facial regions and Action Units (AUs) from the video, which are then used to guide both the extraction of visual features and the generation of AU-based visual label descriptions.

The code for using OpenFace is adapted from the MER2023 baseline GitHub repository. In our pipeline, AUs are extracted from every frame. After collecting AU values across all frames, we select a single peak frame—i.e., the one with the highest overall AU activation—for generating facial expression descriptions. However, the remaining frames are not discarded; they are retained for extracting visual features using models such as MAE and VideoMAE.

As for selecting the top two AUs from the peak frame, this approach can be problematic. For example, during normal speech, mouth-related AUs may dominate due to movement, which could obscure more nuanced emotional cues. To mitigate this, we suggest using emotion-specific AU attention—e.g., for the “happy” category, we focus on AUs like “AU06”, “AU12”, and “AU14”. Alternatively, AU values can be normalized to reduce bias from dominant facial motions.

Regarding the demo: I believe using a middle frame instead of the first frame can indeed improve performance. We've observed that the beginning of many videos often contains transitions or blank frames where no face is visible. If a lightweight keyframe detection algorithm is available, it could further enhance the model’s robustness and accuracy.
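
To make the peak-frame and AU selection above concrete, here is a minimal sketch, assuming OpenFace's per-frame CSV output with AU intensity columns such as "AU06_r" (depending on the OpenFace version, column names may carry a leading space); it is illustrative only, not the pipeline's actual code:

import pandas as pd

def select_peak_frame(openface_csv):
    # Load the per-frame OpenFace output and normalize the column names.
    df = pd.read_csv(openface_csv)
    df.columns = [c.strip() for c in df.columns]

    # Sum the AU intensity columns (AU01_r, AU06_r, ...) for each frame.
    au_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
    activation = df[au_cols].sum(axis=1)

    # The peak frame is the one with the highest overall AU activation;
    # the two strongest AUs at that frame can then seed the description.
    peak_idx = activation.idxmax()
    top_aus = df.loc[peak_idx, au_cols].sort_values(ascending=False).head(2)
    return int(df.loc[peak_idx, "frame"]), top_aus

Emotion-specific AU attention or per-AU normalization, as suggested above, would amount to weighting or rescaling the AU columns before the summation.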

ZebangCheng avatar Apr 14 '25 06:04 ZebangCheng

Thank you so much!

I have a question regarding extract_mae_embedding.py in Google Drive.

Inside the openface_face folder (e.g., /home/amax/big_space/datasets/DFEW/dataset-process/openface_face), should I store all frames for each subdirectory (e.g., for sample_00005561.mp4)?

openface_face/
├── sample_00005561.mp4/
│   ├── frame_0001.jpg
│   ├── frame_0002.jpg
│   ├── ...
├── sample_00005562.mp4/
│   ├── frame_0001.jpg
│   ├── ...

00dbgpdnjs avatar Apr 14 '25 09:04 00dbgpdnjs

Yes, that's correct. In the initial preprocessing stage, we need to retain all the extracted face frames from each video to perform feature extraction.
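
If it helps, collecting the frames per video before feature extraction can be as simple as the following sketch (paths and extensions are placeholders for your own layout):

import os

def collect_face_frames(openface_face_dir):
    # Map each video subfolder (e.g., sample_00005561.mp4/) to its sorted frame paths.
    frames_per_video = {}
    for video_name in sorted(os.listdir(openface_face_dir)):
        video_dir = os.path.join(openface_face_dir, video_name)
        if not os.path.isdir(video_dir):
            continue
        frames_per_video[video_name] = sorted(
            os.path.join(video_dir, f)
            for f in os.listdir(video_dir)
            if f.lower().endswith((".jpg", ".png", ".bmp"))
        )
    return frames_per_video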

ZebangCheng avatar Apr 14 '25 13:04 ZebangCheng

Thanks for your quick response!

  1. I’m planning to perform additional training based on your final model, which has already been fine-tuned (second-stage training). In this case, should I follow the second-stage training process again using Emotion_LLaMA.pth?

  2. Since I intend to use the model for classification (emotion recognition), would it be better to only perform classification training?

  3. If I'm only training for classification, do I only need emotion labels and not annotations?

  4. There are several .pth files available — which one would you recommend for my purpose? (e.g., MER2024-best.pth, Emotion_LLaMA.pth, checkpoint_best.pth, etc.)

  5. What’s the difference between Emotion_LLaMA.pth and checkpoint_best.pth?

  6. Is it absolutely necessary to use the MER2023 baseline for data preprocessing? Setting up CUDA 10.2 seems quite complicated. Instead, I’m considering using the Linux version of OpenFace to extract AUs for annotation, and to crop faces from every frame of my dataset videos, saving them into the openface_face/ directory. For audio feature extraction, I plan to follow the demo...

00dbgpdnjs avatar Apr 15 '25 01:04 00dbgpdnjs

If your goal is emotion classification (emotion recognition), I would recommend focusing directly on training for classification. Additionally, it's a good idea to experiment with different checkpoints — such as Emotion_LLaMA.pth, MER2023_best.pth, and MiniGPT-v2.pth — on your target dataset to determine which model performs best for your specific scenario.

To clarify:

  • Emotion_LLaMA.pth is a checkpoint obtained from multitask training, designed to handle both emotion reasoning and recognition. It generally offers strong generalization capabilities across tasks.
  • In contrast, the *_best.pth models (e.g., MER2023_best.pth) are specifically optimized for classification and tend to perform best on their respective datasets.

Regarding data preprocessing, you’re not required to follow the MER2023 baseline pipeline strictly. If you're using datasets like MER2023 or MER2024, we recommend following the official preprocessing pipeline for consistency. However, for other datasets, it's generally better to extract features using the encoder that works best on that dataset — this often results in improved performance.

In my case, I usually use the Windows version of OpenFace for facial feature extraction — it's simple and convenient. The Linux version can be a bit more challenging to deploy. Fortunately, one of our peer researchers has addressed this and released a multi-threaded OpenFace extraction script. You may find the following resources helpful:

# Some notes and setup tips:
https://github.com/KTTRCDL/MEIJU2025?tab=readme-ov-file#some-tips

# Example shell command for extraction:
https://github.com/KTTRCDL/MEIJU2025/blob/main/script/FeatureExtract/step2_visual.sh

# Multi-process implementation (Python):
https://github.com/KTTRCDL/MEIJU2025/blob/main/feature_extraction/visual/extract_openface_ubuntu_multiprocess.py
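
If those scripts don't fit your setup, the core idea is simply to launch one OpenFace process per video in parallel. A minimal sketch, assuming the FeatureExtraction binary is already built and reachable (paths and the worker count are placeholders):

import os
import subprocess
from multiprocessing import Pool

VIDEO_DIR = "./dataset-process/video"     # hypothetical input folder
OUT_DIR = "./dataset-process/openface"    # hypothetical output folder
OPENFACE_BIN = "FeatureExtraction"        # path to the OpenFace binary on your system

def run_openface(video_name):
    # One OpenFace process per video; the CSV and aligned faces go to OUT_DIR.
    video_path = os.path.join(VIDEO_DIR, video_name)
    subprocess.run([OPENFACE_BIN, "-f", video_path, "-out_dir", OUT_DIR], check=True)

if __name__ == "__main__":
    videos = [v for v in os.listdir(VIDEO_DIR) if v.endswith(".mp4")]
    with Pool(processes=4) as pool:  # number of parallel OpenFace processes
        pool.map(run_openface, videos)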

Hope this helps! Feel free to reach out if you have further questions.

ZebangCheng avatar Apr 15 '25 07:04 ZebangCheng

Thank you for your reply.

I have a question that may have been missed.

If I only want to fine-tune for a recognition task, then I wouldn’t need to extract AUs or descriptions — in that case, is the annotation JSON file (e.g., MERR_fine_grained.json) unnecessary, and only the TXT file (e.g., MERR_fine_grained.txt) required?

00dbgpdnjs avatar Apr 16 '25 01:04 00dbgpdnjs

Yes, you're absolutely right.

If you're only fine-tuning for the recognition task, you don't need the annotation JSON file (e.g., MERR_fine_grained.json). Only the TXT file containing the labels (e.g., MERR_fine_grained.txt) is required.

Just make sure to comment out or skip any code that tries to load MERR_fine_grained.json to avoid potential errors during training.
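
As an illustration, the load can be guarded instead of deleted (a hypothetical helper, not the repo's exact dataset code; json_path stands in for the path to MERR_fine_grained.json):

import json
import os

def load_captions(json_path):
    # Recognition-only fine-tuning: return an empty dict when the
    # description JSON (e.g., MERR_fine_grained.json) is not provided.
    if json_path and os.path.exists(json_path):
        with open(json_path, "r") as f:
            return json.load(f)
    return {}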

ZebangCheng avatar Apr 16 '25 03:04 ZebangCheng

  1. Is the 'V' ever used in the NCEV txt files? If not, may I ask why the 'V' was created?
  2. In issue #49, it was mentioned that 16 frames are sampled for both the local and video encoders. However, while the _get_test_indices method is used in the video encoder, it doesn't seem to be used in the local encoder.
  3. Did you keep the OpenFace default output format as BMP, or did you convert it to JPG?
  4. When extracting faces using OpenFace, did you exclude any videos that fall under the following exceptions?
    • Only the eyes and nose are detected (i.e., the mouth is excluded), or only half of the face is captured.
    • Not only the face but also part of the background is slightly included.
    • The extracted BMPs from a single video are rotated at inconsistent angles. For example, in a video where a person is standing upright, some BMP frames appear rotated by around 90 to 100 degrees, as if the original video were tilted or lying on its side.
    • Black frames are saved due to face detection failure.
  5. Face extraction: Did you modify any options such as -scale 2.0 to enlarge the frames, or did you use the baseline code as is?

00dbgpdnjs avatar Apr 16 '25 09:04 00dbgpdnjs

  1. The MER2023 dataset includes three tasks: MER-Multi, MER-Noise, and MER-Semi. The first two tasks utilize the sentiment value (V), whereas MER-Semi does not. However, to maintain a consistent file format across all tasks, we assigned a default value of -10 for V in MER-Semi, which carries no actual meaning.

  2. Both the local and video encoders use 16 sampled frames during training, but in slightly different ways. For the local encoder, one frame is randomly selected from the 16 during training. When extracting features, however, all 16 frames are used and their features are averaged (see the sketch after this list).

  3. We kept the default BMP output format when using OpenFace for face extraction.

  4. We did not exclude any videos based on the exceptions mentioned. All extracted frames were retained as-is.

  5. We did not modify any OpenFace parameters during face extraction. The process was run entirely with the default settings.
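
To make point 2 concrete, here is a small sketch of the feature-extraction side (assuming the per-frame embeddings from the local encoder are stacked into a single tensor; this is illustrative, not the repo's exact code):

import torch

def utterance_feature(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: [16, D], one embedding per sampled frame.
    # For feature extraction, the 16 frame embeddings are averaged into one
    # utterance-level vector; during training, a single frame is instead
    # selected at random.
    return frame_features.mean(dim=0)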

ZebangCheng avatar Apr 18 '25 13:04 ZebangCheng

Q1) You mentioned that local features are averaged across 16 frames. Does that mean the averaged features are not used during training? Wasn't the purpose of averaging the features from 16 frames to use them for training?

Q2) You mentioned that some of the face extractions were incorrect. Could you clarify approximately what percentage of the extracted faces were incorrect?

Q3) When I evaluate my dataset (only recognition) using eval_emotion.py (worse) instead of app_EmotionLlamaClient.py (better), the performance drops by over 10%. Is there anything wrong with the feature extraction or the evaluation process in eval_emotion.py?

*app_EmotionLlamaClient.py was converted into a test script

  1. audio
     a. python main-baseline.py split_audio_from_video_16k './dataset-process/video' './dataset-process/audio'
     b. python -u extract_transformers_embedding.py --dataset='mydata' --feature_level='UTTERANCE' --model_name='chinese-hubert-large' --gpu=0

  2. openface
     python extract_openface.py --dataset=mydata --type=videoOne

  3. MAE
     python -u extract_mae_embedding.py --dataset='mydata' --feature_level='UTTERANCE' --device='cuda:0' --pretrain_model='mae_checkpoint-340' --feature_name='mae_checkpoint-340'

     terminal:

     (llama) (base) root@f0f71d668f7e:/workspace/Emotion-LLaMA# /opt/conda/envs/llama/bin/python /workspace/Emotion-LLaMA/preprocess/feature_extract/extract_mae_embedding.py
    ==> Extracting mae embedding...
    Load pre-trained checkpoint from: /Dataset/Emotion-LLaMA/feature_extract/models/mae_checkpoint-340.pth
    ['head.weight', 'head.bias']
    Find total "126" videos.
    Processing video '2025-03-26_17-18-27_s12' (1/126)...
    embedding shape:  torch.Size([68, 1024])
    Processing video '2025-03-26_15-03-05_s05' (2/126)...
    embedding shape:  torch.Size([106, 1024])
    
  4. maeVideo
     python -u extract_maeVideo_embedding.py --dataset='mydata' --feature_level='UTTERANCE' --device='cuda:0' --pretrain_model='maeVideo_ckp199' --feature_name='maeVideo'

    terminal :

    Processing video ' (126/126)...
    embedding : torch.Size([1, 1568, 1024])
    csv_file:  /Dataset/Emotion-LLaMA/indj/dataset-process/features_all_no_noise/maeV_399_UTT/20250326_183816_s18.npy
    embedding:  [-0.18795264 -0.39050153 -0.23210019 ...  1.2052747   0.8692431
      0.1615461 ]
    
  5. eval_emotion.py
     a. Select one prompt in self.emotion_instruction_pool that is the same as the question used by app_EmotionLlamaClient.py
     b. torchrun --nproc_per_node 1 eval_emotion.py --cfg-path eval_configs/eval_emotion.yaml --dataset feature_face_caption

*Both app_EmotionLlamaClient.py and eval_emotion.py used the same pth (stage2/checkpoint_best.pth)

terminal :

(llama) (base) root@f0f71d668f7e:/workspace/Emotion-LLaMA# torchrun --nproc_per_node 1 eval_emotion.py --cfg-path eval_configs/eval_emotion.yaml --dataset feature_face_caption

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
cfg: Namespace(cfg_path='eval_configs/eval_emotion.yaml', name='A2', ckpt=None, eval_opt='all', max_new_tokens=10, batch_size=32, lora_r=64, lora_alpha=16, options=None, dataset=['feature_face_caption'], res=100.0, resample=False)
Initialization Model
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.01s/it]
loraconfig: LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, base_model_name_or_path=None, task_type='CAUSAL_LM', inference_mode=False, r=64, target_modules=['q_proj', 'v_proj'], lora_alpha=16, lora_dropout=0.05, merge_weights=False, fan_in_fan_out=False, enable_lora=None, bias='none', modules_to_save=None)
trainable params: 33554432 || all params: 6771970048 || trainable%: 0.49548996469513035
Position interpolate from 16x16 to 32x32
Load Minigpt-4-LLM Checkpoint: /Dataset/Emotion-LLaMA/checkpoints/save_checkpoint/stage2/checkpoint_best.pth
Initialization Finished
['feature_face_caption']
/Dataset/Emotion-LLaMA/dataset-process/features_top_suc/top_suc_NCE.txt
/Dataset/driver_MER/cut_videos
1
500
self.emotion_instruction_pool : ['Please determine which emotion label in the video represents: happy, sad, neutral, angry, fear, surprise.']
ann_path:  /Dataset/Emotion-LLaMA/dataset-process/features_top_suc/top_suc_NCE.txt
video number:126
{'image': tensor([[[-0.3616, -0.0988,  0.2077,  ..., -0.3178, -0.3178, -0.3032],
         [ 0.2077,  0.2223,  0.2223,  ..., -0.3178, -0.3324, -0.3178],
         [ 0.2369,  0.1201,  0.0471,  ..., -0.3178, -0.3324, -0.3324],
         ...,
         [-1.7193, -1.7339, -1.7339,  ..., -1.2667, -1.3105, -1.2667],
         [-1.7193, -1.7339, -1.7339,  ..., -1.3105, -1.3251, -1.2959],
         [-1.7193, -1.7339, -1.7339,  ..., -1.3397, -1.3397, -1.3105]],

        [[-0.2063,  0.0638,  0.3790,  ..., -0.2363, -0.2363, -0.2213],
         [ 0.3790,  0.3940,  0.3940,  ..., -0.2363, -0.2513, -0.2363],
         [ 0.4090,  0.2890,  0.2139,  ..., -0.2363, -0.2513, -0.2513],
         ...,
         [-1.6170, -1.6320, -1.6320,  ..., -1.1368, -1.1818, -1.1368],
         [-1.6170, -1.6320, -1.6320,  ..., -1.1818, -1.1968, -1.1668],
         [-1.6170, -1.6320, -1.6320,  ..., -1.2118, -1.2118, -1.1818]],

        [[-0.0298,  0.2262,  0.5248,  ..., -0.0867, -0.0867, -0.0724],
         [ 0.5248,  0.5390,  0.5390,  ..., -0.0867, -0.1009, -0.0867],
         [ 0.5532,  0.4395,  0.3684,  ..., -0.0867, -0.1009, -0.1009],
         ...,
         [-1.2954, -1.3096, -1.3096,  ..., -0.8119, -0.8545, -0.8119],
         [-1.2954, -1.3096, -1.3096,  ..., -0.8545, -0.8688, -0.8403],
         [-1.2954, -1.3096, -1.3096,  ..., -0.8830, -0.8830, -0.8545]]]), 'video_features': tensor([[ 0.0627, -0.2266, -0.2087,  ..., -0.2067, -0.2589, -0.4053],
        [ 0.2806, -0.0660, -0.8707,  ...,  0.6138, -0.8159, -0.3015],
        [-0.0739,  0.0236,  0.0413,  ..., -0.0317,  0.0624, -0.0318]]), 'instruction_input': "<video><VideoHere></video> <feature><FeatureHere></feature> The person in video says: I guess I have to head to work early again today. Hope the traffic isn't too bad..  [emotion] Please determine which emotion label in the video represents: happy, sad, neutral, angry, fear, surprise. ", 'answer': 'neutral', 'emotion': 0, 'image_id': '2025-03-26_13-18-55_s01'}
Accuracy: 0.6428571428571429
Precision: 0.731082251082251
Recall: 0.6428571428571429
F1 Score: 0.6189390067297044
[[24  1  0  1  1  0]
 [ 0 14  0  2  2  0]
 [ 5  0 12 10  0  0]
 [ 2  2  0 23  0  0]
 [ 2  0  0  9  7  0]
 [ 0  8  0  0  0  1]]

00dbgpdnjs avatar Apr 18 '25 15:04 00dbgpdnjs

A1) Yes, the averaged features are not used during the MAE training. Only individual frame features are used for training purposes.

A2) We did not quantitatively evaluate or record the number or proportion of incorrectly extracted face frames by OpenFace.

A3) In theory, the performance should be consistent. Could you try comparing the extracted features from both evaluation scripts? For example, run both evaluations on the same sample and compare the video_features outputs to see if they are identical.
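
For instance, if both scripts dump the tensors for the same sample, a quick check could look like this (the save calls and file names are hypothetical):

import numpy as np
import torch

# Added to each script right before inference on the same sample:
#   torch.save(video_features.cpu(), "features_eval.pt")    # in eval_emotion.py
#   torch.save(video_features.cpu(), "features_client.pt")  # in app_EmotionLlamaClient.py

a = torch.load("features_eval.pt").numpy()
b = torch.load("features_client.pt").numpy()
print("shapes:", a.shape, b.shape)
print("identical:", np.allclose(a, b, atol=1e-5))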

Finally, I’m truly impressed by your code. Your skills are remarkable—you’ve managed to streamline the video sample testing process in app_EmotionLlamaClient.py. I hope we’ll have more opportunities to exchange ideas in the future.

ZebangCheng avatar Apr 25 '25 12:04 ZebangCheng

Thank you for your response.

Following up on Q3: Since app_EmotionLlamaClient.py and eval_emotion.py have different inference processes, does it make sense to compare the video_features results?

00dbgpdnjs avatar Apr 29 '25 01:04 00dbgpdnjs

You are absolutely right. The inference processes in app_EmotionLlamaClient.py and eval_emotion.py are indeed different.
For simplicity, the inference code in app_EmotionLlamaClient.py does not use video features or local features.
In contrast, eval_emotion.py incorporates more features, and theoretically, it should achieve better performance.
However, possibly due to differences in data distribution, eval_emotion.py (which uses more features) can actually perform more than 10% worse than app_EmotionLlamaClient.py (which uses fewer features), as you observed.

ZebangCheng avatar Apr 29 '25 03:04 ZebangCheng

Q1) Which part should I modify to perform validation during training?

Q2) (https://github.com/ZebangCheng/Emotion-LLaMA/blob/main/minigpt4/datasets/datasets/first_face.py#L153) character_line = "The person in video says: {}. ".format(sentence)

If you add a period at the end of character_line, haven’t you encountered cases where instruction_input ends with two periods, like "sending home.." ?

{'image': tensor([[[-1.4565, -1.4273, -1.4273,  ...,  0.3245,  0.3245,  0.4413],
         ...,
         [ 0.3257,  0.3399,  0.3399,  ..., -0.4564, -0.4848, -0.5133]]]), 'video_features': tensor([[-0.1237, -0.1887,  0.0262,  ...,  0.0556, -0.3504, -0.2844],
        [ 0.1294,  0.3599, -0.6022,  ...,  1.3981,  0.3577,  0.6850],
        [-0.0451,  0.0526, -0.0098,  ...,  0.0083,  0.0270, -0.0096]]), 
'instruction_input': '<video><VideoHere></video> <feature><FeatureHere></feature> The person in video says: Work study is set at a maximum of 2100 dollars, so it is not a vehicle that should be used to manage the family contribution really nor is it meant for students to be sending home..  [emotion] Please determine which emotion label in the video represents: happy, sad, neutral, angry, fear, surprise. ', 'answer': 'neutral', 'emotion': 0, 'image_id': 'ssqiwP19JhM_92_9_98_5'}

00dbgpdnjs avatar Apr 30 '25 10:04 00dbgpdnjs

A1) I'm not very familiar with performing validation during training, as I usually monitor the loss values instead. However, you can refer to the following part of the code for relevant logic:

https://github.com/ZebangCheng/Emotion-LLaMA/blob/35b09357075cd5ee4c804d686680288ff23f55db/minigpt4/tasks/image_text_pretrain.py#L12-L19

A2) Yes, cases with double periods like "sending home.." do occur, but I haven't paid much attention to this issue. In my experience, the potential problems caused by missing punctuation are more severe than having two periods. That’s why I chose to always append a period at the end. Of course, you can easily write a helper function to handle this more gracefully. Here's an example:

def ensure_period(sentence):
    # Append a period only when the sentence does not already end with terminal punctuation.
    if sentence and sentence[-1] not in ".!?":
        return sentence + "."
    return sentence

This function checks whether the sentence ends with a proper punctuation mark and adds a period if not.
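
For instance, it could be applied at the line quoted above (illustrative only, not the repo's current code):

character_line = "The person in video says: {} ".format(ensure_period(sentence))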

ZebangCheng avatar May 06 '25 11:05 ZebangCheng

It seems that checking the loss value on the validation dataset would make it easier to determine whether the training is going well. Isn't that related to the code below?

config.py

validator.add_argument(
    "valid_splits",
    type=list,
    help="Splits to use for validation. If not provided, will skip the validation.",
)

00dbgpdnjs avatar May 06 '25 14:05 00dbgpdnjs

You're absolutely right — monitoring the loss on the validation set is indeed much more informative than just looking at the training loss. However, for the datasets we focus on, such as MER2023 and DFEW, there are no dedicated validation splits. These datasets are typically used with k-fold cross-validation.

Since training large models requires a significant amount of data, manually setting aside a portion of the data for validation may negatively impact overall training performance, in my opinion.

I'm also not very familiar with the code related to the validation set.

ZebangCheng avatar May 07 '25 01:05 ZebangCheng

Q1) When training both the reasoning and recognition tasks simultaneously, did you choose to use all five prompts per task because the performance was higher compared to using only one prompt per task?

ex) “Please determine which emotion label in the video represents: happy, sad, neutral, angry, worried, surprise, fear, contempt, doubt.”

Q2) I expected that using classification-only models like mer2024_best.pth or checkpoint_best.pth would yield better results when testing on my emotion recognition dataset. However, I found that Emotion_LLaMA.pth, which was trained on both classification and reasoning tasks, actually performed better. As I understand, Emotion_LLaMA.pth is the fine-tuned result mentioned in the paper. Could you clarify exactly which datasets were used to fine-tune Emotion_LLaMA.pth? I guess mer2024-semi, mer2024-noise, all mer2024 unlabeled data, DFEW, etc.

Q3) Why did you train both recognition and reasoning tasks?

Q4) You mentioned that the models ending with "best" (e.g., MER2023_best.pth, checkpoint_best.pth, MER2024-best) were trained only on the emotion recognition task. In that case, wouldn’t stage 2 training be impossible with those models? Why did you proceed with stage 2 training then?

Q5) Why did you create your own MERR dataset through a custom process, even though a reason CSV file was already provided with the dataset?

Q6) It also includes the 2025 data, right? I honestly didn't expect there to be any new data added for 2025: https://drive.google.com/drive/folders/1DqGSBgpRo7TuGNqMJo9BYg6smJE20MG4

thank you

00dbgpdnjs avatar May 12 '25 09:05 00dbgpdnjs

A1) When aiming to improve the model's generalization ability, we train both the reasoning and recognition tasks simultaneously and utilize five diverse prompts per task. This helps the model handle varied phrasings and instructions. However, when the goal is to optimize performance on a specific dataset (e.g., MER2023-SEMI), we focus solely on the recognition task and use only one prompt to avoid overfitting to specific formulations.

A2) You're absolutely right—Emotion_LLaMA.pth was fine-tuned jointly on classification and reasoning tasks, and its superior performance on your unseen dataset is expected. This model demonstrates better generalization compared to checkpoint_best.pth or mer2024_best.pth, which are more specialized and may be overfitted to MER2023 and MER2024 datasets, respectively.

A3) We trained the model on both recognition and reasoning tasks to enhance generalization and support multimodal understanding. Emotion is complex and multimodal in nature, so reasoning about the cause of emotion helps the model learn richer representations and make more accurate predictions across diverse scenarios.

A4) It seems there might be a small misunderstanding. When we aim for generalization, both Stage 1 and Stage 2 involve simultaneous training of recognition and reasoning tasks. However, when optimizing for task-specific performance on a known dataset, both Stage 1 and Stage 2 are used only for recognition. The difference between Stage 1 and Stage 2 lies in the data granularity—Stage 2 uses higher-quality and more refined samples.

A5) If by "a reason CSV" you mean an existing file with pre-defined explanations: during our research, no instruction-style multimodal emotion reasoning dataset existed. We are the first to propose automatic annotation of unlabeled videos through a structured pipeline combining OpenFace, MiniGPT-v2, Qwen-Audio, and LLaMA-3. Our MERR dataset was constructed to fill this gap and has since been cited and compared in subsequent work.

A6) This question is a bit unclear, but I’ll try to address it: our released features cover the MER2023 and MER2024 datasets. MER2025 is a newly organized competition that includes selected samples from MER2023 and MER2024, but we are not releasing new video features specifically for MER2025.

ZebangCheng avatar May 14 '25 11:05 ZebangCheng

Q1) I downloaded the MER2025 dataset, but I noticed that the video filenames listed in MERR_coarse_grained and MERR_fine_grained do not match any of the video files in the mer2025 video directory. Do you know why there is no overlap?

Q2) When changing --nproc_per_node=4 to --nproc_per_node=8 to use 8 GPUs during training, are there any parameters that need to be adjusted? For example, iters_per_epoch or batch_size.

00dbgpdnjs avatar May 16 '25 01:05 00dbgpdnjs

A1) Sorry, I'm not sure about the exact reason. It's possible that the MER2025 dataset has adopted a new video naming convention, which makes it difficult to directly match the videos with those from MER2023 or the ones referenced in the MERR dataset.

A2) There are no other parameters that require special adjustments when changing --nproc_per_node from 4 to 8. Generally, increasing iters_per_epoch or batch_size is beneficial, but it's recommended to start with the default settings and make adjustments later based on your specific training performance.

ZebangCheng avatar May 16 '25 08:05 ZebangCheng

In the paper, it states that "The tuning process is extended to diverse sources, including MER2023 [59] and DFEW" as part of the Multimodal Instruction Tuning. However, I found that all 28,618 coarse-grained samples and 4,487 fine-grained samples were located in the MER2023/test3/ folder.

Q1) Did you only use the test3/ subset from MER2023, without using DFEW, contrary to what the paper states?

Q2) I'm also curious why the videos from MER2023/train/ were not used.

00dbgpdnjs avatar May 20 '25 01:05 00dbgpdnjs

A1) You can refer to our evaluation results on the DFEW test set, which are presented in two parts: zero-shot and fine-tuning.

  • The zero-shot results are obtained by testing the Emotion-LLaMA model—trained only on MER2023 samples—directly on the DFEW test set without any fine-tuning.
  • The fine-tuning results refer to the model being trained on the DFEW training set and then evaluated on the DFEW test set.

So yes, DFEW was indeed used as part of the multimodal instruction tuning and evaluation, as stated in the paper.

A2) In the early stages of our work, we used the MER2023 training set to train a baseline model. However, in later stages, we observed that the distribution of the pseudo-labeled data was highly consistent with that of the training set. To prevent redundancy and overfitting, we chose not to include the original MER2023 training set in the final tuning phase.

ZebangCheng avatar May 21 '25 02:05 ZebangCheng

For training in stage1 and stage2, did you use 9 emotion categories for both stages? Or did you use 9 categories only in stage2, and limit stage1 to 6 categories (happy, sad, neutral, angry, worried, surprise)?

self.emotion_instruction_pool = [
    "Please determine which emotion label in the video represents: happy, sad, neutral, angry, worried, surprise, fear, contempt, doubt.",
    "Identify the displayed emotion in the video: is it happy, sad, neutral, angry, worried, or surprise, fear, contempt, doubt?",
    "Determine the emotional state shown in the video, choosing from happy, sad, neutral, angry, worried, surprise, fear, contempt or doubt.",
    "Please ascertain the specific emotion portrayed in the video, whether it be happy, sad, neutral, angry, worried, surprise, fear, contempt or doubt.",
    "Assess and label the emotion evident in the video: could it be happy, sad, neutral, angry, worried, surprise, fear, contempt, doubt?"
]

00dbgpdnjs avatar May 22 '25 03:05 00dbgpdnjs

During both Stage 1 and Stage 2, we used prompts with all 9 emotion categories.

However, in Stage 2, the actual training samples only included 6 categories (happy, sad, neutral, angry, worried, surprise)—there were no samples labeled as fear, contempt, or doubt. Despite this, we still used the 9-class prompts.

This decision was based on our experiments: we compared performance using 6-class prompts vs. 9-class prompts during Stage 2, and found that the 9-class prompt consistently yielded better results, even though some categories were not present in the training data.

ZebangCheng avatar May 22 '25 07:05 ZebangCheng

As far as I could tell, the stage 1 data (MERR_coarse_grained.txt) only included 6 emotions. Am I mistaken?

The GitHub code below was confusing because the five recognition prompts used during training included only six emotions. https://github.com/ZebangCheng/Emotion-LLaMA/blob/main/minigpt4/datasets/datasets/first_face.py#L43

Q2) In the current issue (#71), it is mentioned that both recognition and reasoning tasks were trained simultaneously in stages 1 and 2.

stage 1 :

self.task_pool = [
   "emotion",
   "reason",
]

stage 2 :

self.task_pool = [
   "emotion",
   "reason_v2",
]

However, in stage 2, according to this documentation, only reasoning is trained.

self.task_pool = [
    "reason_v2",
]

Which approach is correct if I want to reproduce Emotion_LLaMA.pth?

Q3) Why is the following code necessary during testing? (caption = "") Isn't it enough to specify only 'emotion' in the task?

https://github.com/ZebangCheng/Emotion-LLaMA/blob/main/minigpt4/datasets/datasets/first_face.py#L145

Q4) Is the F1 Score: 0.9035772132511264 based on the evaluation of relative_test3_NCEV.txt using the model at Emotion-LLaMA/checkpoints/save_checkpoint/stage2/checkpoint_best.pth with only "Please determine which emotion label in the video represents: happy, sad, neutral, angry, worried, surprise." prompt?

00dbgpdnjs avatar May 22 '25 08:05 00dbgpdnjs