Emotion-LLaMA
Hello author, I want to ask whether the file "checkpoint_best.pth" is provided?
https://drive.google.com/file/d/1NoPZDj5_392zBtVK1IHO8bepA4910iI_/view?usp=sharing
You can download the checkpoint through the above link. If you encounter any issues, feel free to continue the discussion.
Thanks a lot, but I still have some questions: 1. When I run the code, I found a txt file, "/home/user/selected_face/face_emotion/relative_test3_NCEV.txt". How can I get it, and what is its data structure? 2. Is the EVA model the same as the "vis_processor", which takes 448×448 images as input? I also noticed "image = self.vis_processor(image)"; how should I prepare these images?
For both of the issues you mentioned regarding not finding the input images, you should download the dataset, preprocess the data, and then set the correct file paths. For video files, the simplest way is to extract a frame from the video as the input image.
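For example, here is a minimal sketch of extracting the first frame of each video as the input image; the directory paths and the 448x448 resize are my assumptions rather than the project's exact preprocessing, and vis_processor will still apply its own transforms.
import os
import cv2

# Assumed placeholder paths -- point these at your own data.
video_dir = "/home/user/project/data/videos"
output_dir = "/home/user/project/data/first_frames"
os.makedirs(output_dir, exist_ok=True)

for name in os.listdir(video_dir):
    if not name.endswith(".mp4"):
        continue
    cap = cv2.VideoCapture(os.path.join(video_dir, name))
    ok, frame = cap.read()  # grab the first frame of the video
    cap.release()
    if not ok:
        continue
    frame = cv2.resize(frame, (448, 448))  # the visual encoder takes 448x448 input
    cv2.imwrite(os.path.join(output_dir, name.replace(".mp4", ".jpg")), frame)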
Here, the relative_test3_NCEV dataset refers to the MER2023-SEMI dataset, which contains over 800 samples. The relative_test3_NCEV.txt file contains the sample names and annotation information for the dataset you want to test. The data format is as follows:
samplenew_00006611 125 sad -10
samplenew_00006666 205 angry -10
samplenew_00006851 121 sad -10
samplenew_00008033 160 sad -10
samplenew_00009034 237 sad -10
samplenew_00009099 213 happy -10
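For reference, a minimal sketch of parsing such a file, assuming the whitespace-separated name / frame-count / emotion / valence layout shown above:
def load_ncev(path):
    """Parse relative_test3_NCEV.txt lines like: samplenew_00006611 125 sad -10"""
    samples = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue
            name, count, emotion, valence = parts
            samples.append({
                "name": name,               # sample name, e.g. samplenew_00006611
                "frame_count": int(count),  # number of video frames, e.g. 125
                "emotion": emotion,         # discrete emotion label, e.g. sad
                "valence": float(valence),  # valence label, e.g. -10
            })
    return samples

samples = load_ncev("/home/user/project/data/relative_test3_NCEV.txt")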
If you just want to quickly reproduce the results, I can organize the features extracted from the MER2023-SEMI dataset and share them with you.
If you want to fine-tune on another dataset, you'll need to extract the audio and visual features yourself and then set the correct paths to run the code. For more information on how to extract features from videos, you can refer to the following projects: MER2023, MERTools
Wow, I am so glad to hear that you can share the features extracted from the MER2023-SEMI dataset with me. I really want to quickly reproduce the results; I guess they must be great. Thanks a lot again!!
Hello? Are the extracted features ready?
Please download the extracted feature files from Google Drive and place them in the same folder (e.g., the folder where you store your data: /home/user/project/data/).
https://drive.google.com/file/d/1DJJ8wP3g4yLT0ZFZ_-H4izJHAGJx2AJ_/view?usp=sharing
https://drive.google.com/file/d/1YyoWabWtAJuFI6ylMM220i_kh_0Nagv9/view?usp=sharing
Please follow the instructions below to modify the corresponding directory paths:
Specify the path to img_path in the eval_emotion.yaml:
img_path: /home/user/project/data/first_frames
Specify the path to eval_file_path in the eval_emotion.yaml:
eval_file_path: /home/user/project/data/relative_test3_NCEV.txt
If you plan to do further research on this dataset, please apply to download it from the official MER2023 webpage.
Our team is currently preparing for the MER2024 competition. We aim to achieve good results on new tasks and datasets, proving the strong capabilities and robustness of our proposed Emotion-LLaMA. In the near future, our focus will be on MER2024, and we will continue to update our progress. If you are interested in multimodal emotion recognition and understanding, you might want to follow this competition.
Thanks for your sharing and comments, but maybe I also need this csv file. Can you share it with me?
Oh, by the way, I noticed that you mentioned “samplenew_00006611 125 sad -10”, so I want to ask what “125” stands for. I know sad is an emotion label, and -10 is a label too.
As indicated by the file name relative_test3_NCEV, N stands for the name of the video sample (samplenew_00006611), C represents the count of video frames (125; this value is used elsewhere for hierarchical frame extraction. For example, when extracting VideoMAE features, this project uses 16 images as input: the 125 video frames are sequentially divided into eight parts, and two frames are randomly selected from each part), and E and V represent the emotion labels (sad and -10).
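To make the hierarchical frame extraction concrete, here is a rough sketch of the idea (my own illustration, not the project's exact code): split the frames into 8 sequential parts and randomly pick 2 frames from each part, giving the 16 frames fed to VideoMAE.
import random

def hierarchical_sample(frame_count, num_parts=8, per_part=2):
    # Divide frame indices [0, frame_count) into num_parts sequential chunks
    # and randomly select per_part frames from each chunk (8 x 2 = 16 frames).
    indices = []
    part_len = frame_count / num_parts
    for i in range(num_parts):
        start, end = int(i * part_len), int((i + 1) * part_len)
        indices.extend(sorted(random.sample(range(start, end), per_part)))
    return indices

print(hierarchical_sample(125))  # e.g. for samplenew_00006611 with 125 frames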
Here is the csv file: https://drive.google.com/file/d/19Pc2YiPnIhcePISUrI313hUYEzXJH41o/view?usp=sharing
Thank you, author! So, although you used the chinese-hubert-large pretrained model for audio feature extraction, you still used English transcriptions for the text input, right? (Because I noticed that most of the MER2023 dataset is in Chinese.)
Yes, because the audio in the video is in Chinese, we use the chinese-HuBERT-large model to extract the most relevant emotional features from the Chinese audio. However, the large model base (LLaMA-2) only supports English, so we use English text subtitles for emotion recognition and reasoning.
OK, I successfully reproduced this project. However, I want to ask about some details of the audio feature extraction (chinese-hubert-large): are there any changes in your code compared with the official code provided by MER2023? (Because when I replaced the audio features with ones I extracted myself using the MER2023 tools, the accuracy only reached 0.41.)
Our audio features were extracted exactly according to the code provided on the MER2023 website. Please ensure you are using the chinese-hubert-large model, in UTT mode as specified in the official code, and not the English version of the HuBERT model.
Notably, you should use MER2023/feature_extraction/audio/extract_transformers_embedding.py and not /feature_extraction/audio/extract_transformers_embedding.py. We used the former (layer_ids = [-1]), whereas the latter was modified this year (layer_ids = [-4, -3, -2, -1]). If you have downloaded the MER2023 project, I recommend running its test code with your extracted HuBERT features on the MER2023 dataset. Under normal circumstances, your MER2023-SEMI score should be around 0.85 (both the official baseline and our reproduction results are similar).
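For reference, a rough sketch of utterance-level (UTT) feature extraction with chinese-hubert-large via transformers, averaging only the last hidden layer (layer_ids = [-1]); this is my own simplification of the MER2023 script, and the Hugging Face model id is an assumption -- use whatever checkpoint the official code specifies.
import torch
import torchaudio
from transformers import HubertModel, Wav2Vec2FeatureExtractor

model_id = "TencentGameMate/chinese-hubert-large"  # assumed checkpoint id
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
model = HubertModel.from_pretrained(model_id).eval()

def extract_utt_feature(wav_path):
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)  # mono, 16 kHz
    inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # last layer only
    return hidden.mean(dim=1).squeeze(0)  # average over time -> one utterance-level vector

feat = extract_utt_feature("samplenew_00006611.wav")  # ~1024-dim for the large model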
Oops, it's my fault. I used the MER2024 tools for audio extraction. Now I am going to extract the audio features again using the MER2023 tools. Thanks for your response.
Thanks for your response again! However, in your instructions I can only find evaluation instructions, without fine-tuning instructions. So, if I want to fine-tune your model, what should I do? (Assuming I have prepared the audio, video, features, images, and transcriptions using the MER2023 tools.)
Sorry for the delay in responding.
If you have successfully reproduced the results, you can proceed with training, but some detailed steps are involved. First, you need to prepare the audio and visual features (we didn't load all Encoders for end-to-end training to save GPU memory). Then, modify the corresponding file paths in /minigpt4/datasets/datasets and /train_configs. A tip is to search the entire project files for the keywords "FeatureFaceDataset" and "feature_face_caption" to find all the files you need to modify. You can also search for the keywords "VideoHere" and "FeatureHere" to see how to insert multimodal features into the instruction code, which will help you make modifications more flexibly.
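If it helps, here is a small helper (my own, not part of the repo) to list which files mention those keywords:
import pathlib

keywords = ["FeatureFaceDataset", "feature_face_caption", "VideoHere", "FeatureHere"]
for pattern in ("*.py", "*.yaml"):
    for path in pathlib.Path(".").rglob(pattern):  # run from the project root
        hits = [k for k in keywords if k in path.read_text(errors="ignore")]
        if hits:
            print(path, "->", ", ".join(hits))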
Finally, execute the training command:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node 4 train.py --cfg-path train_configs/minigptv2_finetune_featureface.yaml
So much thanks. It's also useful for me. I can't wait to try running it and see the results.
Thanks for your detailed explanation. When I try to reproduce the result, I get a CUDA out-of-memory error. It seems the checkpoint is loaded on the first GPU. However, I only have 4 GPUs with 16GB each. Could you tell me how to modify the code for multi-GPU loading?
Sorry, we haven't experimented with multi-GPU inference for large models. During inference, we typically use a single GPU, which requires around 30GB of memory. Your issue might be due to insufficient memory on a single GPU.
okay, thanks.
Hello, I would like to ask where I can get the two files AU_filter_merge.json and 0512_target_smp_end.json. I can't seem to find them. Could you please share them? Thank you.
Apologies for the confusion. To ensure clarity in the filenames, I have renamed the two files to MERR_coarse_grained.json and MERR_fine_grained.json. Below are the download links:
https://drive.google.com/drive/folders/1LSYMq2G-TaLof5xppyXcIuWiSN0ODwqG?usp=sharing
Recently, we have been organizing our code and data, and we will be releasing the updated data as open source. If you have any questions, feel free to reach out.
Okay, thank you so much, good luck
Hello, when running the code, I cannot find the file '/home/user/selected_face/face_emotion/transcription_en_all.csv'. Can you provide this file?
Hello author, I am trying to perform a zero-shot test of Emotion-LLaMA on a new dataset for the emotion recognition task. I have extracted audio features (using HuBERT-large) with the script in MERBench, and temporal features with videomae-large. I am unsure how to extract the static facial expression features (as mentioned in the paper: a ViT-structured model pre-trained with the MAE scheme [82] extracts static facial expression features) in order to achieve the results reported in the paper. Could you provide some advice?
The MAE and VideoMAE models we used are not the original pre-trained models; instead, they are model parameters that our team pre-trained using unsupervised mask recovery on the MER2023-SEMI dataset. Apologies for the inconvenience, as this part was not included in our open-source plan. We will organize and upload the model parameters and code by November 20th.
- If you wish to replicate a zero-shot test of Emotion-LLaMA on a new dataset, you can refer to the demo and set the corresponding features for MAE and VideoMAE as zero vectors (see the sketch after this list).
- If you are interested in fine-tuning Emotion-LLaMA, you can use the native MAE or CLIP model as the Local Encoder to extract features.
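To illustrate the zero-shot option above, a minimal sketch of the zero-vector placeholders (the 1024-dim sizes are assumptions; match whatever shapes the demo's feature loader expects):
import torch

# Placeholder features for a zero-shot run; the dimensions are assumptions.
mae_feature = torch.zeros(1024)       # static facial (MAE) feature placeholder
videomae_feature = torch.zeros(1024)  # temporal (VideoMAE) feature placeholder
# Audio (HuBERT) features are still extracted normally and fed in alongside these.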