inferno
Releasing the VideoEmotionRecognition module
Hi 👋 I noticed that the video emotion recognition is still under construction. Is there any plan to release this module soon? Thanks!
The video emotion recognition module has been available since the release of EMOTE. I have not created a demo for it, since it was more of a means to an end to build EMOTE.
The training code is there: https://github.com/radekd91/inferno/blob/master/inferno_apps/VideoEmotionRecognition/training/train_emote_classifier.py
If you want to use it for something else, you just have to instantiate the provided model. These pieces of code should give you an idea of how it's used: https://github.com/radekd91/inferno/blob/master/inferno/layers/losses/VideoEmotionLoss.py#L79 https://github.com/radekd91/inferno/blob/master/inferno/layers/losses/VideoEmotionLoss.py#L39
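Roughly, using it boils down to loading the trained classifier and feeding it a sequence of per-frame emotion features. A minimal sketch of that idea (the class name, import path, checkpoint path, and feature dimension below are placeholders, not the exact inferno API; see the linked loss code for the real loading logic):

```python
import torch
# Placeholder import -- the actual class/module path in inferno may differ.
from inferno.models.video_emorec import VideoEmotionClassifier

# Checkpoint path is an example; the real one is fetched by download_assets.sh.
model = VideoEmotionClassifier.load_from_checkpoint("path/to/video_emotion_classifier.ckpt")
model.eval()

# A batch of per-frame emotion features, shape (batch, time, feature_dim).
# Here: 1 clip, 75 frames (3 s at 25 fps), 2048-d features (dimension assumed).
emotion_features = torch.randn(1, 75, 2048)

with torch.no_grad():
    prediction = model(emotion_features)  # per-sequence emotion class (and intensity) logits
```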
I will probably not have time to create a friendlier demo for this soon, but am willing to answer questions.
Many thanks for the reply and your assistance. I didn't notice that the pretrained model would be downloaded in another script.
https://github.com/radekd91/inferno/blob/53beb5d3d4a9b280f0f9076c59444707e595fbd6/inferno_apps/TalkingHead/download_assets.sh#L50
Closing this issue for now.
I'm sorry that I have to reopen this issue. I'm trying to annotate a dataset automatically, and I'm not sure whether I have the pipeline right. My understanding is:
- interpolate the video to 25 fps,
- crop the facial region by FAN,
- obtain the per-frame 2048-d EMOCA exp ResNet feature,
- feed into the transformer layer and predict the emotion category and intensity.
I guess I can take the demo from the EmotionRecognition app as a reference as well; a rough sketch of what I have in mind is below.
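In pseudocode, the annotation pipeline I'm imagining looks like this (all helper names and shapes are placeholders for illustration, not actual inferno functions):

```python
import torch

def annotate_video(video_path, fan_detector, emotion_resnet, video_emotion_net, device="cuda"):
    """Placeholder outline of the four steps above; the helper objects are assumed."""
    # 1. Decode the video and interpolate/resample it to 25 fps.
    frames = load_frames_at_fps(video_path, target_fps=25)              # placeholder helper

    # 2. Detect and crop the facial region in every frame with FAN.
    crops = [fan_detector.crop_face(frame) for frame in frames]         # placeholder API

    # 3. Extract the per-frame emotion feature with the EMOCA emotion ResNet.
    with torch.no_grad():
        feats = torch.stack([emotion_resnet(c.to(device)) for c in crops])  # (T, 2048) assumed

    # 4. Feed the whole sequence into the video emotion transformer.
    with torch.no_grad():
        pred = video_emotion_net(feats.unsqueeze(0))                     # (1, T, 2048) -> class / intensity
    return pred
```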
One more thing I am concerned about is the generalization ability of this model. I noticed that it achieves 90%+ accuracy on the MEAD test set; if the model generalizes well, it would be a suitable pre-trained model for automatically labeling a talking-face dataset.
You're right, the video emotion classifier takes a sequence of emotion features produced by the Emotion Recognition Network trained for EMOCA and the 4 steps you mention are accurate.
Regarding generalization to other data, this is a good question. The emotion ResNet itself was trained on AffectNet, an in-the-wild dataset. The video classifier, however, was only trained on features extracted from MEAD (and frontal videos only). I suspect the classifier will work decently well on videos where the face is looking into the camera and less well on others. However, I have not verified this.
Thanks again. I will verify it and let you know the result. :)