
Releasing the VideoEmotionRecognition module

HarryXD2018 opened this issue · 5 comments

Hi 👋 I noticed that the video emotion recognition is still under construction. Is there any plan to release this module soon? Thanks!

HarryXD2018 · Mar 22 '24

The video emotion recognition module has been available since the release of EMOTE. I have not created a demo for it, since it was more of a means to an end to build EMOTE.

The training code is there: https://github.com/radekd91/inferno/blob/master/inferno_apps/VideoEmotionRecognition/training/train_emote_classifier.py

If you want to use it for something else, you just have to instantiate the provided model. These pieces of code should give you an idea of how it's used: https://github.com/radekd91/inferno/blob/master/inferno/layers/losses/VideoEmotionLoss.py#L79 https://github.com/radekd91/inferno/blob/master/inferno/layers/losses/VideoEmotionLoss.py#L39
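To make the interface concrete, here is a minimal stand-in sketch of what such a classifier looks like: per-frame emotion features in, one clip-level prediction out. This is not inferno's actual class; the feature dimension, number of classes, layer counts, and the mean pooling are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Stand-in for a transformer-based video emotion classifier (NOT the real
# inferno architecture): per-frame emotion features in, clip logits out.
class VideoEmotionClassifierSketch(nn.Module):
    def __init__(self, feature_dim=2048, num_classes=8, num_layers=4, num_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feature_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # An intensity head would be analogous to this category head.
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, features):
        # features: (batch, time, feature_dim) per-frame emotion features
        encoded = self.encoder(features)
        # Pool over time, then classify the whole clip.
        return self.head(encoded.mean(dim=1))

# 2 seconds of hypothetical per-frame features at 25 fps:
features = torch.randn(1, 50, 2048)
logits = VideoEmotionClassifierSketch()(features)  # (1, num_classes)
```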

I probably won't have time to create a friendlier demo for this soon, but I am willing to answer questions.

radekd91 · Mar 24 '24

Many thanks for the reply and your assistance. I hadn't noticed that the pretrained model is downloaded by another script:

https://github.com/radekd91/inferno/blob/53beb5d3d4a9b280f0f9076c59444707e595fbd6/inferno_apps/TalkingHead/download_assets.sh#L50

Closing this issue for now.

HarryXD2018 · Mar 25 '24

I'm sorry that I have to reopen the issue. I'm trying to annotate a dataset automatically, and I'm not sure about the pipeline. My understanding (sketched in code after the list) is:

  1. interpolate the video to 25 fps,
  2. crop the facial region with FAN,
  3. obtain the per-frame 2048-d EMOCA exp ResNet feature,
  4. feed into the transformer layer and predict the emotion category and intensity.
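A rough sketch of that four-step pipeline is below. Every helper is a hypothetical placeholder (the names are mine, not inferno's); the real preprocessing lives in the TalkingHead/EmotionRecognition apps.

```python
import torch

def resample_to_25fps(video_path: str) -> torch.Tensor:
    """Placeholder: decode the video and temporally resample it to 25 fps,
    returning a (time, H, W, 3) uint8 tensor (e.g. via ffmpeg/torchvision)."""
    raise NotImplementedError

def crop_faces_fan(frames: torch.Tensor) -> torch.Tensor:
    """Placeholder: detect landmarks with FAN and crop/align the facial
    region per frame."""
    raise NotImplementedError

def emoca_emotion_features(crops: torch.Tensor) -> torch.Tensor:
    """Placeholder: run the EMOCA emotion ResNet per frame, returning a
    (time, feature_dim) tensor of emotion features."""
    raise NotImplementedError

def annotate_video(video_path: str, classifier: torch.nn.Module) -> torch.Tensor:
    frames = resample_to_25fps(video_path)       # step 1
    crops = crop_faces_fan(frames)               # step 2
    features = emoca_emotion_features(crops)     # step 3
    logits = classifier(features.unsqueeze(0))   # step 4: (1, T, D) -> logits
    return logits.argmax(dim=-1)                 # predicted emotion category
```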

I guess I can also take the EmotionRecognition demo as a reference.

One more thing I'm concerned about is the generalization of this model. I noticed that it achieves 90%+ accuracy on the MEAD test set; if the model generalizes well, it would be a suitable pre-trained model for automatically labeling a talking-face dataset.

HarryXD2018 · Mar 27 '24

You're right, the video emotion classifier takes a sequence of emotion features produced by the Emotion Recognition Network trained for EMOCA, and the four steps you mention are accurate.

Regarding generalization to other data, that is a good question. The emotion ResNet itself was trained on AffectNet, an in-the-wild dataset. The video classifier, however, was only trained on features extracted from MEAD (and frontal videos only). I suspect the classifier will work decently well on videos where the face looks into the camera and less well on others. However, I have not verified this.
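If it helps, a minimal sketch of such a check: run the classifier on a labeled non-MEAD set, split by head pose, and compare per-bucket accuracy. `annotate_video` is the hypothetical driver sketched above, and the pose buckets are assumptions.

```python
from collections import defaultdict

import torch

def accuracy_by_pose(samples, classifier: torch.nn.Module) -> dict:
    """samples: iterable of (video_path, true_label, pose_bucket), where
    pose_bucket is e.g. 'frontal' or 'profile'. Returns accuracy per bucket,
    so frontal vs. non-frontal generalization can be compared directly."""
    correct, total = defaultdict(int), defaultdict(int)
    for video_path, label, pose in samples:
        pred = annotate_video(video_path, classifier).item()
        correct[pose] += int(pred == label)
        total[pose] += 1
    return {pose: correct[pose] / total[pose] for pose in total}
```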

radekd91 · Mar 28 '24

Thanks again. I will verify it and let you know the result. :)

HarryXD2018 · Mar 28 '24