😎 Awesome lists about Speech Emotion Recognition

Awesome Speech Emotion Recognition

Awesome GNU GPLv3 Maintenance GitHub last commit Visitors


  • What's New
  • Reviews
  • Databases
  • Developing
  • Training
  • Publishing
  • Learning
  • Maybe Useful

What's New

Year | Month | Database | Title | Topics
2024 | February | EURASIP Journal on Audio, Speech, and Music Processing | Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources | A systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning
2024 | February | Information Fusion | Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations | A systematic review of emotion recognition from different input signals (e.g., physical, physiological)


Reviews

Year | Database | Title | Topics
2023 | Neurocomputing | An ongoing review of speech emotion recognition | A comprehensive review of the most popular datasets and of current machine learning and neural network models for SER
2023 | Information Fusion | Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions | A review of multimodal fusion architectures
2022 | Neural Computing and Applications | Human emotion recognition from EEG-based brain-computer interface using machine learning: a comprehensive review | Human emotion recognition using EEG-based brain signals and machine learning
2022 | Wireless Personal Communications | Survey of Deep Learning Paradigms for Speech Processing | Machine learning techniques for speech processing
2022 | Electronics | Bringing Emotion Recognition Out of the Lab into Real Life: Recent Advances in Sensors and Machine Learning | A review of progress in sensors and in machine learning methods and techniques
2021 | IEEE Access | A Comprehensive Review of Speech Emotion Recognition Systems | SER systems' varied design components/methodologies, databases
2021 | Electronics | A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism | Extensive comparison of deep learning architectures, mainly on the IEMOCAP benchmark database
2021 | Digital Signal Processing | A survey of speech emotion recognition in natural environment | A comprehensive survey of SER in the natural environment, covering its various issues, databases, feature extraction, and models
2021 | Archives of Computational Methods in Engineering | Survey on Machine Learning in Speech Emotion Recognition and Vision Systems Using a Recurrent Neural Network (RNN) | A survey of deep learning algorithms in speech and vision applications and their restrictions
2021 | Applied Sciences | Deep Multimodal Emotion Recognition on Human Speech: A Review | An extensive review of the state of the art in multimodal speech emotion recognition methodologies


Russell's circumplex model of affect [1] posits that every emotion can be represented as a point in a two-dimensional space. One dimension is valence (pleasantness vs. unpleasantness), the positive or negative degree of the emotion; the other is arousal (activation vs. deactivation), the intensity of the emotion. Most categorical emotions used in SER databases are based on this model.

A graphical representation of the circumplex model of affect with the horizontal axis representing the valence or pleasant dimension and the vertical axis representing the arousal or activation dimension [2].

[1] Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.
[2] Valenza, G., Citi, L., Lanatá, A., et al. (2014). Revealing Real-Time Emotional Responses: a Personalized Assessment based on Heartbeat Dynamics. Scientific Reports, 4, 4998.
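
The valence/arousal description of the circumplex model can be sketched in a few lines of Python. This is only an illustration and not part of the cited work: the function name `circumplex_quadrant`, the [-1, 1] range for both axes, and the quadrant labels are assumptions made for the example.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Name the circumplex quadrant of a (valence, arousal) point.

    Both axes are assumed to be scaled to [-1, 1]; zero counts as the
    positive side of each axis.
    """
    if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
        raise ValueError("valence and arousal are assumed to lie in [-1, 1]")
    pleasant = "pleasant" if valence >= 0 else "unpleasant"
    activated = "activated" if arousal >= 0 else "deactivated"
    return f"{pleasant}-{activated}"


if __name__ == "__main__":
    # Hypothetical coordinates, chosen only to exercise each quadrant.
    for v, a in [(0.7, 0.6), (-0.7, 0.6), (-0.7, -0.6), (0.7, -0.6)]:
        print((v, a), "->", circumplex_quadrant(v, a))
```

Datasets with continuous valence/arousal annotations could be bucketed this way to compare them against categorical labels, although where a given emotion sits in the plane depends on the annotation scheme of each dataset.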

Datasets for Emotion Recognition

Dataset | Lang | Size | Type | Emotions | Modalities | Resolution
AffectNet | N/A | ~450,000 subjects | Natural | Continuous valence/arousal values and categorical emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, surprise | Visual | 425x425
Belfast Naturalistic Database | Spanish | 298 emotional clips from 127 multi-cultural speakers | Natural | Amusement, anger, disgust, fear, frustration, sadness, surprise | Audio |
Berlin Database of Emotional Speech (Emo-DB) | German | 5 male and 5 female speakers, with more than 500 utterances | Acted | Anger, boredom, disgust, fear/anxiety, happiness, neutral, sadness | Audio | Audio: 48kHz, downsampled to 16kHz
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) | English | 1000 gender-balanced YouTube speakers, 23500 sentences | Natural | Sentiment: negative, weakly negative, neutral, weakly positive, positive; Emotions: anger, disgust, fear, happiness, sadness, surprise | Audio |
| French (Canadian) | 12 actors (6 males and 6 females), 6 sentences | Acted | Anger, disgust, fear, happiness, neutral, sadness, surprise, in two different intensities | Audio | Audio: 192kHz and 48kHz
| English | 91 actors (48 males and 43 females), 12 sentences | Acted | Anger, disgust, fear, happy, neutral, sad; emotional intensity | Audio | Audio: 16kHz
| English (North American), Chinese (Mandarin) | 10 English (5 males, 5 females) and 10 Chinese (5 males, 5 females) speakers, 700 utterances | Acted | Anger, happiness, neutral, sadness, surprise | Audio | Audio: 16kHz; Formats: .wav
Interactive Emotional Motion Capture (USC-IEMOCAP) | English | A 12h multimodal and multispeaker (5 males and 5 females) database | Acted | Anger, frustration, happiness, neutral, sadness, as well as dimensional labels such as valence, activation and dominance | Audio | Audio: 48kHz; Video: 120 fps
MELD: Multimodal EmotionLines Dataset | English | More than 13000 utterances from multiple speakers | Natural | Anger, disgust, fear, joy, neutral, non-neutral, sadness, surprise | Audio | Audio: 16bit PCM; Formats: .wav
OMG-Emotion | English | 10 hours of YouTube videos around 1min long | Natural | Continuous valence/arousal values and categorical emotions: anger, disgust, fear, happiness, neutral, sadness, surprise | Audio |
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) | English | A database of emotional speech and song of 12 males and 12 females | Acted | Anger, disgust, calmness, fear, happiness, neutral, sadness, surprise | Audio | Audio: 48kHz - 16bit; Video: 720p; Formats: .wav, .mp4
SEMAINE | English | 95 sessions of human-agent interactions | Natural | 4D emotional space | Audio |
Surrey Audio-Visual Expressed Emotion (SAVEE) | English | 4 male speakers x 480 utterances | Acted | Anger, disgust, fear, happiness, neutral, sadness, surprise | Audio | Audio: 44.1kHz - mono - 16bit; Video: 256p - 60fps; Formats: .wav, .avi
SUSAS | English | Speech under stress corpus with more than 16000 utterances from 32 speakers (13 females, 19 males) | Acted | Ten stress styles such as speaking style, single tracking task, and Lombard effect domain | Audio | 8kHz, 8bit PCM
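
Because the corpora above ship audio at different sample rates and bit depths (for example 8kHz, 16kHz, 44.1kHz, or 48kHz), it can help to check a file's format before feeding it to a model. The following is a minimal sketch using only Python's standard `wave` module; the names `WavFormat` and `read_wav_format` are hypothetical helpers, and the sketch assumes uncompressed PCM `.wav` files.

```python
import wave
from typing import NamedTuple


class WavFormat(NamedTuple):
    sample_rate_hz: int
    bit_depth: int
    channels: int
    duration_s: float


def read_wav_format(path: str) -> WavFormat:
    """Read the format fields from the header of a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return WavFormat(
            sample_rate_hz=rate,
            bit_depth=8 * wav.getsampwidth(),  # sample width is in bytes
            channels=wav.getnchannels(),
            duration_s=frames / rate,
        )
```

A typical use would be to scan a dataset folder and flag files whose `sample_rate_hz` differs from the rate the model expects, so they can be resampled with one of the audio libraries listed in this document.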

Datasets for Sound Classification

Database | Year | Type | Resolution
| 2020 | More than 210k videos for 310 audio classes | N/A, 10sec long
| 2017 | 2.1 million sound clips from YouTube videos, 632 audio event classes | N/A, 10sec long
| 2014 | Urban sound excerpts | Sampling rate may vary from file to file, duration<=4sec
| 2000 | Environmental audio recordings | 44.1kHz, mono, 5sec long



  • C++
    • Essentia |   GitHub   | A C++ library for audio and music analysis, description, and synthesis, including Python bindings
    • openSMILE |   GitHub   | An open-source toolkit for audio feature extraction and classification of speech and music signals, including a C API with Python and C# bindings
    • Audio Toolbox | Provides tools for audio processing, speech analysis, and acoustic measurement
    • Covarep |   GitHub   | A Cooperative Voice Analysis Repository for Speech Technologies
  • Python Libraries
    • Aubio |   GitHub   | Free, open source library for audio and music analysis
    • Librosa |   GitHub   | A Python package for music and audio analysis
    • OpenSoundscape |   GitHub   | A Python utility library for analyzing bioacoustic data
    • Parselmouth |   GitHub   | A Pythonic interface to the Praat software
    • PyAudioAnalysis |   GitHub   | A Python library that provides a wide range of audio-related functionalities focusing on feature extraction, classification, segmentation, and visualization issues
    • Pydub |   GitHub   | Manipulate audio with a simple and easy high-level interface
    • SoundFile |   GitHub   | A Python library for audio I/O


  • Audacity |   GitHub   | Free, open source, cross-platform audio software
  • AudioGPT |   GitHub   | Solve AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease (paper)
  • ESPnet |   GitHub   | An end-to-end speech processing toolkit covering speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, and spoken language understanding
  • Kaldi |   GitHub   | Kaldi is an automatic speech recognition toolkit
  • S3PRL |   GitHub   | A toolkit for self-supervised learning in speech processing. It supports three major features: i) pre-training, ii) a collection of pre-trained (upstream) models, and iii) downstream evaluation
  • SpeechBrain |   GitHub   | A PyTorch speech and all-in-one conversational AI toolkit



Audio/Speech Data Augmentation
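
One common augmentation for speech data is mixing white noise into a clean signal at a chosen signal-to-noise ratio (SNR). The sketch below is an illustration in plain Python and is not taken from any tool in this list; the function name `add_noise_at_snr` and its parameters are assumptions made for the example, and the signal is assumed to be a list of float samples.

```python
import math
import random
from typing import List, Optional


def add_noise_at_snr(
    signal: List[float],
    snr_db: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return a copy of `signal` with white Gaussian noise at `snr_db`."""
    rng = rng or random.Random()
    if not signal:
        return []
    # Mean power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]
```

Passing a seeded `random.Random` keeps the augmentation reproducible across training runs; lower `snr_db` values give noisier copies.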



Name | Impact Factor | Review Method | First-decision
Frontiers in Computer Science | 1.039 | Peer-review | 13w
International Journal of Speech Technology | 1.803 | Peer-review | 61d
Machine Vision and Applications | 2.012 | Peer-review | 44d
Applied Acoustics | 2.639 | Peer-review | 7.5w
Applied Sciences | 2.679 | Peer-review | 17.7d
Multimedia Tools and Applications | 2.757 | Peer-review | 99d
IEEE Sensors | 3.301 | Peer-review | 60d
IEEE Access | 3.367 | Binary Peer-review | 4-6w
Computational Intelligence and Neuroscience | 3.633 | Peer-review | 40d
IEEE/ACM Transactions on Audio, Speech and Language Processing | 3.919 | Peer-review | N/A
Neurocomputing | 5.719 | Peer-review | 74d
IEEE Transactions on Affective Computing | 10.506 | Peer-review | N/A


Name | Date | Location | More
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) | June 2023 | Canada | GitHub
International Conference on Acoustics, Speech, & Signal Processing (ICASSP) | June 2023 | Greece | GitHub
International Speech Communication Association - Interspeech (ISCA) | August 2023 | Ireland | GitHub
European Signal Processing Conference (EUSIPCO) | September 2023 | Finland | -
Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) | September 2023 | Finland | GitHub
International Society for Music Information Retrieval Conference (ISMIR) | November 2023 | Italy | GitHub

Name | Date | Location | More
International Conference on Acoustics, Speech, & Signal Processing (ICASSP) | April 2024 | Seoul, Korea | GitHub
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) | June 2024 | Seattle, WA, USA | -
International Conference on Machine Learning (ICML) | July 2024 | Vienna, Austria | -
European Signal Processing Conference (EUSIPCO) | August 2024 | Lyon, France | -
International Speech Communication Association - Interspeech (ISCA) | September 2024 | Kos, Greece | -
Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) | October 2024 | Tokyo, Japan | GitHub
International Society for Music Information Retrieval Conference (ISMIR) | November 2024 | San Francisco, CA, USA | -
Conference and Workshop on Neural Information Processing Systems (NeurIPS) | December 2024 | Vancouver, Canada | -


Maybe Useful

Other Awesome Material


A picture is worth a thousand words! A 2-minute visual example of how we communicate emotions, how we perceive them, the role of subjectivity, and what effective listening is.

Are emotions consistent?

What about the effect of a changing context on our decisions and emotional wellness?